Add Nearest Neighbor Distance Ratio and Epsilon Identifiability Privacy Metrics #42

emersodb · 2025-09-23T14:28:25Z

PR Type

Feature

Short Description

Clickup Ticket(s): Link(s) if applicable.

This PR adds the last two metrics from the effort to integrate SynthEval metrics into the library: the Nearest Neighbor Distance Ratio (NNDR) and Epsilon Identifiability Risk (EIR) privacy metrics.

NNDR considers, for each point in the synthetic dataset, the ratio between its distances to the closest and next-closest points in the real dataset, averaged over all synthetic points (closer to 1 is better).
EIR computes the percentage of points in a real dataset that are closer to a synthetic data point than to another real data point in the same dataset (closer to 0 is better).

NOTE: The SynthEval implementation is flawed. Rather than computing the NNDR from synthetic data points to real data points, it computes the NNDR from real data points to synthetic data points. That direction is not what you want in order to measure privacy (see https://arxiv.org/pdf/2501.03941). It is, however, the correct computation if you want to do membership inference (i.e., what Sara and Fatemeh are working on).
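
As a rough illustration of the two definitions above, here is a minimal, dependency-free sketch using Euclidean distance. The function names `nndr` and `eir` are hypothetical, the handling of zero distances is an assumption, and this is not the toolkit's implementation (which computes distances on tensors in batches). NNDR is computed from synthetic points to real points, the direction argued for in the note above.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nndr(synthetic: List[Point], real: List[Point]) -> float:
    """Mean ratio of nearest to second-nearest real distance, per synthetic point."""
    if len(real) < 2:
        raise ValueError("Need at least two real points to form a ratio.")
    ratios = []
    for s in synthetic:
        # Two smallest distances from this synthetic point to the real data.
        d1, d2 = sorted(math.dist(s, r) for r in real)[:2]
        # Assumption: if both distances are zero, treat the ratio as 1.
        ratios.append(d1 / d2 if d2 > 0 else 1.0)
    return sum(ratios) / len(ratios)


def eir(real: List[Point], synthetic: List[Point]) -> float:
    """Fraction of real points closer to a synthetic point than to any other real point."""
    at_risk = 0
    for i, r in enumerate(real):
        nearest_real = min(math.dist(r, other) for j, other in enumerate(real) if j != i)
        nearest_synthetic = min(math.dist(r, s) for s in synthetic)
        if nearest_synthetic < nearest_real:
            at_risk += 1
    return at_risk / len(real)


real_points = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
synthetic_points = [(1.0, 0.0), (0.0, 10.0)]
nndr_value = nndr(synthetic_points, real_points)
eir_value = eir(real_points, synthetic_points)
```

Swapping the arguments of `nndr` gives the real-to-synthetic variant, which is the direction relevant to membership inference rather than privacy measurement.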

Tests Added

Added a fair number of tests to verify the computations.

emersodb · 2025-09-23T14:30:01Z

src/midst_toolkit/evaluation/privacy/distance_closest_record.py

 DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


-@overload


Moving this to its own module, as it is useful for computing NNDR as well.

emersodb · 2025-09-23T14:34:54Z

src/midst_toolkit/evaluation/privacy/distance_utils.py

        ``target_data.``
    """
    if batch_size is None:
        # If batch size isn't specified, do it all at once.


Small bug (didn't make code wrong, just bad batch size estimate).

emersodb · 2025-09-23T14:48:11Z

src/midst_toolkit/evaluation/utils.py

-        numerical_column_idx = numerical_column_idx + target_col_idx
-    else:
-        categorical_column_idx = categorical_column_idx + target_col_idx
+    if "target_col_idx" in meta_info:


Loosening this preprocessing to admit settings where no target column exists.
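
A minimal sketch of the loosened logic, assuming `meta_info` is a plain dictionary. The helper name `get_column_indices` and the keys `num_col_idx`, `cat_col_idx`, and `task_type` are assumptions made for illustration; only the `"target_col_idx" in meta_info` check and the variable names come from the diff above.

```python
from typing import Any, Dict, List, Tuple


def get_column_indices(meta_info: Dict[str, Any]) -> Tuple[List[int], List[int]]:
    """Return (numerical, categorical) column indices, with or without a target column."""
    numerical_column_idx = list(meta_info.get("num_col_idx", []))
    categorical_column_idx = list(meta_info.get("cat_col_idx", []))
    # Only fold in the target column when the metadata actually defines one.
    if "target_col_idx" in meta_info:
        target_col_idx = list(meta_info["target_col_idx"])
        # Assumption: regression targets are numerical, everything else categorical.
        if meta_info.get("task_type") == "regression":
            numerical_column_idx = numerical_column_idx + target_col_idx
        else:
            categorical_column_idx = categorical_column_idx + target_col_idx
    return numerical_column_idx, categorical_column_idx
```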

emersodb · 2025-09-24T16:19:31Z

tests/unit/data_processing/test_utils.py

 def test_get_categorical_columns() -> None:
    # Low threshold
    categorical_columns = get_categorical_columns(TEST_DATAFRAME, 2)
+    # Note that this does not include the date time column, as it isn't a categorical, as the detection algorithm


The get_categorical_columns functionality here is a convenience function, and it's unclear how a user would want to treat datetime objects, so we're sort of punting here. Ideally a user would have done "something" to make their datetime objects numerical or categorical (datetime objects can generally exist anywhere on the spectrum between the two).

bzamanlooy

I just have a few comments that might help improve the readability of the documentation. The code itself looks good to me :)

src/midst_toolkit/evaluation/privacy/distance_utils.py

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py

lotif

Approved with small comments.

lotif · 2025-09-29T19:26:16Z

src/midst_toolkit/evaluation/privacy/distance_closest_record.py

-        some way. This can be done via the ``preprocess`` function beforehand or it can be done within compute if
-        ``do_preprocess`` is True and ``meta_info`` has been provided.
+        some way. This can be done via the ``preprocess`` function in ``distance_preprocess.py`` beforehand or it can
+        be done within compute if ``do_preprocess`` is True and ``meta_info`` has been provided.


compute is a function, right? It should be wrapped in "``" if so.

lotif · 2025-09-29T19:28:39Z

src/midst_toolkit/evaluation/privacy/distance_preprocess.py

+
+
+@overload
+def preprocess(


Can those functions have better names? preprocess is super generic; maybe something like preprocess_for_distance_calculation?

Yeah that's fair. Will change

lotif · 2025-09-29T19:42:21Z

src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py

+        holdout set represents real data that was NOT.
+
+        NOTE: Columns are not uniformly weighted. They are weighted by their inverse column entropy to provide
+        greater attention to rare data points. This is formally defined in


Looks like it's missing something here... maybe a : or This is formally defined by the paper below:.

lotif · 2025-09-29T19:45:32Z

src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py

+            filtered_real_data = real_data[self.numerical_columns]
+            filtered_synthetic_data = synthetic_data[self.numerical_columns]
+            filtered_holdout_data = holdout_data[self.numerical_columns] if holdout_data is not None else None
+        else:


This is a bit dangerous because if we ever add another EpsilonIdentifiabilityNorm element it will silently fall in here if we forget to modify this. Changing this to elif self.norm == EpsilonIdentifiabilityNorm.GOWER will be better.

Good point. Changing now.
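
A small sketch of the explicit branching suggested above, so that a newly added norm fails loudly instead of silently falling into the last branch. Only `EpsilonIdentifiabilityNorm.GOWER` appears in the review; the `L2` member, the helper name `columns_for_norm`, and what each branch returns are assumptions made for illustration.

```python
from enum import Enum
from typing import List


class EpsilonIdentifiabilityNorm(Enum):
    L2 = "l2"  # Assumed member name; not shown in the review.
    GOWER = "gower"


def columns_for_norm(
    norm: EpsilonIdentifiabilityNorm, all_columns: List[str], numerical_columns: List[str]
) -> List[str]:
    """Pick the columns used for distance computation under the given norm."""
    if norm == EpsilonIdentifiabilityNorm.L2:
        # Assumption: this norm only makes sense on numerical columns.
        return list(numerical_columns)
    elif norm == EpsilonIdentifiabilityNorm.GOWER:
        # Assumption: this norm keeps every column.
        return list(all_columns)
    else:
        # Any norm added later must be handled explicitly above.
        raise ValueError(f"Unsupported norm: {norm}")
```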

lotif · 2025-09-29T19:54:52Z

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

+from midst_toolkit.evaluation.privacy.distance_utils import NormType, compute_top_k_distances
+
+
+DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


Can we not have this as a module-level variable? This variable is set every time this is imported by other modules, and it's also set by other modules under the same name. Can we have this variable be defined in a utils module, or maybe be returned by a function, and then be reused here?

Yeah I think that's worthwhile. Will add a function I think.
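
One possible shape for such a function, as a sketch only. The name `get_default_device` and the use of `lru_cache` are assumptions; the actual helper added to the toolkit may look different.

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=1)
def get_default_device() -> torch.device:
    """Return the CUDA device if CUDA is available, otherwise the CPU device."""
    # Evaluated on first call rather than at import time, then cached.
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
```

Callers would then default a `device` argument to `None` and resolve it with `get_default_device()` inside the function body, rather than binding a module-level `DEVICE` constant in each module.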

lotif · 2025-09-29T19:57:59Z

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

+            norm: Determines what norm the distances are computed in. Defaults to NormType.L2.
+            batch_size: Batch size used to compute the NNDR iteratively. Just needed to manage memory. Defaults to
+                1000.
+            device: What device the tensors should be sent to in order to perform the calculations. Defaults to DEVICE.


I'd add some more context here: Defaults to "cuda" if CUDA is available, "cpu" otherwise.

lotif · 2025-09-29T19:59:20Z

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

+                If None, then no preprocessing is expected to be done. Defaults to None.
+            do_preprocess: Whether or not to preprocess the dataframes before performing the NNDR calculations.
+                Preprocessing is performed with the ``preprocess`` function of ``distance_preprocess.py``. Defaults to
+                False.


We should mention here that meta_info should be provided if this is set to True.
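
Beyond documenting it, the requirement could also be enforced with a small guard. This is a hypothetical sketch: the helper name `validate_preprocess_arguments` and the error message are assumptions, not code from this PR.

```python
from typing import Any, Dict, Optional


def validate_preprocess_arguments(do_preprocess: bool, meta_info: Optional[Dict[str, Any]]) -> None:
    """Raise early if preprocessing is requested without the metadata it needs."""
    if do_preprocess and meta_info is None:
        raise ValueError("meta_info must be provided when do_preprocess is True.")
```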

emersodb added 3 commits September 22, 2025 13:07

NNDR module and tests

81a40ba

Fixing small bug

c5a0682

Adding in the epsilon identifiability risk metric

704e48f

emersodb requested review from amrit110, lotif, fatemetkl, sarakodeiri, masi-sh and bzamanlooy September 23, 2025 14:28

emersodb changed the base branch from main to dbe/add_f1_dff_hitting_rate September 23, 2025 14:28

emersodb commented Sep 23, 2025

Small code fixes and documentation improvements

884c582

emersodb marked this pull request as ready for review September 23, 2025 14:56

emersodb added 3 commits September 24, 2025 11:26

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

b985246

Some small updates

1b51d23

Adding in a bit more revealing testing for the categorical column detection

810f3d6

emersodb commented Sep 24, 2025

emersodb added 3 commits September 26, 2025 10:24

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

0147eb1

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

1aa4e5a

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

789dd7b

bzamanlooy reviewed Sep 29, 2025

Addressing some PR comments.

500ce2b

lotif approved these changes Sep 29, 2025

PR Comment changes

231265c
