Skip to content

Conversation

emersodb
Copy link
Collaborator

PR Type

Feature

Short Description

Clickup Ticket(s): Link(s) if applicable.

This PR integrates the last two metrics associated with the integration of SynthEval metrics into the library. These are the Nearest Neighbor Distance Ratio and Epsilon Identifiability privacy metrics.

  • NNDR considers the average ratio between the closest and next closet point in a real data set for each point in the synthetic dataset. (closer to 1 is better).
  • EIR computes the percentage of points in a real dataset that are closer to a synthetic data point than another real data point in the same dataset. (closer to 0 is better).

NOTE: The SynthEval implementation is flawed. Rather than computing the NNDR for synthetic data points to real data points, it computes the NNDR of real data points to synthetic datapoints. This is not what you want to do in order to measure privacy (see https://arxiv.org/pdf/2501.03941). The second computation is correct if you want to do membership inference (i.e. as Sara and Fatemeh are working on).

Tests Added

Added a fair number of tests to verify the computations.

@emersodb emersodb changed the base branch from main to dbe/add_f1_dff_hitting_rate September 23, 2025 14:28
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


@overload
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this to it's own module, as it is useful for computing NNDR as well.

``target_data.``
"""
if batch_size is None:
# If batch size isn't specified, do it all at once.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small bug (didn't make code wrong, just bad batch size estimate).

numerical_column_idx = numerical_column_idx + target_col_idx
else:
categorical_column_idx = categorical_column_idx + target_col_idx
if "target_col_idx" in meta_info:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loosening this preprocessing to admit settings where no target column exists.

@emersodb emersodb marked this pull request as ready for review September 23, 2025 14:56
def test_get_categorical_columns() -> None:
# Low threshold
categorical_columns = get_categorical_columns(TEST_DATAFRAME, 2)
# Note that this does not include the date time column, as it isn't a categorical, as the detection algorithm
Copy link
Collaborator Author

@emersodb emersodb Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_categorical columns functionality here is a convenience function and it's unclear how a user would want to treat data time objects. So we're sort of punting here. Ideally a user would have done "something" to make their datatime objects numerical or categorical (where date time objects can general exists on a spectrum of these).

Copy link
Collaborator

@bzamanlooy bzamanlooy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a few comments that might help improve the readability of the documentation. The code itself looks good to me :)

Copy link
Collaborator

@lotif lotif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with small comments.

some way. This can be done via the ``preprocess`` function beforehand or it can be done within compute if
``do_preprocess`` is True and ``meta_info`` has been provided.
some way. This can be done via the ``preprocess`` function in ``distance_preprocess.py`` beforehand or it can
be done within compute if ``do_preprocess`` is True and ``meta_info`` has been provided.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compute is a function, right? It should be wrapped in "``" if so.



@overload
def preprocess(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can those functions have better names? preprocess is super generic, maybe something like preprocess_for_distance_calulation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair. Will change

holdout set represents real data that was NOT.
NOTE: Columns are not uniformly weighted. They are weighted by their inverse column entropy to provide
greater attention to rare data points. This is formally defined in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's missing something here... maybe a : or This is formally defined by the paper below:.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

filtered_real_data = real_data[self.numerical_columns]
filtered_synthetic_data = synthetic_data[self.numerical_columns]
filtered_holdout_data = holdout_data[self.numerical_columns] if holdout_data is not None else None
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit dangerous because if we ever add another EpsilonIdentifiabilityNorm element it will silently fall in here if we forget to modify this. Changing this to elif self.norm == EpsilonIdentifiabilityNorm.GOWER will be better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Changing now.

from midst_toolkit.evaluation.privacy.distance_utils import NormType, compute_top_k_distances


DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not have this as a module level variable? This variable is set every time this is imported by other modules and it's also set by other modules under the same name. Can we have this variable be defined in an utils module, or maybe be retruned by function, and then be reused here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that's worthwhile. Will add a function I think.

norm: Determines what norm the distances are computed in. Defaults to NormType.L2.
batch_size: Batch size used to compute the NNDR iteratively. Just needed to manage memory. Defaults to
1000.
device: What device the tensors should be sent to in order to perform the calculations. Defaults to DEVICE.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add some more context here: Defaults to "cuda" if CUDA is available, "cpu" othwerwise.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

If None, then no preprocessing is expected to be done. Defaults to None.
do_preprocess: Whether or not to preprocess the dataframes before performing the NNDR calculations.
Preprocessing is performed with the ``preprocess`` function of ``distance_preprocess.py``. Defaults to
False.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mention here that meta_info should be provided if this is set to True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants