Add Nearest Neighbor Distance Ratio and Epsilon Identifiability Privacy Metrics #42
base: dbe/add_f1_dff_hitting_rate
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


@overload
Moving this to its own module, as it is useful for computing NNDR as well.
``target_data.``
"""
if batch_size is None:
# If batch size isn't specified, do it all at once.
Small bug (it didn't make the code wrong, it just gave a bad batch size estimate).
numerical_column_idx = numerical_column_idx + target_col_idx
else:
categorical_column_idx = categorical_column_idx + target_col_idx
if "target_col_idx" in meta_info:
Loosening this preprocessing to admit settings where no target column exists.
def test_get_categorical_columns() -> None:
# Low threshold
categorical_columns = get_categorical_columns(TEST_DATAFRAME, 2)
# Note that this does not include the date time column, as it isn't a categorical, as the detection algorithm
The `get_categorical_columns` functionality here is a convenience function, and it's unclear how a user would want to treat datetime objects, so we're sort of punting here. Ideally a user would have done "something" to make their datetime objects numerical or categorical (datetime objects can generally exist on a spectrum between these).
I just have a few comments that might help improve the readability of the documentation. The code itself looks good to me :)
src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py
Approved with small comments.
some way. This can be done via the ``preprocess`` function beforehand or it can be done within compute if
``do_preprocess`` is True and ``meta_info`` has been provided.
some way. This can be done via the ``preprocess`` function in ``distance_preprocess.py`` beforehand or it can
be done within compute if ``do_preprocess`` is True and ``meta_info`` has been provided.
`compute` is a function, right? If so, it should be wrapped in double backticks (``).
@overload
def preprocess(
Can those functions have better names? `preprocess` is super generic, maybe something like `preprocess_for_distance_calculation`?
Yeah, that's fair. Will change.
holdout set represents real data that was NOT.
NOTE: Columns are not uniformly weighted. They are weighted by their inverse column entropy to provide
greater attention to rare data points. This is formally defined in
Looks like it's missing something here... maybe a `:` or `This is formally defined by the paper below:`.
Added.
filtered_real_data = real_data[self.numerical_columns]
filtered_synthetic_data = synthetic_data[self.numerical_columns]
filtered_holdout_data = holdout_data[self.numerical_columns] if holdout_data is not None else None
else:
This is a bit dangerous: if we ever add another `EpsilonIdentifiabilityNorm` element and forget to modify this, it will silently fall into this branch. Changing this to `elif self.norm == EpsilonIdentifiabilityNorm.GOWER` would be better.
Good point. Changing now.
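For reference, a minimal sketch of the explicit-branch pattern suggested above. Only `EpsilonIdentifiabilityNorm.GOWER` and the numerical-column filtering appear in this PR; the `L2` member name, the `select_columns` helper, and the choice to keep all columns for Gower are assumptions made for illustration.

```python
from enum import Enum


class EpsilonIdentifiabilityNorm(Enum):
    # Member names are assumptions; only GOWER is mentioned in the review thread.
    L2 = "l2"
    GOWER = "gower"


def select_columns(norm, numerical_columns, all_columns):
    """Hypothetical helper: pick the columns used for the distance computation."""
    if norm == EpsilonIdentifiabilityNorm.L2:
        # Numerical-only filtering, mirroring the branch shown in the diff.
        return list(numerical_columns)
    elif norm == EpsilonIdentifiabilityNorm.GOWER:
        # Assumption: the Gower branch keeps every column.
        return list(all_columns)
    else:
        # A newly added norm fails loudly instead of silently reusing a branch.
        raise ValueError(f"Unsupported norm: {norm!r}")
```

With this shape, forgetting to handle a new enum member surfaces as a `ValueError` at call time rather than as a silently wrong metric.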
from midst_toolkit.evaluation.privacy.distance_utils import NormType, compute_top_k_distances


DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
Can we not have this as a module-level variable? It is set every time this module is imported by other modules, and other modules also define it under the same name. Can we define it in a utils module, or maybe have it returned by a function, and then reuse it here?
Yeah I think that's worthwhile. Will add a function I think.
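One possible shape for such a function, as a sketch only: the name `get_default_device` and the use of `lru_cache` are assumptions, not what the PR ends up implementing.

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=1)
def get_default_device() -> torch.device:
    """Hypothetical shared helper: resolve the default device once and reuse it."""
    # Same selection logic as the module-level DEVICE constant in the diff.
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
```

Callers would then use `device = get_default_device()` instead of each module defining its own `DEVICE`.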
norm: Determines what norm the distances are computed in. Defaults to NormType.L2.
batch_size: Batch size used to compute the NNDR iteratively. Just needed to manage memory. Defaults to
1000.
device: What device the tensors should be sent to in order to perform the calculations. Defaults to DEVICE.
I'd add some more context here: Defaults to "cuda" if CUDA is available, "cpu" otherwise.
👍
If None, then no preprocessing is expected to be done. Defaults to None.
do_preprocess: Whether or not to preprocess the dataframes before performing the NNDR calculations.
Preprocessing is performed with the ``preprocess`` function of ``distance_preprocess.py``. Defaults to
False.
We should mention here that `meta_info` should be provided if this is set to True.
Done!
PR Type
Feature
Short Description
Clickup Ticket(s): Link(s) if applicable.
This PR adds the last two metrics from the SynthEval integration into the library: the Nearest Neighbor Distance Ratio (NNDR) and Epsilon Identifiability privacy metrics.
NOTE: The SynthEval implementation is flawed. Rather than computing the NNDR from synthetic data points to real data points, it computes the NNDR from real data points to synthetic data points. This is not what you want when measuring privacy (see https://arxiv.org/pdf/2501.03941). The latter computation is, however, correct if you want to do membership inference (i.e. as Sara and Fatemeh are working on).
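To make the direction concrete, here is a minimal pure-Python sketch of the synthetic-to-real NNDR under the L2 norm, using the usual definition (distance to the nearest neighbor divided by distance to the second-nearest). The function name and the handling of a zero second-nearest distance are assumptions for illustration; this is not the batched implementation in this PR.

```python
import math


def nearest_neighbor_distance_ratios(synthetic, real):
    """For each synthetic point, return d1 / d2, where d1 and d2 are the L2
    distances to its nearest and second-nearest points in the real data."""
    if len(real) < 2:
        raise ValueError("Need at least two real points to form a ratio.")
    ratios = []
    for point in synthetic:
        # Queries come from the synthetic data; neighbors are searched in the real data.
        distances = sorted(math.dist(point, other) for other in real)
        nearest, second_nearest = distances[0], distances[1]
        # Assumption: when both distances are zero, report a ratio of 1.0.
        ratios.append(nearest / second_nearest if second_nearest > 0 else 1.0)
    return ratios
```

Swapping the two arguments gives the real-to-synthetic variant described above, which generally yields different values because the query and neighbor sets change roles.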
Tests Added
Added a fair number of tests to verify the computations.