Add Nearest Neighbor Distance Ratio and Epsilon Identifiability Privacy Metrics #42
Merged
Changes from all commits (34 commits)
adae16e First checkin of hellinger and pmse implementations (emersodb)
a790991 Fix typing issue (emersodb)
43834a4 Adding in Hitting Rate and Mean F1 Difference implementations. Also f… (emersodb)
bec6418 Removing hard coding (emersodb)
fb3c48a Some CR fixes from Marcelo's review (emersodb)
1705d50 Merge branch 'main' into dbe/add_hellinger_pmse (emersodb)
c286906 Train code split, Part 2: moving some of the model.py code into clust… (lotif)
072a09a Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate (emersodb)
81a40ba NNDR module and tests (emersodb)
c5a0682 Fixing small bug (emersodb)
704e48f Adding in the epsilon identifiability risk metric (emersodb)
884c582 Small code fixes and documentation improvements (emersodb)
59ea7f4 New mypy flow and fixes to typing issues that were discovered (emersodb)
660729f Merge branch 'dbe/fixing_mypy' into dbe/add_hellinger_pmse (emersodb)
4023d21 Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate (emersodb)
b985246 Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir (emersodb)
1b51d23 Some small updates (emersodb)
5bd360c Merge branch 'main' into dbe/add_hellinger_pmse (emersodb)
810f3d6 Adding in a bit more revealing testing for the categorical column det… (emersodb)
8328547 Merge branch 'main' into dbe/add_hellinger_pmse (emersodb)
2ccdfa6 Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate (emersodb)
0147eb1 Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir (emersodb)
2de4692 Merge branch 'main' into dbe/add_hellinger_pmse (emersodb)
c78973f Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate (emersodb)
1aa4e5a Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir (emersodb)
8918c35 Merge branch 'main' into dbe/add_hellinger_pmse (emersodb)
b2c5de5 Merge branch 'main' into dbe/add_hellinger_pmse (emersodb)
3defbf2 Addressing some PR comments (emersodb)
217c7a7 Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate (emersodb)
789dd7b Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir (emersodb)
500ce2b Addressing some PR comments. (emersodb)
231265c PR Comment changes (emersodb)
e41bcc7 Merge branch 'main' into dbe/add_nndr_and_eir (emersodb)
6b9d639 Dropping unused variable (emersodb)
New file (@@ -0,0 +1,4 @@):

import torch


DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
122 changes: 122 additions & 0 deletions
src/midst_toolkit/evaluation/privacy/distance_preprocess.py (@@ -0,0 +1,122 @@):
from typing import Any, overload

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

from midst_toolkit.evaluation.utils import extract_columns_based_on_meta_info


@overload
def preprocess_for_distance_computation(
    meta_info: dict[str, Any], synthetic_data: pd.DataFrame, real_data_train: pd.DataFrame
) -> tuple[pd.DataFrame, pd.DataFrame]: ...


@overload
def preprocess_for_distance_computation(
    meta_info: dict[str, Any],
    synthetic_data: pd.DataFrame,
    real_data_train: pd.DataFrame,
    real_data_test: pd.DataFrame,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: ...


def preprocess_for_distance_computation(
    meta_info: dict[str, Any],
    synthetic_data: pd.DataFrame,
    real_data_train: pd.DataFrame,
    real_data_test: pd.DataFrame | None = None,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame] | tuple[pd.DataFrame, pd.DataFrame]:
    """
    Preprocess Pandas dataframes in preparation for computing various record-to-record distances, such as
    distance-to-closest-record scores. Specifically, this function filters the provided raw dataframes to the
    appropriate numerical and categorical columns based on the information in the ``meta_info`` JSON. Numerical
    columns are normalized by the range (maximum minus minimum) of the corresponding column of the
    ``real_data_train`` numerical values. Categorical columns are one-hot encoded, with the encoder fitted on the
    concatenation of the corresponding columns from each dataset.

    Args:
        meta_info: JSON with meta information about the columns and their corresponding types that should be
            considered.
        synthetic_data: Dataframe containing all synthetically generated data.
        real_data_train: Dataframe containing the real training data associated with the model that generated the
            ``synthetic_data``.
        real_data_test: Dataframe containing the real test data. It is important that this data was not seen
            during training by the model that generated ``synthetic_data``. If None, no test data is preprocessed.
            Defaults to None.

    Returns:
        Processed dataframes for the synthetic data, the real training data, and, if it was provided, the real
        test data.
    """
    numerical_synthetic_data, categorical_synthetic_data = extract_columns_based_on_meta_info(
        synthetic_data, meta_info
    )
    numerical_real_data_train, categorical_real_data_train = extract_columns_based_on_meta_info(
        real_data_train, meta_info
    )

    numerical_ranges = [
        numerical_real_data_train[column].max() - numerical_real_data_train[column].min()
        for column in numerical_real_data_train.columns
    ]
    numerical_ranges_np = np.array(numerical_ranges)

    num_synthetic_data_np = numerical_synthetic_data.to_numpy()
    num_real_data_train_np = numerical_real_data_train.to_numpy()

    # Normalize the values of the numerical columns of the different datasets by the ranges of the train set.
    num_synthetic_data_np = num_synthetic_data_np / numerical_ranges_np
    num_real_data_train_np = num_real_data_train_np / numerical_ranges_np

    cat_synthetic_data_np = categorical_synthetic_data.to_numpy().astype("str")
    cat_real_data_train_np = categorical_real_data_train.to_numpy().astype("str")

    if real_data_test is not None:
        numerical_real_data_test, categorical_real_data_test = extract_columns_based_on_meta_info(
            real_data_test, meta_info
        )
        num_real_data_test_np = numerical_real_data_test.to_numpy()
        # Normalize by the same train-set ranges so all datasets share a common scale.
        num_real_data_test_np = num_real_data_test_np / numerical_ranges_np
        cat_real_data_test_np = categorical_real_data_test.to_numpy().astype("str")
    else:
        num_real_data_test_np, cat_real_data_test_np = None, None
    cat_real_data_test_oh: np.ndarray | None = None

    if categorical_real_data_train.shape[1] > 0:
        encoder = OneHotEncoder()
        if cat_real_data_test_np is not None:
            encoder.fit(np.concatenate((cat_synthetic_data_np, cat_real_data_train_np, cat_real_data_test_np), axis=0))
        else:
            encoder.fit(np.concatenate((cat_synthetic_data_np, cat_real_data_train_np), axis=0))

        cat_synthetic_data_oh = encoder.transform(cat_synthetic_data_np).toarray()
        cat_real_data_train_oh = encoder.transform(cat_real_data_train_np).toarray()
        if cat_real_data_test_np is not None:
            cat_real_data_test_oh = encoder.transform(cat_real_data_test_np).toarray()
    else:
        cat_synthetic_data_oh = np.empty((categorical_synthetic_data.shape[0], 0))
        cat_real_data_train_oh = np.empty((categorical_real_data_train.shape[0], 0))
        # Guard on cat_real_data_test_np rather than categorical_real_data_test, which is only
        # defined when real_data_test is provided.
        if cat_real_data_test_np is not None:
            cat_real_data_test_oh = np.empty((cat_real_data_test_np.shape[0], 0))

    processed_real_data_train = pd.DataFrame(
        np.concatenate((num_real_data_train_np, cat_real_data_train_oh), axis=1)
    ).astype(float)
    processed_synthetic_data = pd.DataFrame(
        np.concatenate((num_synthetic_data_np, cat_synthetic_data_oh), axis=1)
    ).astype(float)

    if real_data_test is None:
        return (processed_synthetic_data, processed_real_data_train)

    assert num_real_data_test_np is not None
    assert cat_real_data_test_oh is not None
    return (
        processed_synthetic_data,
        processed_real_data_train,
        pd.DataFrame(np.concatenate((num_real_data_test_np, cat_real_data_test_oh), axis=1)).astype(float),
    )
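As a standalone illustration (not part of the PR's code), the preprocessing strategy above can be sketched end to end on toy data: numerical columns are divided by the train-set range, and a one-hot encoder is fitted on the concatenation of all frames so every category seen in any of them gets a column. The column names and values here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames: one numerical column, one categorical column.
real_train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "color": ["r", "g", "r"]})
synthetic = pd.DataFrame({"age": [25.0, 35.0], "color": ["g", "b"]})

# Normalize numerical values by the train-set range (max - min = 20.0).
age_range = real_train["age"].max() - real_train["age"].min()
num_train = real_train[["age"]].to_numpy() / age_range
num_synth = synthetic[["age"]].to_numpy() / age_range

# Fit the encoder on the concatenation of both frames so categories present in
# either one ("r", "g", "b") are all represented in the one-hot columns.
encoder = OneHotEncoder()
encoder.fit(np.concatenate((synthetic[["color"]].to_numpy(), real_train[["color"]].to_numpy()), axis=0))
oh_train = encoder.transform(real_train[["color"]].to_numpy()).toarray()
oh_synth = encoder.transform(synthetic[["color"]].to_numpy()).toarray()

# Concatenate the normalized numerical and one-hot categorical blocks.
processed_train = pd.DataFrame(np.concatenate((num_train, oh_train), axis=1)).astype(float)
processed_synth = pd.DataFrame(np.concatenate((num_synth, oh_synth), axis=1)).astype(float)
print(processed_synth.shape)  # 1 normalized numerical column + 3 one-hot columns
```

With both frames on a shared scale, Euclidean distances between rows of `processed_synth` and `processed_train` become meaningful for metrics like distance to closest record.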
Moving this to its own module, as it is useful for computing NNDR as well.
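For context on the Nearest Neighbor Distance Ratio metric named in the PR title: for each synthetic record, NNDR is typically the distance to its nearest real record divided by the distance to its second-nearest real record, with ratios near zero flagging records suspiciously close to a single training example. The sketch below is an illustration of that idea, not the toolkit's implementation, and assumes Euclidean distance on already-preprocessed numeric arrays.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """For each synthetic row, return the distance to its nearest real record
    divided by the distance to its second-nearest real record."""
    nn = NearestNeighbors(n_neighbors=2).fit(real)
    # kneighbors returns distances sorted ascending, shape (n_synthetic, 2).
    distances, _ = nn.kneighbors(synthetic)
    # A tiny epsilon guards against division by zero on duplicate records.
    return distances[:, 0] / (distances[:, 1] + 1e-16)

real = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = np.array([[0.1, 0.0], [0.5, 0.5]])
ratios = nndr(synthetic, real)
# ratios[0] is small (first synthetic row hugs one real record);
# ratios[1] is 1.0 (equidistant from its two nearest real records).
print(ratios)
```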