Ensemble attack: Meta classifier pipeline #37
Conversation
* Add XGBoost training pipeline
* Add XGBoost and LR pipeline
* Metaclassifier training finalized
Lots of comments, but mostly on small things :)
"categories": [
"0",
"1",
"2",
Indentation is inconsistent from this line until the end of the array.
3
],
"file_type": "csv",
"data_path": "/projects/aieng/midst_competition/data/tabddpm/tabddpm_1/train.csv",
This seems to point to a path in the cluster. We should change it to something else, maybe a relative path to this example?
I think this comment no longer applies to this PR considering the recent updates. I have addressed this issue in my branch (upcoming PR), where we actually use this data_path.
@@ -0,0 +1,58 @@
{
"general": {
"data_dir": "/projects/aieng/midst_competition/data/tabddpm/tabddpm_1",
Same issue with paths in this file.
See above.
CLAVADDPM_WHITE_BOX = "clavaddpm_white_box"


def expand_ranges(ranges):
Missing type hints here.
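With type hints added, the signature might look like the sketch below (assuming inclusive `(start, end)` ranges; the actual semantics in the PR may differ):

```python
def expand_ranges(ranges: list[tuple[int, int]]) -> list[int]:
    """Expand a list of inclusive (start, end) ranges into a flat list of integers."""
    expanded: list[int] = []
    for start, end in ranges:
        # range() excludes the stop value, so add 1 to keep the end inclusive
        expanded.extend(range(start, end + 1))
    return expanded
```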
def expand_ranges(ranges):
    """
    Reads a list of tuples representing ranges and expands them into a flat list of integers.
I think "Receives" instead of "Reads" here is more appropriate.
Returns:
    A dataframe with the new distance-based features, indexed like df_input.
"""
cat_features = [col in cat_cols for col in df_input.columns]
"categorical_features" instead of "cat_features".
also col -> column
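With both renames applied, the line could read as in this tiny self-contained helper (the wrapper function name is hypothetical, added only so the snippet runs standalone):

```python
import pandas as pd


def categorical_feature_mask(df_input: pd.DataFrame, categorical_columns: list[str]) -> list[bool]:
    # Boolean mask over df_input's columns: True where the column is categorical
    return [column in categorical_columns for column in df_input.columns]
```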
cat_cols: A list of categorical column names.
Returns:
    A dataframe with the new distance-based features, indexed like df_input.
We should also describe the structure of this dataframe, so someone using it for the first time doesn't have to inspect the code beforehand. For instance, mention that it has columns such as min_gower_distance and nndr, and what each is supposed to contain.
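For instance, the Returns section could spell out the columns; a sketch (the column semantics below are assumptions based on the names mentioned in the review):

```python
"""
Returns:
    A dataframe indexed like df_input, with one row per input record and
    distance-based feature columns such as:
        min_gower_distance: Gower distance to the closest synthetic record.
        nndr: nearest-neighbour distance ratio (closest / second-closest distance).
"""
```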
return pd.DataFrame(features, index=df_input.index)


def calculate_domias(df_input: pd.DataFrame, df_synth: pd.DataFrame, df_ref: pd.DataFrame) -> np.ndarray:
A better name would be calculate_domias_scores, maybe?
Also, same comment here about spelling out parameter names.
I feel like all internal variables in this function could benefit from better names.
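For context, DOMIAS scores a record by the density ratio p_synth(x) / p_ref(x): members of the training set tend to be over-represented in the synthetic data relative to a reference population. A simplified pure-numpy sketch with descriptive names (fixed-bandwidth Gaussian kernel density; this is illustrative, not the PR's implementation):

```python
import numpy as np


def calculate_domias_scores(
    input_records: np.ndarray,
    synthetic_records: np.ndarray,
    reference_records: np.ndarray,
    bandwidth: float = 0.5,
) -> np.ndarray:
    """DOMIAS-style scores: estimated density ratio p_synth(x) / p_ref(x)."""

    def density(points: np.ndarray, data: np.ndarray) -> np.ndarray:
        # Average of isotropic Gaussian kernels centered on each data record
        differences = points[:, None, :] - data[None, :, :]
        squared_distances = (differences**2).sum(axis=-1)
        return np.exp(-squared_distances / (2 * bandwidth**2)).mean(axis=1)

    epsilon = 1e-12  # guard against division by zero in sparse regions
    return density(input_records, synthetic_records) / (density(input_records, reference_records) + epsilon)
```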
@@ -0,0 +1,24 @@
# Possible training utilities for ensemble attacks.
Make it into a docstring, but remove "Possible".
that a data point is a member.
:param max_fpr: threshold on the FPR.
return: The TPR at `max_fpr` FPR.
Docstring format should be google style.
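A Google-style version of that docstring might look like the sketch below, attached to a pure-numpy implementation of the metric (the function body is illustrative, not the PR's code):

```python
import numpy as np


def get_tpr_at_fpr(true_labels: np.ndarray, predicted_probabilities: np.ndarray, max_fpr: float = 0.1) -> float:
    """Computes the best true positive rate achievable at a bounded false positive rate.

    Args:
        true_labels: Binary membership labels (1 = member, 0 = non-member).
        predicted_probabilities: Predicted probability that a data point is a member.
        max_fpr: Threshold on the FPR.

    Returns:
        The TPR at `max_fpr` FPR.
    """
    # Sweep thresholds by sorting scores in descending order
    order = np.argsort(-np.asarray(predicted_probabilities))
    labels = np.asarray(true_labels)[order]
    true_positives = np.cumsum(labels)
    false_positives = np.cumsum(1 - labels)
    tpr = true_positives / max(true_positives[-1], 1)
    fpr = false_positives / max(false_positives[-1], 1)
    achievable = fpr <= max_fpr
    return float(tpr[achievable].max()) if achievable.any() else 0.0
```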
Force-pushed from e93dc1e to f3c2aee
Your tests are very extensive and I think this is a good start, but we definitely want to bring in certain elements of our conventions and documentation.
]


# Training settings (placeholder)
Maybe this will become clear later, but I'm not sure what we mean by "placeholder" here and below.
For the "Training" settings, I'm assuming the numbers we've assigned are placeholder numbers. But for the "Metaclassifier" settings, I'm not sure if I want to keep the "epoch" variable because it's not used in the current implementation. Couldn't think of a better name that would capture the temporary nature of it.
I'm a little confused as to what is happening here, in that some of these files already exist in the ensemble_attack/ example folder. It seems like we're duplicating it?
I'll wait to review these files until after we get that sorted out 🙂
@@ -0,0 +1,141 @@
# Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo
I would create a link to the repository specifically.
@@ -0,0 +1,24 @@
# Possible training utilities for ensemble attacks.
We already have a metric for this in the library to use 🙂
src/midst_toolkit/evaluation/privacy/mia_scoring.py
Ah, I see why you need this. I think we should leave it here then.
from midst_toolkit.attacks.ensemble.distance_features import calculate_domias, calculate_gower_features
from midst_toolkit.attacks.ensemble.train_utils import get_tpr_at_fpr
from midst_toolkit.attacks.ensemble.XGBoost import XGBoostHyperparameterTuner
Our file naming convention is all lower case (xgboost.py), and our class naming convention is camel case, even for acronyms. So we'd name the class XgBoostHyperparameterTuner.
Initializes the tuner with data and column information.
:param x: Input features as a DataFrame.
:param y: Target variable as a numpy array.
:param use_gpu: Whether to use GPU acceleration.
Also expand x and y to something like input_features and target_variable.
"""Groups all tests for the BlendingPlusPlus class."""

## Test __init__ ##
# ------------------
I don't think this guy above is that useful?
from hydra import compose, initialize
from omegaconf import DictConfig

# The class to be tested
I don't think we need this comment.
from midst_toolkit.attacks.ensemble.blending import BlendingPlusPlus


# --- Fixtures: Reusable setup code for tests ---
I don't think we need this comment.
with pytest.raises(ValueError, match="meta_classifier_type must be 'lr' or 'xgb'"):
    BlendingPlusPlus(data_configs=mock_data_configs, meta_classifier_type="svm")

## Test _prepare_meta_features ##
Most of these kinds of comments aren't super necessary.
Really nice implementation of the blending++ pipeline! I also double-checked your implementation against the original ensemble codebase, and it matches perfectly 🙂.
I’ve left some comments, and I’ll address the ones where you’ve tagged me.
One suggestion (feel free to ignore): to clarify the different dataframes in this pipeline, in addition to docstrings, we could also reference the descriptions and diagrams in the example README.
trans_json_file_path: ${base_example_dir}/data_configs/trans.json
population_sample_size: 40000

# Metadata for real data
Not sure if this is a good idea, but can we create a trans_metadata.json or real_metadata.json file to store this information? I am suggesting this because we have several data config files, like trans_domain.json and info.json, with similar types of metadata. We can keep this config.yaml for attack-pipeline-related configurations only. Then we can set the path to this metadata json file here, similar to trans_domain_file_path, and load it in the BlendingPlusPlus init.
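As a sketch of that suggestion (the file name and config key below are hypothetical), config.yaml would only point at the metadata file, and the "categorical" / "variable_to_predict" entries would move into it:

```yaml
# config.yaml (hypothetical key): reference the metadata instead of inlining it
real_metadata_file_path: ${base_example_dir}/data_configs/real_metadata.json
```

BlendingPlusPlus would then load real_metadata.json in its init, alongside the existing trans_domain.json handling.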
log(INFO, "Data processing pipeline finished.")
if config.pipeline.run_data_processing:
    run_data_processing(config)
elif config.pipeline.run_metaclassifier_training:
We can probably just have another if here instead of elif, since someone might want to run data processing as well as meta classifier training.
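With independent if blocks the two stages compose in a single run; a minimal sketch with a stand-in for the Hydra config (names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class PipelineFlags:
    run_data_processing: bool
    run_metaclassifier_training: bool


def run_pipeline(flags: PipelineFlags) -> list[str]:
    # Independent `if` blocks (not if/elif), so both stages can run in one invocation
    stages_run = []
    if flags.run_data_processing:
        stages_run.append("data_processing")
    if flags.run_metaclassifier_training:
        stages_run.append("metaclassifier_training")
    return stages_run
```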
Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

df_synth = load_dataframe(
Is this the synthetic data generated from the TabDDPM model trained on the df_real_train data saved at Path(processed_attack_data_path, "real_train.csv")? Since we still haven't added the code for that, a comment saying where this synth data comes from would be helpful.
For now, this synthetic data has come from the original repository because I just wanted to test that it would run. I did the same with the RMIA signals file, only using them as placeholders. I'll add the comment for future reference, though. Thanks!
raise ValueError("meta_classifier_type must be 'lr' or 'xgb'")
self.meta_classifier_type = meta_classifier_type
self.data_configs = data_configs
self.meta_classifier_ = None  # The trained model, underscore denotes fitted attribute
I also haven't seen this before. Interested to know if it is commonly used.
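For reference, the trailing underscore is a scikit-learn convention: attributes ending in a single underscore (coef_, classes_, etc.) are populated by fit rather than __init__, so their presence signals a fitted estimator. A toy illustration (not code from this PR):

```python
class TinyEstimator:
    """Toy estimator illustrating the scikit-learn fitted-attribute convention."""

    def __init__(self) -> None:
        # Trailing underscore signals "set by fit()", mirroring sklearn's coef_, classes_, etc.
        self.mean_: float | None = None

    def fit(self, values: list[float]) -> "TinyEstimator":
        self.mean_ = sum(values) / len(values)
        return self
```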
# 3. Get RMIA signals (placeholder)
rmia_signals = pd.read_csv(
    "examples/ensemble_attack/data/attack_data/og_rmia_train_meta_pred.csv"
I know this is temporary because we still don't have the RMIA computation pipeline, but can we get this path from config.yaml instead of fixing it here?
I'd rather keep it like this for now because I'm not passing config.yaml to blending.py and don't plan on doing it. It will be resolved soon though!
Got it! Makes sense 👍
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=10, seed=np.random.randint(1000)),
I see this is the random seed used in the original code, but as David suggested we can make it an option for the class and maybe pass here the user specified random_seed (it could be the one user sets in config.yaml).
colsample_bytree=trial.suggest_float("colsample_bylevel", 0.5, 1),
reg_alpha=trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10]),
reg_lambda=trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100]),
tree_method="auto",
Out of curiosity, was there an issue with tree_method="auto" if not use_gpu else "gpu_hist"?
Nope. I wanted it to run on my local without taking up GPU space. Should've just changed use_gpu to false in config.yaml :)
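A small helper expressing that conditional (sketch; "gpu_hist" is the GPU histogram method referenced in the discussion):

```python
def choose_tree_method(use_gpu: bool) -> str:
    # Select the XGBoost tree construction algorithm based on a config flag
    return "gpu_hist" if use_gpu else "auto"
```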
.gitignore (Outdated)
# Trained metaclassifiers
examples/ensemble_attack/trained_models/
examples/ensemble_attack/attack_results/
examples/ensemble_attack_example/
I think we don't have this last one anymore
"categorical": ["trans_type", "operation", "k_symbol", "bank"]
"variable_to_predict": "trans_type"

col_type:
I think I am missing where col_type and bounds are used. Maybe adding some comments here would be helpful.
"examples/ensemble_attack_example/data/attack_data/og_rmia_train_meta_pred.csv"
)  # Placeholder for RMIA features

continuous_features = df_input.loc[
Seconding David's comment
Short Description
Clickup Ticket(s): (https://app.clickup.com/t/868fm2hg6)
Both choices of meta classifier can be trained and saved as pickle files. Predictions can also be made and evaluated using TPR@FPR. The PR includes preprocessing methods for deriving training features as well (e.g., Gower distance, the DOMIAS method).
Things to note:
Tests Added
Tests have been added for appropriate model fitting, data preprocessing, and the prediction function.