
Conversation

@sarakodeiri (Collaborator) commented on Sep 16, 2025

Short Description

Clickup Ticket(s): (https://app.clickup.com/t/868fm2hg6)

Both choices of meta classifier can be trained and saved as pickle files. Predictions can also be made and evaluated using TPR@FPR. The PR also includes preprocessing methods for computing the training features (e.g., Gower distance, the DOMIAS method).
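As a rough, self-contained sketch of the train → pickle → predict → evaluate flow described above (toy data and a scikit-learn logistic regression stand in for the PR's actual meta classifier API; none of the names below come from the PR):

import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Toy data standing in for the meta-classifier features (e.g. Gower/DOMIAS signals).
rng = np.random.default_rng(0)
train_features, train_labels = rng.normal(size=(200, 4)), rng.integers(0, 2, size=200)
test_features, test_labels = rng.normal(size=(100, 4)), rng.integers(0, 2, size=100)

# Train one of the meta-classifier choices and save it as a pickle file.
meta_classifier = LogisticRegression().fit(train_features, train_labels)
with open("meta_classifier.pkl", "wb") as f:
    pickle.dump(meta_classifier, f)

# Predict membership scores and evaluate with TPR at a fixed FPR (here 0.1).
scores = meta_classifier.predict_proba(test_features)[:, 1]
fpr, tpr, _ = roc_curve(test_labels, scores)
tpr_at_fpr = np.interp(0.1, fpr, tpr)
print(f"TPR@FPR=0.1: {tpr_at_fpr:.3f}")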

Things to note:

  • The pipeline has been tested with the data provided by the original repository, which is very small. More thorough tests are required once all parts of the pipeline are finalized.
  • The original repository has an "iteration" count but doesn't use it properly. The same behavior has been replicated here.
  • This implementation also stays faithful to the original repository's use of DOMIAS, but it might be better to apply DOMIAS only to non-categorical data; that should also be tested once the pipeline is finalized.

Tests Added

Tests have been added for appropriate model fitting, data preprocessing, and the prediction function.

@sarakodeiri marked this pull request as ready for review on September 17, 2025.
@lotif (Collaborator) left a comment:

Lots of comments, but mostly on small things :)

"categories": [
"0",
"1",
"2",
Collaborator:

Indentation is inconsistent from this line until the end of the array.

3
],
"file_type": "csv",
"data_path": "/projects/aieng/midst_competition/data/tabddpm/tabddpm_1/train.csv",
Collaborator:

This seems to point to a path in the cluster. We should change it to something else, maybe a relative path to this example?

Collaborator:

I think this comment no longer applies to this PR considering the recent updates. I have addressed this issue in my branch (upcoming PR), where we actually use this data_path.

@@ -0,0 +1,58 @@
{
"general": {
"data_dir": "/projects/aieng/midst_competition/data/tabddpm/tabddpm_1",
Collaborator:

Same issue with paths in this file.

Collaborator:

See above.

CLAVADDPM_WHITE_BOX = "clavaddpm_white_box"


def expand_ranges(ranges):
Collaborator:

Missing type hints here.


def expand_ranges(ranges):
"""
Reads a list of tuples representing ranges and expands them into a flat list of integers.
Collaborator:

I think Receives instead of Reads here is more appropriate.
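For reference, a minimal sketch of what the suggested type hints and wording could look like (the return type and the inclusive-range assumption in the body are guesses, not taken from the PR):

def expand_ranges(ranges: list[tuple[int, int]]) -> list[int]:
    """
    Receives a list of tuples representing ranges and expands them into a flat list of integers.

    Args:
        ranges: A list of (start, end) tuples; assumed here to be inclusive on both ends.

    Returns:
        A flat list of every integer covered by the input ranges.
    """
    # Assumes inclusive ranges; drop the +1 if the PR treats the end as exclusive.
    return [value for start, end in ranges for value in range(start, end + 1)]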

Returns:
A dataframe with the new distance-based features, indexed like df_input.
"""
cat_features = [col in cat_cols for col in df_input.columns]
Collaborator:

categorical_features instead of cat_features.

Collaborator:

also col -> column
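Applying both renames to the quoted line would look roughly like this (the surrounding names come from the quoted snippet):

categorical_features = [column in cat_cols for column in df_input.columns]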

cat_cols: A list of categorical column names.
Returns:
A dataframe with the new distance-based features, indexed like df_input.
Collaborator:

We should also describe the structure of this dataframe, so the person using this for the first time doesn't have to inspect the code beforehand. For instance, mention that it has the columns min_gower_distance, nndr, etc., and what they are supposed to contain.
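For example, the Returns section could spell the columns out along these lines (the exact column set and descriptions are assumptions to be matched against the implementation):

    Returns:
        A dataframe indexed like df_input with one distance-based feature per column, for example:
            - min_gower_distance: the smallest Gower distance from each input record to the synthetic data.
            - nndr: the nearest-neighbour distance ratio (closest vs. second-closest distance), assuming
              that is how nndr is computed here.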

return pd.DataFrame(features, index=df_input.index)


def calculate_domias(df_input: pd.DataFrame, df_synth: pd.DataFrame, df_ref: pd.DataFrame) -> np.ndarray:
Collaborator:

A better name would be calculate_domias_scores maybe?

Collaborator:

Also, same comment here about spelling out parameter names.

Collaborator:

I feel like all internal variables in this function could benefit from better names.
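Taken together, the rename suggestions could give a signature along these lines (a sketch only; the spelled-out parameter names are suggestions, not existing code):

def calculate_domias_scores(
    df_input: pd.DataFrame, df_synthetic: pd.DataFrame, df_reference: pd.DataFrame
) -> np.ndarray:
    ...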

@@ -0,0 +1,24 @@
# Possible training utilities for ensemble attacks.
Collaborator:

Make it into a docstring, but remove Possible.

that a data point is a member.
:param max_fpr: threshold on the FPR.
return: The TPR at `max_fpr` FPR.
Collaborator:

Docstring format should be Google style.
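A rough sketch of the quoted fragment in Google style (the first parameter's name is assumed, since it isn't visible in the quoted lines):

    Args:
        prediction_scores: Predicted probabilities that a data point is a member.
        max_fpr: Threshold on the FPR.

    Returns:
        The TPR at `max_fpr` FPR.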

@emersodb (Collaborator) left a comment:

Your tests are very extensive and I think this is a good start, but we definitely want to bring in certain elements of our conventions and documentation.

]


# Training settings (placeholder)
Collaborator:

Maybe this will become clear later, but I'm not sure what we mean by "placeholder" here and below.

Collaborator (author):

For the "Training" settings, I'm assuming the numbers we've assigned are placeholder numbers. But for the "Metaclassifier" settings, I'm not sure if I want to keep the "epoch" variable because it's not used in the current implementation. Couldn't think of a better name that would capture the temporary nature of it.

Collaborator:

I'm a little confused about what is happening here: some of these files already exist in the ensemble_attack/ example folder. It seems like we're duplicating them?

Collaborator:

I'll wait to review these files until after we get that sorted out 🙂

@@ -0,0 +1,141 @@
# Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo
Collaborator:

I would create a link to the repository specifically.

@@ -0,0 +1,24 @@
# Possible training utilities for ensemble attacks.
Collaborator:

We already have a metric for this in the library to use 🙂
src/midst_toolkit/evaluation/privacy/mia_scoring.py

Collaborator:

Ah, I see why you need this. I think we should leave it here then.


from midst_toolkit.attacks.ensemble.distance_features import calculate_domias, calculate_gower_features
from midst_toolkit.attacks.ensemble.train_utils import get_tpr_at_fpr
from midst_toolkit.attacks.ensemble.XGBoost import XGBoostHyperparameterTuner
Collaborator:

Our file naming convention is all lower case (xgboost.py) and our class naming convention is camel case, even for acronyms. So we'd name the class XgBoostHyperparameterTuner.

Initializes the tuner with data and column information.
:param x: Input features as a DataFrame.
:param y: Target variable as a numpy array.
:param use_gpu: Whether to use GPU acceleration.
Collaborator:

Also expand x and y to something like input_features and target_variable
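Combining the file/class naming and parameter naming suggestions, the constructor could look roughly like this (the default for use_gpu and the type annotations are assumptions):

# xgboost.py
import numpy as np
import pandas as pd


class XgBoostHyperparameterTuner:
    def __init__(self, input_features: pd.DataFrame, target_variable: np.ndarray, use_gpu: bool = False) -> None:
        """
        Initializes the tuner with data and column information.

        Args:
            input_features: Input features as a DataFrame.
            target_variable: Target variable as a numpy array.
            use_gpu: Whether to use GPU acceleration.
        """
        self.input_features = input_features
        self.target_variable = target_variable
        self.use_gpu = use_gpu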

"""Groups all tests for the BlendingPlusPlus class."""

## Test __init__ ##
# ------------------
Collaborator:

I don't think this guy above is that useful?

from hydra import compose, initialize
from omegaconf import DictConfig

# The class to be tested
Collaborator:

I don't think we need this comment.

from midst_toolkit.attacks.ensemble.blending import BlendingPlusPlus


# --- Fixtures: Reusable setup code for tests ---
Collaborator:

I don't think we need this comment.

with pytest.raises(ValueError, match="meta_classifier_type must be 'lr' or 'xgb'"):
BlendingPlusPlus(data_configs=mock_data_configs, meta_classifier_type="svm")

## Test _prepare_meta_features ##
Collaborator:

Most of these style of comments aren't super necessary.

@fatemetkl (Collaborator) left a comment:

Really nice implementation of the blending++ pipeline! I also double-checked your implementation against the original ensemble codebase, and it matches perfectly 🙂.

I’ve left some comments, and I’ll address the ones where you’ve tagged me.
One suggestion (feel free to ignore): to clarify the different dataframes in this pipeline, in addition to docstrings, we could also reference the descriptions and diagrams in the example README.

trans_json_file_path: ${base_example_dir}/data_configs/trans.json
population_sample_size: 40000

# Metadata for real data
Collaborator:

Not sure if this is a good idea, but can we create a trans_metadata.json or real_metadata.json file to store this information? I am suggesting this because we have several data config files like trans_domain.json and info.json with similar types of metadata. We could keep this config.yaml for only attack-pipeline-related configurations, set the path to the metadata JSON file here, similar to trans_domain_file_path, and load it in the BlendingPlusPlus init.
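On the loading side, the suggestion could look roughly like this sketch (real_metadata_file_path and real_metadata.json are hypothetical names, not existing config keys):

import json
from pathlib import Path

from omegaconf import DictConfig


class BlendingPlusPlus:
    def __init__(self, data_configs: DictConfig, meta_classifier_type: str) -> None:
        # Hypothetical key pointing at a real_metadata.json, set in config.yaml
        # alongside trans_domain_file_path.
        metadata_path = Path(data_configs.real_metadata_file_path)
        self.real_data_metadata = json.loads(metadata_path.read_text())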

log(INFO, "Data processing pipeline finished.")
if config.pipeline.run_data_processing:
run_data_processing(config)
elif config.pipeline.run_metaclassifier_training:
Collaborator:

We can probably just have another if here instead of elif since someone might want to run data processing as well as meta classifier training.
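That is, something along these lines so both stages can run in one invocation (run_metaclassifier_training as the second stage's entry point is an assumption here):

# config, run_data_processing and run_metaclassifier_training come from the surrounding module.
if config.pipeline.run_data_processing:
    run_data_processing(config)
if config.pipeline.run_metaclassifier_training:
    run_metaclassifier_training(config)  # assumed name for the training entry point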

Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

df_synth = load_dataframe(
Collaborator:

Is this the synthetic data generated from the TabDDPM model trained on the df_real_train data saved at Path(processed_attack_data_path, "real_train.csv")? Maybe a comment to say where this synth data is coming from would be helpful, since we still have not added the code for that.

Collaborator (author):

For now, this synthetic data has come from the original repository because I just wanted to test that it would run. I did the same with the RMIA signals file, only using them as placeholders. I'll add the comment for future reference, though. Thanks!

raise ValueError("meta_classifier_type must be 'lr' or 'xgb'")
self.meta_classifier_type = meta_classifier_type
self.data_configs = data_configs
self.meta_classifier_ = None # The trained model, underscore denotes fitted attribute
Collaborator:

I also haven't seen this before. Interested to know if it is commonly used.


# 3. Get RMIA signals (placeholder)
rmia_signals = pd.read_csv(
"examples/ensemble_attack/data/attack_data/og_rmia_train_meta_pred.csv"
Collaborator:

I know this is temporary because we still don't have the RMIA computation pipeline, but can we get this path from config.yaml instead of fixing it here?

Collaborator (author):

I'd rather keep it like this for now because I'm not passing config.yaml to blending.py and don't plan on doing it. It will be resolved soon though!

Collaborator:

Got it! Makes sense 👍


study = optuna.create_study(
direction="maximize",
sampler=optuna.samplers.TPESampler(n_startup_trials=10, seed=np.random.randint(1000)),
Collaborator:

I see this is the random seed used in the original code, but as David suggested, we can make it an option for the class and pass the user-specified random_seed here (it could be the one the user sets in config.yaml).
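A minimal sketch of that option, assuming a random_seed parameter is added and falls back to the original behaviour when unset (the wrapper function here is illustrative, not existing code):

import numpy as np
import optuna


def create_tuning_study(random_seed: int | None = None) -> optuna.study.Study:
    # Use the user-specified seed if given, otherwise keep the original random draw.
    seed = random_seed if random_seed is not None else np.random.randint(1000)
    return optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(n_startup_trials=10, seed=seed),
    )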

colsample_bytree=trial.suggest_float("colsample_bylevel", 0.5, 1),
reg_alpha=trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10]),
reg_lambda=trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100]),
tree_method="auto",
Collaborator:

Out of curiosity, was there an issue with tree_method="auto" if not use_gpu else "gpu_hist"?

Collaborator (author):

Nope. I wanted it to run on my local machine without taking up GPU space. Should've just changed use_gpu to false in config.yaml :)

.gitignore Outdated
# Trained metaclassifiers
examples/ensemble_attack/trained_models/
examples/ensemble_attack/attack_results/
examples/ensemble_attack_example/
Collaborator:

I think we don't have this last one anymore

"categorical": ["trans_type", "operation", "k_symbol", "bank"]
"variable_to_predict": "trans_type"

col_type:
Collaborator:

I think I am missing where col_type and bounds are used. Maybe adding some comments here would be helpful.

"examples/ensemble_attack_example/data/attack_data/og_rmia_train_meta_pred.csv"
) # Placeholder for RMIA features

continuous_features = df_input.loc[
Collaborator:

Seconding David's comment
