Ensemble attack: Meta classifier pipeline #37
Conversation
* Add XGBoost training pipeline
* Add XGBoost and LR pipeline
* Metaclassifier training finalized
Lots of comments, but mostly on small things :)
"categories": [
"0",
"1",
"2",
Indentation is inconsistent from this line until the end of the array.
3
],
"file_type": "csv",
"data_path": "/projects/aieng/midst_competition/data/tabddpm/tabddpm_1/train.csv",
This seems to point to a path in the cluster. We should change it to something else, maybe a relative path to this example?
I think this comment no longer applies to this PR considering the recent updates. I have addressed this issue in my branch (upcoming PR), where we actually use this data_path.
@@ -0,0 +1,58 @@
{
"general": {
"data_dir": "/projects/aieng/midst_competition/data/tabddpm/tabddpm_1",
Same issue with paths in this file.
See above.
CLAVADDPM_WHITE_BOX = "clavaddpm_white_box"


def expand_ranges(ranges):
Missing type hints here.
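With type hints added, the signature might look like the sketch below (assuming inclusive `(start, end)` ranges; the actual semantics in the PR may differ):

```python
def expand_ranges(ranges: list[tuple[int, int]]) -> list[int]:
    """Expand a list of inclusive (start, end) ranges into a flat list of integers."""
    expanded: list[int] = []
    for start, end in ranges:
        # range() excludes the stop value, so add 1 to keep the end inclusive
        expanded.extend(range(start, end + 1))
    return expanded
```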
def expand_ranges(ranges):
    """
    Reads a list of tuples representing ranges and expands them into a flat list of integers.
I think "Receives" instead of "Reads" here is more appropriate.
Returns:
    A dataframe with the new distance-based features, indexed like df_input.
"""
cat_features = [col in cat_cols for col in df_input.columns]
"categorical_features" instead of "cat_features".
also col -> column
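With both renames applied, the line could read as in this tiny self-contained helper (the wrapper function name is hypothetical, added only so the snippet runs standalone):

```python
import pandas as pd


def categorical_feature_mask(df_input: pd.DataFrame, categorical_columns: list[str]) -> list[bool]:
    # Boolean mask over df_input's columns: True where the column is categorical
    return [column in categorical_columns for column in df_input.columns]
```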
cat_cols: A list of categorical column names.
Returns:
    A dataframe with the new distance-based features, indexed like df_input.
We should also describe the structure of this dataframe, so someone using it for the first time doesn't have to inspect the code beforehand. For instance, mention that it has columns such as min_gower_distance and nndr, and what each is supposed to contain.
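For instance, the Returns section could spell out the columns; a sketch (the column semantics below are assumptions based on the names mentioned in the review):

```python
"""
Returns:
    A dataframe indexed like df_input, with one row per input record and
    distance-based feature columns such as:
        min_gower_distance: Gower distance to the closest synthetic record.
        nndr: nearest-neighbour distance ratio (closest / second-closest distance).
"""
```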
return pd.DataFrame(features, index=df_input.index)


def calculate_domias(df_input: pd.DataFrame, df_synth: pd.DataFrame, df_ref: pd.DataFrame) -> np.ndarray:
A better name would be calculate_domias_scores, maybe?
Also, same comment here about spelling out parameter names.
I feel like all internal variables in this function could benefit from better names.
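For context, DOMIAS scores a record by the density ratio p_synth(x) / p_ref(x): members of the training set tend to be over-represented in the synthetic data relative to a reference population. A simplified pure-numpy sketch with descriptive names (fixed-bandwidth Gaussian kernel density; this is illustrative, not the PR's implementation):

```python
import numpy as np


def calculate_domias_scores(
    input_records: np.ndarray,
    synthetic_records: np.ndarray,
    reference_records: np.ndarray,
    bandwidth: float = 0.5,
) -> np.ndarray:
    """DOMIAS-style scores: estimated density ratio p_synth(x) / p_ref(x)."""

    def density(points: np.ndarray, data: np.ndarray) -> np.ndarray:
        # Average of isotropic Gaussian kernels centered on each data record
        differences = points[:, None, :] - data[None, :, :]
        squared_distances = (differences**2).sum(axis=-1)
        return np.exp(-squared_distances / (2 * bandwidth**2)).mean(axis=1)

    epsilon = 1e-12  # guard against division by zero in sparse regions
    return density(input_records, synthetic_records) / (density(input_records, reference_records) + epsilon)
```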
@@ -0,0 +1,24 @@
# Possible training utilities for ensemble attacks.
Make it into a docstring, but remove "Possible".
that a data point is a member.
:param max_fpr: threshold on the FPR.
return: The TPR at `max_fpr` FPR.
Docstring format should be google style.
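A Google-style version of that docstring might look like the sketch below, attached to a pure-numpy implementation of the metric (the function body is illustrative, not the PR's code):

```python
import numpy as np


def get_tpr_at_fpr(true_labels: np.ndarray, predicted_probabilities: np.ndarray, max_fpr: float = 0.1) -> float:
    """Computes the best true positive rate achievable at a bounded false positive rate.

    Args:
        true_labels: Binary membership labels (1 = member, 0 = non-member).
        predicted_probabilities: Predicted probability that a data point is a member.
        max_fpr: Threshold on the FPR.

    Returns:
        The TPR at `max_fpr` FPR.
    """
    # Sweep thresholds by sorting scores in descending order
    order = np.argsort(-np.asarray(predicted_probabilities))
    labels = np.asarray(true_labels)[order]
    true_positives = np.cumsum(labels)
    false_positives = np.cumsum(1 - labels)
    tpr = true_positives / max(true_positives[-1], 1)
    fpr = false_positives / max(false_positives[-1], 1)
    achievable = fpr <= max_fpr
    return float(tpr[achievable].max()) if achievable.any() else 0.0
```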
Force-pushed from e93dc1e to f3c2aee
Your tests are very extensive and I think this is a good start, but we definitely want to bring in certain elements of our conventions and documentation.
]


# Training settings (placeholder)
Maybe this will become clear later, but I'm not sure what we mean by "placeholder" here and below.
For the "Training" settings, I'm assuming the numbers we've assigned are placeholder numbers. But for the "Metaclassifier" settings, I'm not sure if I want to keep the "epoch" variable because it's not used in the current implementation. Couldn't think of a better name that would capture the temporary nature of it.
I'm a little confused as to what is happening here, in that some of these files already exist in the ensemble_attack/ example folder. It seems like we're duplicating it?
I'll wait to review these files until after we get that sorted out 🙂
@@ -0,0 +1,141 @@
# Blending++ orchestrator, equivalent to blending_plus_plus.py in the submission repo
I would create a link to the repository specifically.
@@ -0,0 +1,24 @@
# Possible training utilities for ensemble attacks.
We already have a metric for this in the library to use 🙂
src/midst_toolkit/evaluation/privacy/mia_scoring.py
Ah, I see why you need this. I think we should leave it here then.
from midst_toolkit.attacks.ensemble.distance_features import calculate_domias, calculate_gower_features
from midst_toolkit.attacks.ensemble.train_utils import get_tpr_at_fpr
from midst_toolkit.attacks.ensemble.XGBoost import XGBoostHyperparameterTuner
Our file naming convention is all lower case (xgboost.py), and our class naming convention is camel case, even for acronyms. So we'd name the class XgBoostHyperparameterTuner.
Initializes the tuner with data and column information.
:param x: Input features as a DataFrame.
:param y: Target variable as a numpy array.
:param use_gpu: Whether to use GPU acceleration.
Also expand x and y to something like input_features and target_variable.
"""Groups all tests for the BlendingPlusPlus class."""

## Test __init__ ##
# ------------------
I don't think this guy above is that useful?
from hydra import compose, initialize
from omegaconf import DictConfig

# The class to be tested
I don't think we need this comment.
from midst_toolkit.attacks.ensemble.blending import BlendingPlusPlus


# --- Fixtures: Reusable setup code for tests ---
I don't think we need this comment.
with pytest.raises(ValueError, match="meta_classifier_type must be 'lr' or 'xgb'"):
    BlendingPlusPlus(data_configs=mock_data_configs, meta_classifier_type="svm")

## Test _prepare_meta_features ##
Most of these kinds of comments aren't super necessary.
Really nice implementation of the blending++ pipeline! I also double-checked your implementation against the original ensemble codebase, and it matches perfectly 🙂.
I’ve left some comments, and I’ll address the ones where you’ve tagged me.
One suggestion (feel free to ignore): to clarify the different dataframes in this pipeline, in addition to docstrings, we could also reference the descriptions and diagrams in the example README.
trans_json_file_path: ${base_example_dir}/data_configs/trans.json
population_sample_size: 40000

# Metadata for real data
Not sure if this is a good idea, but can we create a trans_metadata.json or real_metadata.json file to store this information? I am suggesting this because we have several data config files, like trans_domain.json and info.json, with similar types of metadata. We can keep this config.yaml for attack-pipeline-related configurations only. Then we can set the path to this metadata json file here, similar to trans_domain_file_path, and load it in the BlendingPlusPlus init.
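As a sketch of that suggestion (the file name and config key below are hypothetical), config.yaml would only point at the metadata file, and the "categorical" / "variable_to_predict" entries would move into it:

```yaml
# config.yaml (hypothetical key): reference the metadata instead of inlining it
real_metadata_file_path: ${base_example_dir}/data_configs/real_metadata.json
```

BlendingPlusPlus would then load real_metadata.json in its init, alongside the existing trans_domain.json handling.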
log(INFO, "Data processing pipeline finished.")
if config.pipeline.run_data_processing:
    run_data_processing(config)
elif config.pipeline.run_metaclassifier_training:
We can probably just have another if here instead of elif, since someone might want to run data processing as well as meta classifier training.
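With independent if blocks the two stages compose in a single run; a minimal sketch with a stand-in for the Hydra config (names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class PipelineFlags:
    run_data_processing: bool
    run_metaclassifier_training: bool


def run_pipeline(flags: PipelineFlags) -> list[str]:
    # Independent `if` blocks (not if/elif), so both stages can run in one invocation
    stages_run = []
    if flags.run_data_processing:
        stages_run.append("data_processing")
    if flags.run_metaclassifier_training:
        stages_run.append("metaclassifier_training")
    return stages_run
```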
Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
)

df_synth = load_dataframe(
Is this the synthetic data generated from the TabDDPM model trained on the df_real_train data saved at Path(processed_attack_data_path, "real_train.csv")? Since we still haven't added the code for that, a comment saying where this synth data comes from would be helpful.
For now, this synthetic data has come from the original repository because I just wanted to test that it would run. I did the same with the RMIA signals file, only using them as placeholders. I'll add the comment for future reference, though. Thanks!
raise ValueError("meta_classifier_type must be 'lr' or 'xgb'")
self.meta_classifier_type = meta_classifier_type
self.data_configs = data_configs
self.meta_classifier_ = None  # The trained model, underscore denotes fitted attribute
I also haven't seen this before. Interested to know if it is commonly used.
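For reference, the trailing underscore is a scikit-learn convention: attributes ending in a single underscore (coef_, classes_, etc.) are populated by fit rather than __init__, so their presence signals a fitted estimator. A toy illustration (not code from this PR):

```python
class TinyEstimator:
    """Toy estimator illustrating the scikit-learn fitted-attribute convention."""

    def __init__(self) -> None:
        # Trailing underscore signals "set by fit()", mirroring sklearn's coef_, classes_, etc.
        self.mean_: float | None = None

    def fit(self, values: list[float]) -> "TinyEstimator":
        self.mean_ = sum(values) / len(values)
        return self
```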
# 3. Get RMIA signals (placeholder)
rmia_signals = pd.read_csv(
    "examples/ensemble_attack/data/attack_data/og_rmia_train_meta_pred.csv"
I know this is temporary because we still don't have the RMIA computation pipeline, but can we get this path from config.yaml instead of fixing it here?
I'd rather keep it like this for now because I'm not passing config.yaml to blending.py and don't plan on doing it. It will be resolved soon though!
Got it! Makes sense 👍
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=10, seed=np.random.randint(1000)),
I see this is the random seed used in the original code, but as David suggested we can make it an option for the class and maybe pass here the user specified random_seed (it could be the one user sets in config.yaml).
colsample_bytree=trial.suggest_float("colsample_bylevel", 0.5, 1),
reg_alpha=trial.suggest_categorical("reg_alpha", [0, 0.1, 0.5, 1, 5, 10]),
reg_lambda=trial.suggest_categorical("reg_lambda", [0, 0.1, 0.5, 1, 5, 10, 100]),
tree_method="auto",
Out of curiosity, was there an issue with tree_method="auto" if not use_gpu else "gpu_hist"?
Nope. I wanted it to run on my local without taking up GPU space. Should've just changed use_gpu to false in config.yaml :)
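A small helper expressing that conditional (sketch; "gpu_hist" is the GPU histogram method referenced in the discussion):

```python
def choose_tree_method(use_gpu: bool) -> str:
    # Select the XGBoost tree construction algorithm based on a config flag
    return "gpu_hist" if use_gpu else "auto"
```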
.gitignore (Outdated)
# Trained metaclassifiers
examples/ensemble_attack/trained_models/
examples/ensemble_attack/attack_results/
examples/ensemble_attack_example/
I think we don't have this last one anymore
"categorical": ["trans_type", "operation", "k_symbol", "bank"]
"variable_to_predict": "trans_type"

col_type:
I think I am missing where col_type and bounds are used. Maybe adding some comments here would be helpful.
"examples/ensemble_attack_example/data/attack_data/og_rmia_train_meta_pred.csv"
)  # Placeholder for RMIA features

continuous_features = df_input.loc[
Seconding David's comment
Short Description
Clickup Ticket(s): (https://app.clickup.com/t/868fm2hg6)
Both choices of meta classifier can be trained and saved as pickle files. Predictions can also be made and evaluated using TPR@FPR. The PR includes preprocessing methods for deriving training features as well (e.g., Gower distance, the DOMIAS method).
Things to note:
Tests Added
Tests have been added for appropriate model fitting, data preprocessing, and the prediction function.