Open
54 commits
8b7a781
Data processing and shadow models
fatemetkl Aug 21, 2025
71178c0
Merged main
fatemetkl Aug 21, 2025
5e4e3d9
mypy fixes
fatemetkl Aug 21, 2025
f0a849c
Removed rmia component
fatemetkl Aug 25, 2025
7c4d47c
Added data collection, processing, and tests
fatemetkl Aug 26, 2025
4f963b3
Small fixes
fatemetkl Aug 26, 2025
7746ab3
Updated README
fatemetkl Aug 26, 2025
a812665
Merged main
fatemetkl Aug 26, 2025
a352d36
Removed some parts that should go with the next PR
fatemetkl Aug 27, 2025
bb662c7
Small fixes
fatemetkl Aug 27, 2025
ca0e2d5
Simplifying mypy legacy type check to only check .py files
fatemetkl Sep 2, 2025
ba22d92
Added an example for MIDST competition ensemble attack, updates tests
fatemetkl Sep 3, 2025
6a6b99d
Fixed docstrings
fatemetkl Sep 4, 2025
8e764e1
fix
fatemetkl Sep 4, 2025
e740b76
mypy fixes
fatemetkl Sep 4, 2025
8574d22
mypy fix
fatemetkl Sep 4, 2025
aec7458
Added hydra and omegaconf to pyproject.toml
fatemetkl Sep 4, 2025
5516989
David's comments, added a simple bash script to example
fatemetkl Sep 5, 2025
be2ac7d
Improved comments and function name
fatemetkl Sep 5, 2025
3dadde3
Updated readme with diagrams
fatemetkl Sep 5, 2025
0442cbb
Updated readme
fatemetkl Sep 5, 2025
aad598b
Updated readme
fatemetkl Sep 5, 2025
86d1e0a
Modify file structure
Sep 8, 2025
fb3036a
Modify basic file structure
Sep 8, 2025
cfa821b
Modify high-level pipeline. (Needs a lot of cleanup.)
sarakodeiri Sep 9, 2025
eb38a75
Add DOMIAS calculation (#31)
sarakodeiri Sep 9, 2025
a6c06dd
Add Gower distance (#32)
sarakodeiri Sep 10, 2025
653d9ac
Add gower to uv
sarakodeiri Sep 10, 2025
62d0c1f
Metaclassifier Training (#36)
sarakodeiri Sep 16, 2025
cdf9f4a
Add predict and TPR@FPR
sarakodeiri Sep 17, 2025
91b827e
Fix uv packages
sarakodeiri Sep 17, 2025
4f1c55c
Add tests
sarakodeiri Sep 17, 2025
da4e1f8
Minor cleanup
sarakodeiri Sep 17, 2025
b655f6f
Merge branch 'main' into sk/meta_classifier
sarakodeiri Sep 17, 2025
ad59425
Add packages
sarakodeiri Sep 17, 2025
afce376
Merge branch 'sk/meta_classifier' of https://github.com/VectorInstitu…
sarakodeiri Sep 17, 2025
f3c2aee
Fix end of file (.gitignore)
sarakodeiri Sep 18, 2025
377cb14
Merge branch 'main' into sk/meta_classifier
sarakodeiri Sep 19, 2025
1ff1ec1
Change run script format
sarakodeiri Sep 22, 2025
d26a22e
Fix XGBoost Docstrings
sarakodeiri Sep 24, 2025
421e045
Merge remote-tracking branch 'origin/main' into sk/meta_classifier
sarakodeiri Sep 24, 2025
6feaf63
Add meta classifier type enum
sarakodeiri Sep 24, 2025
02eb5ae
Resolved Marcelo's comments
sarakodeiri Sep 24, 2025
0710806
Fix tests
sarakodeiri Sep 24, 2025
5fecc68
Update
sarakodeiri Sep 24, 2025
605052b
Merge branch 'main' into sk/meta_classifier
sarakodeiri Sep 25, 2025
e116f9d
Fix build
sarakodeiri Sep 25, 2025
4f9b5b1
Remove ensemble_attack_examples
sarakodeiri Sep 25, 2025
4ec2631
Ruff fix
sarakodeiri Sep 25, 2025
a068418
Apply David's comments
sarakodeiri Sep 29, 2025
b77b802
Apply David's comments, pt. 2.
sarakodeiri Sep 29, 2025
a2837ae
Apply Fatemeh's comments
sarakodeiri Sep 29, 2025
418d65b
Remove bounds and col_type
sarakodeiri Oct 1, 2025
2c034cc
Merge branch 'main' into sk/meta_classifier
sarakodeiri Oct 1, 2025
4 changes: 4 additions & 0 deletions .gitignore
@@ -32,6 +32,10 @@ wheels/
 # Dataset files
 examples/**/data/

+# Trained metaclassifiers
+examples/ensemble_attack/trained_models/
+examples/ensemble_attack/attack_results/
+
 # hydra output
 outputs/

37 changes: 29 additions & 8 deletions examples/ensemble_attack/config.yaml
@@ -8,21 +8,27 @@ data_paths:
 midst_data_path: ${base_data_dir}/midst_data_all_attacks # Used only for reading the data
 population_path: ${base_data_dir}/population_data # Path where the population data should be stored
 processed_attack_data_path: ${base_data_dir}/attack_data # Path where the processed attack real train and evaluation data is stored
+attack_results_path: ${base_example_dir}/attack_results # Path where the attack results will be stored

+model_paths:
+shadow_models_path: ${base_example_dir}/shadow_models # Path where the shadow models are stored
+metaclassifier_model_path: ${base_example_dir}/trained_models # Path where the trained metaclassifier model will be saved
+
 # Pipeline control
 pipeline:
-run_data_processing: true
+run_data_processing: false
+run_metaclassifier_training: true

 # Dataset specific information used for processing in this example
 data_processing_config:
 collect_attack_data_types:
-[
-"tabddpm_black_box",
-"tabsyn_black_box",
-"tabddpm_white_box",
-"tabsyn_white_box",
-"clavaddpm_black_box",
-"clavaddpm_white_box",
+[
+"tabddpm_black_box",
+"tabsyn_black_box",
+"tabddpm_white_box",
+"tabsyn_white_box",
+"clavaddpm_black_box",
+"clavaddpm_white_box",
 ]
 # The column name in the data to be used for stratified splitting.
 column_to_stratify: "trans_type" # Attention: This value is not documented in the original codebase.
@@ -34,18 +40,33 @@ data_processing_config:
 single_table_train_data_file_name: "train_with_id.csv"
 multi_table_train_data_file_name: "trans.csv"
 challenge_data_file_name: "challenge_with_id.csv"

 # Data Config files path
 trans_domain_file_path: ${base_example_dir}/data_configs/trans_domain.json
 dataset_meta_file_path: ${base_example_dir}/data_configs/dataset_meta.json
 trans_json_file_path: ${base_example_dir}/data_configs/trans.json
 population_sample_size: 40000
+
+# Metadata for real data
Collaborator

Not sure if this is a good idea, but can we create a trans_metadata.json or real_metadata.json file to store this information? I am suggesting this because we already have several data config files, like trans_domain.json and info.json, that hold similar metadata. We could then keep config.yaml for attack-pipeline configuration only, set the path to the metadata JSON file here (similar to trans_domain_file_path), and load it in the BlendingPlusPlus init.
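
A minimal sketch of the suggestion above, assuming a hypothetical `real_metadata.json` file (one of the two names floated in the comment) and a hypothetical `load_metadata` helper. Neither exists in the toolkit as far as this diff shows; the sketch only illustrates moving the `metadata` block out of `config.yaml` and reading it from a path:

```python
import json
import tempfile
from pathlib import Path
from typing import Any

# Keys taken from the `metadata` block currently kept in config.yaml.
REQUIRED_KEYS = {"numerical", "categorical", "variable_to_predict"}


def load_metadata(metadata_file_path: Path) -> dict[str, Any]:
    """Hypothetical helper: read column metadata from a JSON file."""
    with open(metadata_file_path, encoding="utf-8") as f:
        metadata = json.load(f)
    # Fail early if an expected key is missing from the file.
    missing = REQUIRED_KEYS - set(metadata)
    if missing:
        raise KeyError(f"Missing metadata keys: {sorted(missing)}")
    return metadata


# Round trip through a temporary file to show the intended usage.
example_metadata = {
    "numerical": ["trans_date", "amount", "balance", "account"],
    "categorical": ["trans_type", "operation", "k_symbol", "bank"],
    "variable_to_predict": "trans_type",
}
with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = Path(tmp_dir) / "real_metadata.json"  # hypothetical file name
    metadata_path.write_text(json.dumps(example_metadata), encoding="utf-8")
    metadata = load_metadata(metadata_path)

print(metadata["variable_to_predict"])
```

In the pipeline, the path would presumably come from a new config key next to `trans_domain_file_path`, and the resulting dictionary would be passed where `config.data_configs` is passed today. That wiring is an assumption, not something shown in this PR.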

+data_configs:
+metadata:
+"numerical": ["trans_date", "amount", "balance", "account"]
+"categorical": ["trans_type", "operation", "k_symbol", "bank"]
+"variable_to_predict": "trans_type"
+
+
 # Training settings (placeholder)
 shadow_training:
 epochs: 10
 learning_rate: 0.001
 batch_size: 64
 model_type: "tabddpm"

+# Metaclassifier settings (placeholder)
+metaclassifier:
+model_type: "xgb"
+use_gpu: true
+epochs: 1
+
 # General settings
 random_seed: 42
4 changes: 2 additions & 2 deletions examples/ensemble_attack/run.sh
@@ -6,8 +6,8 @@ source .venv/bin/activate
 echo "Active Environment:"
 which python

-echo Experiments Launched
+echo "Experiments Launched"

 python -m examples.ensemble_attack.run_attack

-echo Experiments Completed
+echo "Experiments Completed"
142 changes: 122 additions & 20 deletions examples/ensemble_attack/run_attack.py
@@ -3,44 +3,146 @@
 provided resources and data.
 """

+import pickle
+from datetime import datetime
 from logging import INFO
 from pathlib import Path

 import hydra
+import numpy as np
 from omegaconf import DictConfig

 from examples.ensemble_attack.real_data_collection import collect_population_data_ensemble
+from midst_toolkit.attacks.ensemble.blending import BlendingPlusPlus, MetaClassifierType
+from midst_toolkit.attacks.ensemble.data_utils import load_dataframe
 from midst_toolkit.attacks.ensemble.process_split_data import process_split_data
 from midst_toolkit.common.logger import log


+def run_data_processing(config: DictConfig) -> None:
+    """
+    Function to run the data processing pipeline.
+    Args:
+        config: Configuration object set in config.yaml.
+    """
+    log(INFO, "Running data processing pipeline...")
+    # Collect the real data from the MIDST challenge resources.
+    population_data = collect_population_data_ensemble(
+        midst_data_input_dir=Path(config.data_paths.midst_data_path),
+        data_processing_config=config.data_processing_config,
+        save_dir=Path(config.data_paths.population_path),
+    )
+    # The following function saves the required dataframe splits in the specified processed_attack_data_path path.
+    process_split_data(
+        all_population_data=population_data,
+        processed_attack_data_path=Path(config.data_paths.processed_attack_data_path),
+        # TODO: column_to_stratify value is not documented in the original codebase.
Collaborator

I believe this is not true anymore? I see docstrings in that function right now.

Collaborator Author

I think by "original codebase", @fatemetkl meant the submission repository (link), but I'm not sure what the "TODO" is for. My guess is we test things with stratified columns specified.

Collaborator

Hi,
The "original codebase" in this comment refers to the attack submission. I should have added a link. So sorry for the confusion! I will fix it in my PR.

As Sara mentioned, since this parameter wasn’t documented in the original attack codebase, I added a TODO to experiment with other columns in case the one I specified isn’t the correct one.
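
To make the experiment described in this thread concrete, here is a small standard-library sketch of what stratifying a split on one column means. It is not the logic of `process_split_data`, whose body is not part of this diff; `stratified_split` is a hypothetical helper and the `trans_type` values in the toy rows are made up:

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, column_to_stratify, test_fraction, random_seed):
    """Split rows into (train, test), sampling each stratum separately."""
    # Group rows by the value of the stratification column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[column_to_stratify]].append(row)

    rng = random.Random(random_seed)
    train, test = [], []
    # Iterate in sorted key order so the result is reproducible for a given seed.
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy rows; the "trans_type" values here are made up for illustration.
rows = [{"id": i, "trans_type": "credit"} for i in range(8)]
rows += [{"id": i, "trans_type": "debit"} for i in range(8, 12)]

train, test = stratified_split(rows, "trans_type", test_fraction=0.25, random_seed=42)
print(Counter(r["trans_type"] for r in train))
print(Counter(r["trans_type"] for r in test))
```

Trying another candidate column then only means changing the `column_to_stratify` argument and comparing the per-class counts of the resulting splits.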

+        column_to_stratify=config.data_processing_config.column_to_stratify,
+        num_total_samples=config.data_processing_config.population_sample_size,
+        random_seed=config.random_seed,
+    )
+    log(INFO, "Data processing pipeline finished.")


+def run_metaclassifier_training(config: DictConfig) -> None:
+    """
+    Function to run the metaclassifier training and evaluation.
+    Args:
+        config: Configuration object set in config.yaml.
+    """
+    log(INFO, "Running metaclassifier training...")
+    # Load the processed data splits.
+    df_meta_train = load_dataframe(
+        Path(config.data_paths.processed_attack_data_path),
+        "master_challenge_train.csv",
+    )
+    y_meta_train = np.load(
+        Path(config.data_paths.processed_attack_data_path) / "master_challenge_train_labels.npy",
+    )
+    df_meta_test = load_dataframe(
+        Path(config.data_paths.processed_attack_data_path),
+        "master_challenge_test.csv",
+    )
+    y_meta_test = np.load(
+        Path(config.data_paths.processed_attack_data_path) / "master_challenge_test_labels.npy",
+    )
+
+    # Synthetic data borrowed from the attack implementation repository.
+    # TODO: Change this file path to the path where the synthetic data is stored.
+    df_synthetic = load_dataframe(
+        Path(config.data_paths.processed_attack_data_path),
+        "synth.csv",
+    )
+
+    df_reference = load_dataframe(
+        Path(config.data_paths.population_path),
+        "population_all_with_challenge_no_id.csv",
+    )
+
+    # Fit the metaclassifier.
+    meta_classifier_enum = MetaClassifierType(config.metaclassifier.model_type)
+
+    # 1. Initialize the attacker
+    blending_attacker = BlendingPlusPlus(
+        data_configs=config.data_configs, meta_classifier_type=meta_classifier_enum, random_seed=config.random_seed
+    )
+    log(INFO, f"{meta_classifier_enum} created with random seed {config.random_seed}, starting training...")
+
+    # 2. Train the attacker on the meta-train set
+
+    blending_attacker.fit(
+        df_train=df_meta_train,
+        y_train=y_meta_train,
+        df_synthetic=df_synthetic,
+        df_reference=df_reference,
+        use_gpu=config.metaclassifier.use_gpu,
+        epochs=config.metaclassifier.epochs,
+    )
+
+    log(INFO, "Metaclassifier training finished.")
+
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    model_filename = f"{timestamp}_{config.metaclassifier.model_type}_trained_metaclassifier.pkl"
+    with open(Path(config.model_paths.metaclassifier_model_path) / model_filename, "wb") as f:
+        pickle.dump(blending_attacker.trained_model, f)
+
+    log(INFO, "Metaclassifier model saved, starting evaluation...")
+
+    # 3. Get predictions on the test set
+    probabilities, pred_score = blending_attacker.predict(
+        df_test=df_meta_test,
+        df_synthetic=df_synthetic,
+        df_reference=df_reference,
+        y_test=y_meta_test,
+    )
+
+    # Save the prediction probabilities
+    np.save(
+        Path(config.data_paths.attack_results_path)
+        / f"{timestamp}_{config.metaclassifier.model_type}_test_pred_proba.npy",
+        probabilities,
+    )
+    log(INFO, "Test set prediction probabilities saved.")
+
+    if pred_score is not None:
+        log(INFO, f"TPR at FPR=0.1: {pred_score:.4f}")


 @hydra.main(config_path=".", config_name="config", version_base=None)
-def main(cfg: DictConfig) -> None:
+def main(config: DictConfig) -> None:
     """
     Run the Ensemble Attack example pipeline.
     As the first step, data processing is done.
     Args:
-        cfg: Attack OmegaConf DictConfig object.
+        config: Attack configuration as an OmegaConf DictConfig object.
     """
-    if cfg.pipeline.run_data_processing:
-        log(INFO, "Running data processing pipeline...")
-        # Collect the real data from the MIDST challenge resources.
-        population_data = collect_population_data_ensemble(
-            midst_data_input_dir=Path(cfg.data_paths.midst_data_path),
-            data_processing_config=cfg.data_processing_config,
-            save_dir=Path(cfg.data_paths.population_path),
-        )
-        # The following function saves the required dataframe splits in the specified processed_attack_data_path path.
-        process_split_data(
-            all_population_data=population_data,
-            processed_attack_data_path=Path(cfg.data_paths.processed_attack_data_path),
-            # TODO: column_to_stratify value is not documented in the original codebase.
-            column_to_stratify=cfg.data_processing_config.column_to_stratify,
-            num_total_samples=cfg.data_processing_config.population_sample_size,
-            random_seed=cfg.random_seed,
-        )
-        log(INFO, "Data processing pipeline finished.")
+    if config.pipeline.run_data_processing:
+        run_data_processing(config)
+    if config.pipeline.run_metaclassifier_training:
+        run_metaclassifier_training(config)


if __name__ == "__main__":
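
The evaluation step in `run_attack.py` logs the score returned by `predict` as "TPR at FPR=0.1". How `BlendingPlusPlus` computes it is not visible in this diff, so the following is only a standard-library sketch of one common definition: the highest true positive rate among thresholds whose false positive rate does not exceed the limit. The helper name `tpr_at_fpr` and the toy labels and scores are illustrative assumptions:

```python
def tpr_at_fpr(y_true, scores, max_fpr=0.1):
    """Return the highest TPR over thresholds whose FPR is at most max_fpr."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present to compute TPR and FPR.")

    # Sweep the decision threshold from the highest score downwards.
    pairs = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Consume all samples sharing this score, since one threshold covers them all.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        # FPR never decreases as the threshold drops, so stop once it is too high.
        if fp / n_neg > max_fpr:
            break
        best_tpr = max(best_tpr, tp / n_pos)
        i = j
    return best_tpr


# Toy membership labels (1 = member) and attack scores, made up for illustration.
y_true = [1, 1, 1, 1] + [0] * 10
scores = [0.9, 0.8, 0.7, 0.2, 0.75, 0.6, 0.5, 0.4, 0.3, 0.1, 0.1, 0.05, 0.05, 0.01]

print(tpr_at_fpr(y_true, scores, max_fpr=0.1))
```

A library implementation may interpolate along the ROC curve instead, which can give slightly different values on small test sets.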
3 changes: 3 additions & 0 deletions mypy.ini
@@ -36,3 +36,6 @@ ignore_missing_imports = True

 [mypy-category_encoders.*]
 ignore_missing_imports = True
+
+[mypy-gower.*]
+ignore_missing_imports = True
Collaborator

Why do we need this?

Collaborator Author

I kept getting a "Skipping analyzing "gower": module is installed, but missing library stubs or py.typed" error, and no stub files are available.

8 changes: 6 additions & 2 deletions pyproject.toml
@@ -7,7 +7,8 @@ authors = [ {name = "Vector AI Engineering", email = "ai_engineering@vectorinsti
 license = "MIT"
 repository = "https://github.com/VectorInstitute/midst-toolkit"
 requires-python = ">=3.12"
-dependencies = []
+dependencies = [
+]

 [build-system]
 requires = ["hatchling"]
@@ -37,7 +38,10 @@ dev = [
 "opacus<=1.4.0",
 "syntheval>=1.6.2",
 "hydra-core>=1.3.2",
-"omegaconf>=2.3.0"
+"omegaconf>=2.3.0",
+"gower>=0.1.2",
+"optuna",
+"xgboost"
 ]
 docs = [
 "jinja2>=3.1.6", # Pinning version to address vulnerability GHSA-cpwx-vrp4-4pq7