We propose the Active Data Reconstruction Attack (ADRA), a family of membership inference attacks (MIAs) that actively induces a model to reconstruct a given text through training. ADRA is the first active MIA, improving over passive MIAs across all stages of model training. Our results suggest that model weights encode more about training data than previous methods reveal.
This repository contains three main components:
- `adra/`: Core library for membership inference attacks and reconstruction evaluation. Implements standard MIA baselines (Loss, Zlib, Min-K, Min-K++, Reference), comprehensive reconstruction metrics (lexical, embedding, LLM-as-judge), LLM-based dataset paraphrasing, and controlled contamination & model distillation.
- `verl/`: RL training code based on verl for RL with GRPO, reconstruction rewards, and contrastive rewards.
  - `verl/examples/data_preprocess/`: Prepares candidate data pools (e.g. BookMIA, AIME) into RL-ready training data.
  - `verl/verl/utils/reward_score/`: Reward functions including lexical reconstruction, embedding similarity, and LLM-as-judge rewards with contrastive reward formulation.
- `scripts/`: Scripts to process data, run baselines, launch ADRA (RL) training, and evaluate MIA & reconstruction performance.
For detailed setup and usage instructions, see the README files in each subdirectory.
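For intuition on what a lexical reconstruction reward measures, here is a rough sketch, not the repository's actual implementation in `verl/verl/utils/reward_score/`: score a rollout by the fraction of target tokens it recovers, using a token-level longest common subsequence.

```python
def lexical_reconstruction_reward(generated: str, target: str) -> float:
    """Fraction of target tokens recovered by the rollout, via token-level LCS.

    Illustrative sketch only; the repo's reward functions may differ.
    """
    g, t = generated.split(), target.split()
    # Dynamic-programming table for LCS length between the two token sequences.
    dp = [[0] * (len(t) + 1) for _ in range(len(g) + 1)]
    for i, gw in enumerate(g):
        for j, tw in enumerate(t):
            if gw == tw:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(g)][len(t)] / max(len(t), 1)
```

A contrastive formulation would additionally penalize rewards earned on non-member texts, so the policy is pushed toward reconstructing members specifically.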
All training and evaluation runs were done on a single node with 8 H200 GPUs. Hyperparameters in the scripts may need adjusting for your hardware.
- NVIDIA GPU(s) with CUDA 12.x compatible drivers
- Conda (Miniconda or Anaconda)
- GCC and CUDA toolkit accessible via your system or module manager
We provide two environment configs. Our paper uses both to support multiple models, so we recommend setting up both environments for reproduction.
Note that different vLLM, torch, and transformers versions can produce slightly different outputs due to changes in CUDA kernels, model implementations, and scheduling optimizations. We recommend fixing one environment to run any given dataset.
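Since these version differences affect reproducibility, it can help to record exactly which versions a run used. A small helper like this (hypothetical, not part of the repo) prints the installed versions of the relevant packages:

```python
from importlib.metadata import version, PackageNotFoundError

def report_versions(packages):
    """Return {package_name: installed version, or None if not installed}."""
    out = {}
    for name in packages:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = None
    return out

if __name__ == "__main__":
    # Packages whose versions affect sampling reproducibility (see note above).
    for pkg, ver in report_versions(["torch", "vllm", "transformers"]).items():
        print(f"{pkg}: {ver or 'not installed'}")
```

Logging this output alongside each run makes it easy to check that two runs used the same environment.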
Environment with the latest model support (OLMo 3). Used for OLMo 3, distillation, ablations, and some ADRA+ experiments in the paper.
```bash
conda create -n adra-v1 python=3.10
conda activate adra-v1
git clone https://github.com/oseyosey/MIA-RL.git
cd MIA-RL
bash adra_v1_setup.sh
```

Note: Before running the setup script, open it and update the system-specific lines at the top (GCC/CUDA module names, `CUDA_HOME` path, and conda path) to match your system/cluster. See `requirements.txt` for the full list of pinned package versions.
Older environment used for most pre-training and post-training ADRA and ADRA+ experiments in the paper. We found that different vLLM versions can produce slightly different sampling results, so we keep this environment available for reference and reproduction.
```bash
conda create -n adra-v0 python=3.10
conda activate adra-v0
git clone https://github.com/oseyosey/MIA-RL.git
cd MIA-RL
bash adra_v0_setup.sh
```

Below we walk through the AIME post-training pipeline as a quick-start example. See `scripts/README.md` for the full step-by-step guide and per-script documentation.
- **Prepare data** -- Build the MIA training parquet (member/non-member splits, lexical reward profiles, optional MIA weighting for ADRA+):

  ```bash
  bash scripts/post-training/aime/prepare_aime_mia_data_lexical_adra.sh       # ADRA
  bash scripts/post-training/aime/prepare_aime_mia_data_lexical_adra-plus.sh  # ADRA+
  ```
- **Launch RL training** -- GRPO with lexical reward (Slurm):

  ```bash
  sbatch scripts/post-training/aime/submit_run_aime_adra_original_lora_h200_8.sh
  # or run directly:
  bash scripts/post-training/aime/submit_run_aime_adra_original_lora_h200_8.sh
  ```
Datasets and models are released at huggingface.co/ADRA-RL. You may also skip training and directly download the checkpoints for evaluation.
- **MIA baselines** -- Run standard attacks (loss, zlib, min-k, min-k++, ref) on the SFT model:

  ```bash
  bash scripts/post-training/aime/run_mia_aime_original_baselines.sh
  ```

- **N-sampling eval** -- Generate `n` samples from the SFT model and compute lexical MIA metrics:

  ```bash
  bash scripts/post-training/aime/run_mia_aime_n-sampling_eval.sh
  ```

- **RL checkpoint eval** -- Merge a LoRA checkpoint into the base model, generate, and evaluate:
  - Full sweep (loops over global steps): `run_mia_aime_adra_rl_eval_full.sh`
  - Quick eval (single HF checkpoint): `run_mia_aime_adra_rl_eval_quick.sh`
We provide three dataset-agnostic boilerplate scripts at `scripts/` that you can copy and fill in for a new dataset:

| Script | What it does |
|---|---|
| `run_mia_baselines.sh` | Run MIA baseline attacks on any member/non-member split |
| `run_mia_n-sampling_eval.sh` | Generate samples and compute lexical MIA metrics |
| `run_mia_rl_eval_quick.sh` | End-to-end: merge LoRA, generate, and evaluate |
Each contains TODO placeholders for paths and model IDs. See scripts/README.md for details on what to fill in.
We use vanilla GRPO throughout the paper. Recent work such as DR GRPO, DAPO, and Precision-RL has identified failure modes of vanilla GRPO, including training collapse, training-inference mismatch, and instability, and has proposed several remedies. We experienced some of these issues during our experiments but did not have time to address them all. These improvements could be readily incorporated into ADRA to further boost reconstruction and MIA performance.
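For context on where these failure modes arise: vanilla GRPO computes advantages by normalizing each rollout's reward within its group, and the division by the group's standard deviation is one of the steps DR GRPO questions. A minimal sketch of that normalization (not the verl implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-normalized advantages as in vanilla GRPO: for a group of rollouts
    from the same prompt, subtract the group mean reward and divide by the
    group standard deviation. Illustrative sketch only."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Degenerate group: all rollouts scored the same, so no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

The zero-std case illustrates one instability: when every rollout in a group gets the same reward (common for hard or trivial prompts), the group contributes no gradient signal at all.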
We leave exploration of better and more robust RL algorithms to future work. Feel free to try them and open a PR for us to merge.
If you find our work useful, please cite:
```bibtex
@article{yin2026learning,
  title={Learning to Detect Language Model Training Data via Active Reconstruction},
  author={Yin, Oscar Junjie and Morris, John X. and Shmatikov, Vitaly and Min, Sewon and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2602.19020},
  year={2026}
}
```

If you have any questions, contact Oscar or open a GitHub issue.
