Learning to Detect Language Model Training Data via Active Reconstruction


Paper | Data & Models

We propose the Active Data Reconstruction Attack (ADRA), a family of membership inference attacks (MIAs) that actively induces a model to reconstruct a given text through training. ADRA is the first active MIA and improves over passive MIAs across all stages of model training. Our results suggest that model weights encode more information about training data than previous methods reveal.

Overview

This repository contains three main components:

  • adra/: Core library for membership inference attacks and reconstruction evaluation. Implements standard MIA baselines (Loss, Zlib, Min-K, Min-K++, Reference), comprehensive reconstruction metrics (lexical, embedding, LLM-as-judge), LLM-based dataset paraphrasing, and controlled contamination & model distillation.

  • verl/: RL training code built on verl, supporting GRPO, reconstruction rewards, and contrastive rewards.

  • scripts/: Scripts to process data, run baselines, launch ADRA (RL) training, and evaluate MIA & reconstruction performance.

For detailed setup and usage instructions, see the README files in each subdirectory.
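As a rough illustration of the MIA baselines listed above, here is a minimal sketch of three of the scoring functions (Loss, Zlib, Min-K). The function names and exact normalizations are our assumptions for illustration, not the library's actual API:

```python
import zlib

def loss_score(token_logprobs):
    # Loss attack: negative mean token log-likelihood.
    # Lower loss suggests the text is more likely a training member.
    return -sum(token_logprobs) / len(token_logprobs)

def zlib_score(text, token_logprobs):
    # Zlib attack: calibrate the loss by the text's zlib-compressed length,
    # discounting strings that are trivially predictable.
    return loss_score(token_logprobs) / len(zlib.compress(text.encode("utf-8")))

def min_k_score(token_logprobs, k=0.2):
    # Min-K% attack: average log-probability over the k% lowest-probability
    # tokens; members tend to have fewer surprisingly unlikely tokens.
    n = max(1, int(len(token_logprobs) * k))
    return sum(sorted(token_logprobs)[:n]) / n
```

In practice each score is computed per candidate text and thresholded (or ranked) to decide membership.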

Set-up

All training and evaluation runs were done on a single node with 8 H200s. Hyperparameters in the scripts may need adjusting for your hardware.

Prerequisites

  • NVIDIA GPU(s) with CUDA 12.x compatible drivers
  • Conda (Miniconda or Anaconda)
  • GCC and CUDA toolkit accessible via your system or module manager

Installation

We provide two environment configs. Our paper uses both to support multiple models, so we recommend setting up both environments for reproduction.

Note that different vLLM, torch, and transformers versions can produce slightly different outputs due to changes in CUDA kernels, model implementations, and scheduling optimizations. We recommend fixing one environment for any given dataset.

adra-v1 — vLLM 0.11.0

Environment with the latest model support (OLMo 3). Used for OLMo 3, distillation, ablations, and some ADRA+ experiments in the paper.

conda create -n adra-v1 python=3.10
conda activate adra-v1

git clone https://github.com/oseyosey/MIA-RL.git
cd MIA-RL

bash adra_v1_setup.sh

Note: Before running the setup script, open it and update the system-specific lines at the top (GCC/CUDA module names, CUDA_HOME path, and conda path) to match your system / cluster. See requirements.txt for the full list of pinned package versions.

adra-v0 — vLLM 0.8.5.post1

Older environment used for most pre-training and post-training ADRA and ADRA+ experiments in the paper. Different vLLM versions can produce slightly different sampling results, so we keep this environment available for reference and reproduction.

conda create -n adra-v0 python=3.10
conda activate adra-v0

git clone https://github.com/oseyosey/MIA-RL.git
cd MIA-RL

bash adra_v0_setup.sh

ADRA Usage

Below we walk through the AIME post-training pipeline as a quick-start example. See scripts/README.md for the full step-by-step guide and per-script documentation.

Training

  1. Prepare data -- Build the MIA training parquet (member/non-member splits, lexical reward profiles, optional MIA weighting for ADRA+):

    bash scripts/post-training/aime/prepare_aime_mia_data_lexical_adra.sh        # ADRA
    bash scripts/post-training/aime/prepare_aime_mia_data_lexical_adra-plus.sh    # ADRA+
  2. Launch RL training (GRPO with lexical reward, Slurm):

    sbatch scripts/post-training/aime/submit_run_aime_adra_original_lora_h200_8.sh
    # or bash
    bash scripts/post-training/aime/submit_run_aime_adra_original_lora_h200_8.sh

Datasets and models are released at huggingface.co/ADRA-RL. You may also skip training and directly download the checkpoints for evaluation.
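The parquet built in step 1 pairs each candidate text with a membership label and a reconstruction prompt. A minimal sketch of such a layout is below; the field names, prompt template, and helper are hypothetical, not the repository's actual schema:

```python
def build_mia_rows(members, non_members,
                   prompt_template="Reconstruct the following text: {prefix}"):
    """Build rows for an MIA training set: each row carries the target text,
    a reconstruction prompt built from its prefix, and a membership label.
    Field names here are illustrative, not the repo's actual schema."""
    rows = []
    for label, texts in ((1, members), (0, non_members)):
        for text in texts:
            # Prompt with the first half of the text; the model must
            # reconstruct the remainder during RL training.
            prefix = text[: len(text) // 2]
            rows.append({
                "prompt": prompt_template.format(prefix=prefix),
                "target": text,
                "is_member": label,
            })
    return rows

# Writing to parquet would require pandas + pyarrow, e.g.:
# import pandas as pd
# pd.DataFrame(build_mia_rows(members, non_members)).to_parquet("mia_train.parquet")
```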

Evaluation

  1. MIA baselines -- Run standard attacks (loss, zlib, min-k, min-k++, ref) on the SFT model:

    bash scripts/post-training/aime/run_mia_aime_original_baselines.sh
  2. N-sampling eval -- Generate n samples from the SFT model and compute lexical MIA metrics:

    bash scripts/post-training/aime/run_mia_aime_n-sampling_eval.sh
  3. RL checkpoint eval -- Merge a LoRA checkpoint into the base model, generate, and evaluate:

    • Full sweep (loops over global steps): run_mia_aime_adra_rl_eval_full.sh
    • Quick eval (single HF checkpoint): run_mia_aime_adra_rl_eval_quick.sh
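MIA evaluations like the above are typically summarized by AUROC over per-example scores from the member and non-member splits. A self-contained sketch of that aggregation (assuming higher score means "more likely a member"; not the repository's actual evaluation code):

```python
def mia_auroc(member_scores, non_member_scores):
    # AUROC via pairwise comparison: the probability that a randomly chosen
    # member outranks a randomly chosen non-member. Ties count as 0.5.
    pairs = len(member_scores) * len(non_member_scores)
    wins = sum(
        1.0 if m > n else 0.5 if m == n else 0.0
        for m in member_scores
        for n in non_member_scores
    )
    return wins / pairs
```

A score of 0.5 means the attack is no better than chance; 1.0 means perfect separation of members from non-members.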

Adapting to your own dataset

We provide three dataset-agnostic boilerplate scripts at scripts/ that you can copy and fill in for a new dataset:

Script                          What it does
run_mia_baselines.sh            Run MIA baseline attacks on any member/non-member split
run_mia_n-sampling_eval.sh      Generate samples and compute lexical MIA metrics
run_mia_rl_eval_quick.sh        End-to-end: merge LoRA, generate, and evaluate

Each contains TODO placeholders for paths and model IDs. See scripts/README.md for details on what to fill in.

Future Works & Discussions

We use vanilla GRPO throughout the paper. Recent work such as Dr. GRPO, DAPO, and Precision-RL has identified failure modes of vanilla GRPO, including training collapse, training-inference mismatch, and instability, and proposed several remedies. We experienced some of these issues during our experiments but did not have time to address all of them. These improvements could be readily incorporated into ADRA to further boost reconstruction and MIA performance.

We leave exploration of better and more robust RL algorithms to future work. Feel free to experiment with them and open a PR for us to merge.
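To make the collapse failure mode concrete, here is a minimal sketch of vanilla GRPO's group-normalized advantage (a standard formulation, not this repository's implementation): when every completion in a group earns the same reward, the standard deviation vanishes and all advantages go to zero, stalling learning.

```python
import math

def grpo_advantages(group_rewards, eps=1e-6):
    # Vanilla GRPO: normalize each sampled completion's reward by the
    # group's mean and standard deviation. If all rewards in the group
    # are equal, std -> 0 and every advantage collapses to ~0.
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Remedies in the works cited above adjust exactly this normalization (e.g., dropping the std division) or the clipping around it.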

Citation and Contact

If you find our work useful, please cite:

@article{yin2026learning,
  title={Learning to Detect Language Model Training Data via Active Reconstruction},
  author={Yin, Oscar Junjie and Morris, John X. and Shmatikov, Vitaly and Min, Sewon and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2602.19020},
  year={2026}
}

If you have any questions, you can contact Oscar or open a GitHub issue.
