Learning to Detect Language Model Training Data via Active Reconstruction


Paper | Data & Models

We propose the Active Data Reconstruction Attack (ADRA), a family of membership inference attacks (MIAs) that actively induces a model to reconstruct a given text through training. ADRA is the first active MIA and improves over passive MIAs across all stages of model training. Our results suggest that model weights encode more information about training data than previous methods reveal.

Overview

This repository contains three main components:

  • adra/: Core library for membership inference attacks and reconstruction evaluation. Implements standard MIA baselines (Loss, Zlib, Min-K, Min-K++, Reference), comprehensive reconstruction metrics (lexical, embedding, LLM-as-judge), LLM-based dataset paraphrasing, and controlled contamination & model distillation.

  • verl/: RL training code built on verl, supporting GRPO, reconstruction rewards, and contrastive rewards.

  • scripts/: Scripts to process data, run baselines, launch ADRA (RL) training, and evaluate MIA & reconstruction performance.

For detailed setup and usage instructions, see the README files in each subdirectory.
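As a rough illustration of the MIA baselines listed above, here is a minimal sketch of three of the scoring functions (Loss, Zlib, Min-K). The function names and exact normalizations are our assumptions for illustration, not the library's actual API:

```python
import zlib

def loss_score(token_logprobs):
    # Loss attack: negative mean token log-likelihood.
    # Lower loss suggests the text is more likely a training member.
    return -sum(token_logprobs) / len(token_logprobs)

def zlib_score(text, token_logprobs):
    # Zlib attack: calibrate the loss by the text's zlib-compressed length,
    # discounting strings that are trivially predictable.
    return loss_score(token_logprobs) / len(zlib.compress(text.encode("utf-8")))

def min_k_score(token_logprobs, k=0.2):
    # Min-K% attack: average log-probability over the k% lowest-probability
    # tokens; members tend to have fewer surprisingly unlikely tokens.
    n = max(1, int(len(token_logprobs) * k))
    return sum(sorted(token_logprobs)[:n]) / n
```

In practice each score is computed per candidate text and thresholded (or ranked) to decide membership.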

Set-up

All training and evaluation runs were done on a single node with 8 H200s. Hyperparameters in the scripts may need adjusting for your hardware.

Prerequisites

  • NVIDIA GPU(s) with CUDA 12.x compatible drivers
  • Conda (Miniconda or Anaconda)
  • GCC and CUDA toolkit accessible via your system or module manager

Installation

We provide two environment configs. Our paper uses both to support multiple models, so we recommend setting up both environments for reproduction.

Note that different vLLM, torch, and transformers versions can produce slightly different outputs due to changes in CUDA kernels, model implementations, and scheduling optimizations. We recommend fixing one environment for any given dataset.

adra-v1 — vLLM 0.11.0

Environment with the latest model support (OLMo 3). Used for OLMo 3, distillation, ablations, and some ADRA+ experiments in the paper.

conda create -n adra-v1 python=3.10
conda activate adra-v1

git clone https://github.com/oseyosey/MIA-RL.git
cd MIA-RL

bash adra_v1_setup.sh

Note: Before running the setup script, open it and update the system-specific lines at the top (GCC/CUDA module names, CUDA_HOME path, and conda path) to match your system / cluster. See requirements.txt for the full list of pinned package versions.

adra-v0 — vLLM 0.8.5.post1

Older environment used for most pre-training and post-training ADRA and ADRA+ experiments in the paper. Different vLLM versions can produce slightly different sampling results, so we keep this environment available for reference and reproduction.

conda create -n adra-v0 python=3.10
conda activate adra-v0

git clone https://github.com/oseyosey/MIA-RL.git
cd MIA-RL

bash adra_v0_setup.sh

ADRA Usage

Below we walk through the AIME post-training pipeline as a quick-start example. See scripts/README.md for the full step-by-step guide and per-script documentation.

Training

  1. Prepare data -- Build the MIA training parquet (member/non-member splits, lexical reward profiles, optional MIA weighting for ADRA+):

    bash scripts/post-training/aime/prepare_aime_mia_data_lexical_adra.sh        # ADRA
    bash scripts/post-training/aime/prepare_aime_mia_data_lexical_adra-plus.sh    # ADRA+
  2. Launch RL training (GRPO with lexical reward, Slurm):

    sbatch scripts/post-training/aime/submit_run_aime_adra_original_lora_h200_8.sh
    # or bash
    bash scripts/post-training/aime/submit_run_aime_adra_original_lora_h200_8.sh

Datasets and models are released at huggingface.co/ADRA-RL. You may also skip training and directly download the checkpoints for evaluation.
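The parquet built in step 1 pairs each candidate text with a membership label and a reconstruction prompt. A minimal sketch of such a layout is below; the field names, prompt template, and helper are hypothetical, not the repository's actual schema:

```python
def build_mia_rows(members, non_members,
                   prompt_template="Reconstruct the following text: {prefix}"):
    """Build rows for an MIA training set: each row carries the target text,
    a reconstruction prompt built from its prefix, and a membership label.
    Field names here are illustrative, not the repo's actual schema."""
    rows = []
    for label, texts in ((1, members), (0, non_members)):
        for text in texts:
            # Prompt with the first half of the text; the model must
            # reconstruct the remainder during RL training.
            prefix = text[: len(text) // 2]
            rows.append({
                "prompt": prompt_template.format(prefix=prefix),
                "target": text,
                "is_member": label,
            })
    return rows

# Writing to parquet would require pandas + pyarrow, e.g.:
# import pandas as pd
# pd.DataFrame(build_mia_rows(members, non_members)).to_parquet("mia_train.parquet")
```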

Evaluation

  1. MIA baselines -- Run standard attacks (loss, zlib, min-k, min-k++, ref) on the SFT model:

    bash scripts/post-training/aime/run_mia_aime_original_baselines.sh
  2. N-sampling eval -- Generate n samples from the SFT model and compute lexical MIA metrics:

    bash scripts/post-training/aime/run_mia_aime_n-sampling_eval.sh
  3. RL checkpoint eval -- Merge a LoRA checkpoint into the base model, generate, and evaluate:

    • Full sweep (loops over global steps): run_mia_aime_adra_rl_eval_full.sh
    • Quick eval (single HF checkpoint): run_mia_aime_adra_rl_eval_quick.sh
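MIA evaluations like the above are typically summarized by AUROC over per-example scores from the member and non-member splits. A self-contained sketch of that aggregation (assuming higher score means "more likely a member"; not the repository's actual evaluation code):

```python
def mia_auroc(member_scores, non_member_scores):
    # AUROC via pairwise comparison: the probability that a randomly chosen
    # member outranks a randomly chosen non-member. Ties count as 0.5.
    pairs = len(member_scores) * len(non_member_scores)
    wins = sum(
        1.0 if m > n else 0.5 if m == n else 0.0
        for m in member_scores
        for n in non_member_scores
    )
    return wins / pairs
```

A score of 0.5 means the attack is no better than chance; 1.0 means perfect separation of members from non-members.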

Adapting to your own dataset

We provide three dataset-agnostic boilerplate scripts at scripts/ that you can copy and fill in for a new dataset:

Script                          What it does
run_mia_baselines.sh            Run MIA baseline attacks on any member/non-member split
run_mia_n-sampling_eval.sh      Generate samples and compute lexical MIA metrics
run_mia_rl_eval_quick.sh        End-to-end: merge LoRA, generate, and evaluate

Each contains TODO placeholders for paths and model IDs. See scripts/README.md for details on what to fill in.

Future Works & Discussions

We use vanilla GRPO throughout the paper. Recent work such as Dr. GRPO, DAPO, and Precision-RL has identified failure modes of vanilla GRPO, including training collapse, training-inference mismatch, and instability, and proposed several remedies. We experienced some of these issues during our experiments but did not have time to address all of them. These improvements could be readily incorporated into ADRA to further boost reconstruction and MIA performance.

We leave exploration of better and more robust RL algorithms to future work. Feel free to experiment with them and open a PR for us to merge.
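To make the collapse failure mode concrete, here is a minimal sketch of vanilla GRPO's group-normalized advantage (a standard formulation, not this repository's implementation): when every completion in a group earns the same reward, the standard deviation vanishes and all advantages go to zero, stalling learning.

```python
import math

def grpo_advantages(group_rewards, eps=1e-6):
    # Vanilla GRPO: normalize each sampled completion's reward by the
    # group's mean and standard deviation. If all rewards in the group
    # are equal, std -> 0 and every advantage collapses to ~0.
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Remedies in the works cited above adjust exactly this normalization (e.g., dropping the std division) or the clipping around it.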

Citation and Contact

If you find our work useful, please cite:

@article{yin2026learning,
  title={Learning to Detect Language Model Training Data via Active Reconstruction},
  author={Yin, Oscar Junjie and Morris, John X. and Shmatikov, Vitaly and Min, Sewon and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2602.19020},
  year={2026}
}

If you have any questions, you can contact Oscar or open a GitHub issue.
