Code used for Distillation Robustifies Unlearning. The /src directory and the run_*.py files host all runnable scripts.
Large language models can acquire undesirable capabilities during pretraining that complicate model deployment. Machine unlearning offers one approach to this challenge by attempting to remove these capabilities, but current methods only offer surface-level suppression that can be easily reversed through finetuning. We show that distilling unlearned models into randomly initialized students enables robust capability removal. However, full distillation is computationally expensive for large models. We address this with Unlearn-Noise-Distill-on-Outputs (UNDO), which approximates full distillation by copying and noising the weights of an unlearned teacher model. Using this approach, we demonstrate robust unlearning across synthetic language and arithmetic tasks: UNDO achieves Pareto-optimal performance while matching gold-standard data filtering robustness at a fraction of the compute cost, and successfully robustifies unlearning on the more realistic WMDP benchmark. Given that distillation is already widely used, adding an unlearning step beforehand enables robust capability removal at little extra cost.
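For intuition, the core UNDO step can be sketched in a few lines of PyTorch. This is a conceptual sketch only, not the code in /src: the model interface (Hugging Face-style `.logits`), the interpolate-with-noise scheme, and names such as `alpha` and `temperature` are illustrative assumptions.

```python
# Conceptual sketch of UNDO (Unlearn-Noise-Distill-on-Outputs); illustrative only.
# Assumes `teacher` is an already-unlearned Hugging Face-style causal LM whose
# forward pass returns an object with a `.logits` field.
import copy

import torch
import torch.nn.functional as F


def make_noised_student(teacher: torch.nn.Module, alpha: float) -> torch.nn.Module:
    """Copy the unlearned teacher and interpolate its weights toward random noise."""
    student = copy.deepcopy(teacher)
    with torch.no_grad():
        for p in student.parameters():
            # Scale the noise to each parameter's own spread so the perturbation
            # is comparable across layers (one of many reasonable choices).
            noise = torch.randn_like(p) * p.std() if p.numel() > 1 else torch.zeros_like(p)
            p.mul_(1.0 - alpha).add_(alpha * noise)
    return student


def distill_step(student, teacher, batch, optimizer, temperature: float = 1.0):
    """One distillation-on-outputs step: match student logits to teacher logits."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Roughly, a larger perturbation (`alpha` here) erases more of the teacher's latent capability but requires more distillation compute to recover general performance, which is the compute-robustness tradeoff behind the Pareto claim above.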
For users who want to run a minimal example on language tasks:
- Complete the Environment Setup
- Follow the Set up only language instructions
- Run the scripts in the following sequence: pretrain, unlearn, partial-distill, relearn (a minimal driver for this sequence is sketched below)
This will get you up and running with the core functionality on a single task type.
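If it helps, the sketch below runs that sequence end to end as a single Python driver. The script filenames are assumptions based on the run_*.py naming convention; check the repository root for the exact names.

```python
# Hypothetical driver for the minimal language-task pipeline; the filenames
# below are assumptions and should be checked against the repository root.
import subprocess

PIPELINE = [
    "run_pretrain.py",
    "run_unlearn.py",
    "run_partial_distill.py",
    "run_relearn.py",
]

for script in PIPELINE:
    # check=True aborts the pipeline as soon as any stage fails.
    subprocess.run(["python", script], check=True)
```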
- Python 3.8+
- CUDA-compatible GPU(s) recommended. Default parameters are set for H200s. For GPUs with less memory, reduce the batch size and increase gradient accumulation by the same factor so the effective batch size stays the same (see the sketch after this list).
- All scripts are meant to be run from the distillation-robustifies-unlearning directory.
- Most run_* scripts will automatically use all available GPUs, launching several processes in parallel or sequentially as capacity allows until all specified settings have been run. To restrict which GPUs are used, precede the command with `CUDA_VISIBLE_DEVICES={desired devices}`, e.g. `CUDA_VISIBLE_DEVICES=0,1`.
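To make the batch-size note above concrete, here is a small illustration; the configuration keys are hypothetical placeholders, not the scripts' actual argument names.

```python
# Hypothetical illustration: shrinking the per-device batch size by some factor
# while growing gradient accumulation by the same factor keeps the effective
# batch size (and thus the optimization behaviour) roughly unchanged.
h200_config = {"per_device_batch_size": 16, "grad_accum_steps": 1}
smaller_gpu_config = {"per_device_batch_size": 4, "grad_accum_steps": 4}


def effective_batch_size(cfg: dict) -> int:
    return cfg["per_device_batch_size"] * cfg["grad_accum_steps"]


assert effective_batch_size(h200_config) == effective_batch_size(smaller_gpu_config) == 16
```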
Environment Setup:

```bash
git clone https://github.com/AddieFoote/distillation-robustify-unlearning
pip install uv
cd distillation-robustifies-unlearning
uv sync
source .venv/bin/activate
```
1. Add a huggingface token to `tokens/hf_token.txt` and a wandb token to `tokens/wandb_token.txt`
2. Run `python src/prepare_models/reduce_gemma.py`
3. Run `python src/prepare_data/download_datasets.py`
4. Run `python src/prepare_data/download_arithmetic.py`
5. Contact us for the WMDP question-answer datasets that were generated via `wmdp_question_extraction.py`
6. Run `python src/prepare_data/prepare.py`
Set up only language: Run all steps above, but before running step 3, open `src/prepare_data/download_datasets.py` and comment out all calls to `download_dataset` except those indicated in the comments as required for the language setting (the final two). This will substantially speed up steps 3 and 6. Skip steps 4 and 5.

Set up only arithmetic: Run all steps above, but before running step 3, open `src/prepare_data/download_datasets.py` and comment out all calls to `download_dataset` except those indicated in the comments as required for the arithmetic setting (the second to last). This will substantially speed up steps 3 and 6. Skip step 5.

Set up only WMDP: Run all steps above, but before running step 3, open `src/prepare_data/download_datasets.py` and comment out all calls to `download_dataset` except those indicated in the comments as required for the WMDP setting (only the last is not required). This will speed up steps 3 and 6. Skip step 4.
All scripts can be run using `python run_...py`.
```bibtex
@misc{lee2025distillationrobustifiesunlearning,
      title={Distillation Robustifies Unlearning},
      author={Bruce W. Lee and Addie Foote and Alex Infanger and Leni Shor and Harish Kamath and Jacob Goldman-Wetzler and Bryce Woodworth and Alex Cloud and Alexander Matt Turner},
      year={2025},
      eprint={2506.06278},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.06278},
}
```