This project is an unofficial implementation of the paper Compact Language Models via Pruning and Knowledge Distillation. It explores techniques for compressing large language models (LLMs) through a combination of pruning and knowledge distillation.
The goal of this project is to investigate whether pruning an existing LLM and then re-training it with a small fraction of the original training data can be a viable alternative to training each model variant from scratch. The implementation focuses on:
- Pruning strategies for width, attention, and MLP layers (a sketch of the MLP case follows this list)
- Combining different pruning axes
- Knowledge distillation techniques for retraining
- Searching for optimal compressed architectures
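To make the width-pruning idea concrete, here is a minimal sketch of removing MLP neurons given per-neuron importance scores. The function name and shapes are illustrative only, not the actual `pruners.py` API:

```python
import torch
import torch.nn as nn

def prune_mlp_neurons(fc_in: nn.Linear, fc_out: nn.Linear,
                      importance: torch.Tensor, keep: int):
    """Keep the `keep` most important hidden neurons of an MLP block.

    fc_in:      Linear(d_model -> d_hidden)
    fc_out:     Linear(d_hidden -> d_model)
    importance: per-neuron scores of shape (d_hidden,)
    """
    # Indices of the neurons to keep, ranked by importance score.
    idx = torch.topk(importance, keep).indices.sort().values

    new_in = nn.Linear(fc_in.in_features, keep, bias=fc_in.bias is not None)
    new_out = nn.Linear(keep, fc_out.out_features, bias=fc_out.bias is not None)

    with torch.no_grad():
        # Rows of fc_in and columns of fc_out correspond to hidden neurons.
        new_in.weight.copy_(fc_in.weight[idx])
        if fc_in.bias is not None:
            new_in.bias.copy_(fc_in.bias[idx])
        new_out.weight.copy_(fc_out.weight[:, idx])
        if fc_out.bias is not None:
            new_out.bias.copy_(fc_out.bias)
    return new_in, new_out
```

Attention-head and embedding-channel pruning follow the same pattern, just slicing different dimensions of the projection matrices.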
- `models.py`: Contains the implementation of the GPT model and its components
- `hooks.py`: Implements forward hooks for calculating importance scores (a sketch follows this list)
- `pruners.py`: Contains functions for pruning neurons, attention heads, and embeddings
- `utils.py`: Utility functions for data loading, model saving/loading, and evaluation
- `script.py`: Main script for running experiments
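As an illustration of hook-based importance scoring, here is a minimal sketch that accumulates mean absolute activations per output neuron during a calibration pass. The names are my own and are not the actual `hooks.py` API:

```python
import torch
import torch.nn as nn

def attach_importance_hook(linear: nn.Linear):
    """Accumulate a per-output-neuron importance score during calibration.

    The score here is the mean absolute activation of each output unit,
    summed over all batches seen while the hook is attached.
    """
    linear.importance = torch.zeros(linear.out_features)

    def hook(module, inputs, output):
        # output: (batch, seq, out_features) -> mean |activation| per neuron
        module.importance += output.detach().abs().mean(dim=(0, 1)).cpu()

    return linear.register_forward_hook(hook)

# Usage: attach the hook, run a few calibration batches, then read .importance
# handle = attach_importance_hook(some_linear_layer)
# for batch in calibration_loader:
#     model(batch)
# handle.remove()
```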
- Clone the repository
- Install the required dependencies (from the `pyproject.toml` file)
- Download the training data (Shakespeare dataset) by running the script
- Adjust hyperparameters in `script.py` as needed (see the example below)
- Run `script.py` to train the base model and perform pruning experiments
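The variable names below are purely hypothetical and only illustrate the kind of knobs you would expect to tune near the top of `script.py`; check the file itself for the real ones:

```python
# Hypothetical example; the actual names live at the top of script.py.
batch_size = 64                        # sequences per optimization step
block_size = 256                       # context length in tokens
n_layer, n_head, n_embd = 6, 6, 384    # base model size
prune_ratio = 0.5                      # fraction of neurons/heads to remove
kd_temperature = 2.0                   # softmax temperature for distillation
max_iters = 5000                       # training iterations for the base model
```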
- Implementation of a GPT-style language model
- Flexible pruning strategies for different model components
- Knowledge distillation for model retraining (see the loss sketch below)
- Experimental framework for testing various compression configurations
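For the distillation step, a common recipe is to blend the hard-label cross-entropy loss with a temperature-scaled KL term between teacher and student logits. The sketch below shows that recipe; the weighting and temperature are illustrative, not necessarily what `script.py` uses:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-label KL against the teacher.

    student_logits, teacher_logits: (batch, seq, vocab)
    targets: (batch, seq) token ids
    """
    vocab = student_logits.size(-1)

    # Hard-label loss on the ground-truth next tokens.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1))

    # Soft-label loss: KL(teacher || student) at temperature T.
    s = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    kl = F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl
```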
The implementation does not currently support any kind of CLI usage; I focused on the math-heavy parts instead.
(work in progress)
- This implementation currently targets a much smaller model than the paper (a few thousand times smaller, since I don't have access to GPUs)
- Further optimization of the pruning and distillation techniques may be possible (depth pruning is not implemented, since my focus is applying the technique to smaller models, <15B parameters)
```bibtex
@article{minitron2024,
  title={Compact Language Models via Pruning and Knowledge Distillation},
  author={Saurav Muralidharan and Sharath Turuvekere Sreenivas and Raviraj Joshi and Marcin Chochowski and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jan Kautz and Pavlo Molchanov},
  journal={arXiv preprint arXiv:2407.14679},
  year={2024},
  url={https://arxiv.org/abs/2407.14679},
}
```
- Andrej Karpathy, whose work motivated me to finally act on my FOMO and build this.