Official codebase for Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [[ArXiv](https://arxiv.org/abs/2402.02834)] [[ICLR 2024 Workshop on ME-FoMo](https://openreview.net/forum?id=18VGxuOdpu)] [Blog Post].
- We perform one-shot pruning by removing unimportant Transformer blocks in LLMs. Compared to recent baselines, our depth pruning achieves faster inference while yielding comparable or superior performance (a minimal sketch of block removal is shown below).
- In retraining pruned models for quality recovery, continued pretraining (CPT) on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios.
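To make the idea concrete, here is a minimal sketch of block-level depth pruning on a Hugging Face `LlamaForCausalLM`. The dropped indices are placeholders, not the blocks actually selected by our PPL or Taylor+ criteria; see the pruning scripts below for the real procedure.

```python
# Minimal sketch: depth pruning = deleting whole Transformer blocks.
# The indices below are placeholders; the repo selects blocks via PPL / Taylor+ criteria.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.3")

blocks_to_drop = {24, 25, 26, 27, 28, 29}   # placeholder indices (~20% of 32 blocks)
kept = [blk for i, blk in enumerate(model.model.layers) if i not in blocks_to_drop]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)  # keep the config consistent for saving
```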
conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
Note on package versions:
(optional) GPTQ Support:
- Post-training quantization can be further applied to our pruned models.
- We applied GPTQ on the pruned & re-trained models.
- repo: AutoGPTQ, version 0.7.1
- To install the required packages, we recommend installing from source:
  git clone https://github.com/AutoGPTQ/AutoGPTQ.git
  cd AutoGPTQ
  git checkout v0.7.1
  pip install -vvv -e .
Models from aggressive pruning & CPT retraining:

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 45% | PPL | nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl |
| Vicuna-v1.3-7B | 60% | PPL | nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl |
| Vicuna-v1.3-7B | 80% | PPL | nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl |
Models from moderate pruning & LoRA retraining:

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |
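The released checkpoints above are ordinary Hugging Face models, so they can be loaded directly with `transformers` (illustrative snippet; dtype and generation settings are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl"  # any checkpoint from the tables above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"  # device_map requires accelerate
)

prompt = "Depth pruning speeds up inference because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```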
The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation.
- Pruning criterion: PPL (first script in each pair) and Taylor+ (second).
- LLaMA-1-7b (based on `LlamaForCausalLM`)
  bash script/prune_llama-7b_crit-ppl.sh
  bash script/prune_llama-7b_crit-taylor.sh
- Llama-2-7b (based on `LlamaForCausalLM`)
  bash script/prune_llama2-7b_crit-ppl.sh
  bash script/prune_llama2-7b_crit-taylor.sh
- Llama-3-8B (based on `LlamaForCausalLM`)
  bash script/prune_llama3-8b_crit-ppl.sh
  bash script/prune_llama3-8b_crit-taylor.sh
- Vicuna-7b-v1.3 (based on `LlamaForCausalLM`)
  bash script/prune_vicuna-7b_crit-ppl.sh
  bash script/prune_vicuna-7b_crit-taylor.sh
- Vicuna-13b-v1.3 (based on `LlamaForCausalLM`)
  bash script/prune_vicuna-13b_crit-ppl.sh
  bash script/prune_vicuna-13b_crit-taylor.sh
- CatPPT-base (based on `MistralForCausalLM`)
  bash script/prune_CatPPT_crit-ppl.sh
  bash script/prune_CatPPT_crit-taylor.sh
- Gemma-2b (based on `GemmaForCausalLM`)
  bash script/prune_gemma-2b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-2b_crit-taylor_yesBOS.sh
- Gemma-7b (based on `GemmaForCausalLM`)
  bash script/prune_gemma-7b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-7b_crit-taylor_yesBOS.sh
- To test other pruning ratios, use:
  bash script/prune.sh
- To obtain baselines using the magnitude pruning criterion, use:
  bash script/prune_llama-7b_crit-magnitude.sh
  bash script/prune_vicuna-7b_crit-magnitude.sh
  bash script/prune_vicuna-13b_crit-magnitude.sh
- To measure (1) PPL on WikiText2 & PTB and (2) accuracy on seven commonsense reasoning tasks (with EleutherAI/lm-evaluation-harness, commit 3326c54), use the command below; a simplified perplexity sketch is also given after this list.
  bash script/evaluate.sh
- (Optional) Any post-training quantization method can be applied to our pruned models. The example script quantizes our pruned models using GPTQ and measures their performance with `script/evaluate.sh`; a minimal AutoGPTQ sketch follows this list.
  bash script/quantize_gptq_vicuna-7b.sh
- To measure latency & throughput, use the command below; a rough timing-and-memory sketch follows this list.
  bash script/measure_time.sh
- To measure VRAM requirements, use:
  bash script/measure_vram.sh
- To measure GPU compute utilization, use:
  bash script/measure_gpuutil.sh
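For reference, the following is a simplified way to check WikiText2 perplexity with plain `transformers`/`datasets`. It is not the exact procedure of `script/evaluate.sh` (which relies on lm-evaluation-harness), and the checkpoint name, sequence length, and windowing are illustrative.

```python
# Simplified WikiText2 perplexity check (not the harness-based evaluation used by the repo).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seq_len, nlls = 2048, []
for start in range(0, enc.input_ids.size(1) - seq_len, seq_len):
    ids = enc.input_ids[:, start:start + seq_len].cuda()
    with torch.no_grad():
        nlls.append(model(ids, labels=ids).loss.float())

print(f"WikiText2 PPL: {torch.exp(torch.stack(nlls).mean()).item():.2f}")
```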
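If you prefer to quantize a pruned checkpoint by hand rather than via `script/quantize_gptq_vicuna-7b.sh`, a minimal AutoGPTQ sketch looks roughly like this; the bit width, group size, and toy calibration texts are assumptions, and real calibration should use a proper corpus.

```python
# Rough GPTQ quantization sketch with AutoGPTQ (calibration data here is a toy placeholder).
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

name = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(name, quantize_config)

calib_texts = [
    "Depth pruning removes entire Transformer blocks from the network.",
    "Post-training quantization further reduces the memory footprint.",
]
examples = [
    {"input_ids": e.input_ids, "attention_mask": e.attention_mask}
    for e in (tokenizer(t, return_tensors="pt") for t in calib_texts)
]

model.quantize(examples)
model.save_quantized("st-vicuna-v1.3-5.5b-ppl-gptq-4bit")
```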
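And a rough, self-contained way to eyeball latency, throughput, and peak VRAM on a single GPU; this is not the repo's measurement scripts, and the prompt, batch size, and token counts are arbitrary.

```python
# Rough latency / throughput / peak-VRAM probe (not the repo's measurement scripts).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("Efficient inference with pruned models", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()        # peak will include weights + activations

model.generate(**inputs, max_new_tokens=8)  # warm-up
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs.input_ids.shape[1]
print(f"latency: {elapsed:.2f} s | throughput: {new_tokens / elapsed:.1f} tokens/s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```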
The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (Ours; depth pruning) for the LLaMA-1-7B model:
pip install transformers==4.33.1 # to run LLM-Pruner's model
python src/app.py
- All rights related to this repository and the compressed models are reserved by Nota Inc.
- The intended use is strictly limited to research and non-commercial projects.
- Microsoft for Startups Founders Hub and Gwangju AICA for generously providing GPU resources.
- LLM-Pruner, which utilizes LM Evaluation Harness, PEFT, and Alpaca-LoRA. Thanks for the pioneering work on structured pruning of LLMs!
- LLaMA, Vicuna, and SlimPajama. Thanks for the open-source LLMs and data!
@article{kim2024shortened,
title={Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods},
author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
journal={arXiv preprint arXiv:2402.02834},
year={2024},
url={https://arxiv.org/abs/2402.02834}
}
@article{kim2024mefomo,
title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
year={2024},
url={https://openreview.net/forum?id=18VGxuOdpu}
}