Official codebase for Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [[ArXiv](https://arxiv.org/abs/2402.02834)] [[ICLR 2024 Workshop on ME-FoMo](https://openreview.net/forum?id=18VGxuOdpu)] [Blog Post].
- We perform one-shot pruning by removing unimportant Transformer blocks in LLMs. Compared to recent baselines, our depth pruning achieves faster inference while yielding comparable or superior performance (a minimal sketch of block removal is shown below).
- In retraining pruned models for quality recovery, continued pretraining (CPT) on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios.
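To make the idea concrete, here is a minimal sketch of block-level depth pruning on a Hugging Face `LlamaForCausalLM`. The dropped indices are placeholders, not the blocks actually selected by our PPL or Taylor+ criteria; see the pruning scripts below for the real procedure.

```python
# Minimal sketch: depth pruning = deleting whole Transformer blocks.
# The indices below are placeholders; the repo selects blocks via PPL / Taylor+ criteria.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.3")

blocks_to_drop = {24, 25, 26, 27, 28, 29}   # placeholder indices (~20% of 32 blocks)
kept = [blk for i, blk in enumerate(model.model.layers) if i not in blocks_to_drop]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)  # keep the config consistent for saving
```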
conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
Note on package versions:
(optional) GPTQ Support:
- Post-training quantization can be further applied to our pruned models.
- We applied GPTQ on the pruned & re-trained models.
- repo: AutoGPTQ, version 0.7.1
- To install the required packages, we recommend installing from source:
  git clone https://github.com/AutoGPTQ/AutoGPTQ.git
  cd AutoGPTQ
  git checkout v0.7.1
  pip install -vvv -e .
Models from aggressive pruning & CPT retraining:

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 45% | PPL | nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl |
| Vicuna-v1.3-7B | 60% | PPL | nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl |
| Vicuna-v1.3-7B | 80% | PPL | nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl |
Models from moderate pruning & LoRA retraining:

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |
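The released checkpoints above are ordinary Hugging Face models, so they can be loaded directly with `transformers` (illustrative snippet; dtype and generation settings are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl"  # any checkpoint from the tables above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"  # device_map requires accelerate
)

prompt = "Depth pruning speeds up inference because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```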
The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation.
- Pruning criterion: PPL (first script in each pair) and Taylor+ (second).
- LLaMA-1-7b (based on `LlamaForCausalLM`)
  bash script/prune_llama-7b_crit-ppl.sh
  bash script/prune_llama-7b_crit-taylor.sh
- Llama-2-7b (based on `LlamaForCausalLM`)
  bash script/prune_llama2-7b_crit-ppl.sh
  bash script/prune_llama2-7b_crit-taylor.sh
- Llama-3-8B (based on `LlamaForCausalLM`)
  bash script/prune_llama3-8b_crit-ppl.sh
  bash script/prune_llama3-8b_crit-taylor.sh
- Vicuna-7b-v1.3 (based on `LlamaForCausalLM`)
  bash script/prune_vicuna-7b_crit-ppl.sh
  bash script/prune_vicuna-7b_crit-taylor.sh
- Vicuna-13b-v1.3 (based on `LlamaForCausalLM`)
  bash script/prune_vicuna-13b_crit-ppl.sh
  bash script/prune_vicuna-13b_crit-taylor.sh
- CatPPT-base (based on `MistralForCausalLM`)
  bash script/prune_CatPPT_crit-ppl.sh
  bash script/prune_CatPPT_crit-taylor.sh
- Gemma-2b (based on `GemmaForCausalLM`)
  bash script/prune_gemma-2b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-2b_crit-taylor_yesBOS.sh
- Gemma-7b (based on `GemmaForCausalLM`)
  bash script/prune_gemma-7b_crit-ppl_yesBOS.sh
  bash script/prune_gemma-7b_crit-taylor_yesBOS.sh
- To test other pruning ratios, use:
  bash script/prune.sh
- To obtain baselines using the magnitude pruning criterion, use:
  bash script/prune_llama-7b_crit-magnitude.sh
  bash script/prune_vicuna-7b_crit-magnitude.sh
  bash script/prune_vicuna-13b_crit-magnitude.sh
- To measure (1) PPL on WikiText2 & PTB and (2) accuracy on seven commonsense reasoning tasks (with EleutherAI/lm-evaluation-harness, commit 3326c54), use the command below; a simplified perplexity sketch is also given after this list.
  bash script/evaluate.sh
- (Optional) Any post-training quantization method can be applied to our pruned models. The example script quantizes our pruned models using GPTQ and measures their performance with `script/evaluate.sh`; a minimal AutoGPTQ sketch follows this list.
  bash script/quantize_gptq_vicuna-7b.sh
- To measure latency & throughput, use the command below; a rough timing-and-memory sketch follows this list.
  bash script/measure_time.sh
- To measure VRAM requirements, use:
  bash script/measure_vram.sh
- To measure GPU compute utilization, use:
  bash script/measure_gpuutil.sh
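For reference, the following is a simplified way to check WikiText2 perplexity with plain `transformers`/`datasets`. It is not the exact procedure of `script/evaluate.sh` (which relies on lm-evaluation-harness), and the checkpoint name, sequence length, and windowing are illustrative.

```python
# Simplified WikiText2 perplexity check (not the harness-based evaluation used by the repo).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seq_len, nlls = 2048, []
for start in range(0, enc.input_ids.size(1) - seq_len, seq_len):
    ids = enc.input_ids[:, start:start + seq_len].cuda()
    with torch.no_grad():
        nlls.append(model(ids, labels=ids).loss.float())

print(f"WikiText2 PPL: {torch.exp(torch.stack(nlls).mean()).item():.2f}")
```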
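If you prefer to quantize a pruned checkpoint by hand rather than via `script/quantize_gptq_vicuna-7b.sh`, a minimal AutoGPTQ sketch looks roughly like this; the bit width, group size, and toy calibration texts are assumptions, and real calibration should use a proper corpus.

```python
# Rough GPTQ quantization sketch with AutoGPTQ (calibration data here is a toy placeholder).
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

name = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(name, quantize_config)

calib_texts = [
    "Depth pruning removes entire Transformer blocks from the network.",
    "Post-training quantization further reduces the memory footprint.",
]
examples = [
    {"input_ids": e.input_ids, "attention_mask": e.attention_mask}
    for e in (tokenizer(t, return_tensors="pt") for t in calib_texts)
]

model.quantize(examples)
model.save_quantized("st-vicuna-v1.3-5.5b-ppl-gptq-4bit")
```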
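And a rough, self-contained way to eyeball latency, throughput, and peak VRAM on a single GPU; this is not the repo's measurement scripts, and the prompt, batch size, and token counts are arbitrary.

```python
# Rough latency / throughput / peak-VRAM probe (not the repo's measurement scripts).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "nota-ai/st-vicuna-v1.3-5.5b-ppl"  # any pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("Efficient inference with pruned models", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()        # peak will include weights + activations

model.generate(**inputs, max_new_tokens=8)  # warm-up
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs.input_ids.shape[1]
print(f"latency: {elapsed:.2f} s | throughput: {new_tokens / elapsed:.1f} tokens/s")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```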
The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (Ours; depth pruning) for the LLaMA-1-7B model:
pip install transformers==4.33.1 # to run LLM-Pruner's model
python src/app.py
- All rights related to this repository and the compressed models are reserved by Nota Inc.
- The intended use is strictly limited to research and non-commercial projects.
- Microsoft for Startups Founders Hub and Gwangju AICA for generously providing GPU resources.
- LLM-Pruner, which utilizes LM Evaluation Harness, PEFT, and Alpaca-LoRA. Thanks for the pioneering work on structured pruning of LLMs!
- LLaMA, Vicuna, and SlimPajama. Thanks for the open-source LLMs and data!
@article{kim2024shortened,
title={Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods},
author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
journal={arXiv preprint arXiv:2402.02834},
year={2024},
url={https://arxiv.org/abs/2402.02834}
}
@article{kim2024mefomo,
title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
year={2024},
url={https://openreview.net/forum?id=18VGxuOdpu}
}