[TMLR] Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping


Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li

📰 News • ⚙️ Installation • 📦 Layout • 🧰 Models • 📊 Benchmark • 📄 Citation

This is the official implementation for the paper Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping (TMLR).

📖 Introduction

This project studies architectural redundancy in Transformer-based LLMs and provides practical pipelines for:

  • Block Drop
  • Layer Drop (Attention/MLP)
  • Joint Layer Drop
  • Post-training quantization (AWQ/GPTQ)

The dropping pipeline is built on LLaMA-Factory. Quantization support is built on AutoAWQ and AutoGPTQ.

![Overview of the layer-dropping pipeline](Layer-Drop.svg)

📰 News

  • Feb 2026: This paper is published in Transactions on Machine Learning Research (TMLR).
  • May 2025: 🏆 Awarded the Qualcomm Innovation Fellowship (QIF) North America for the proposal “Less Attention, Much Faster: Toward a Future of Efficiency-Optimized Transformer Architectures.”
  • Nov 2024: Added support for more model families (Gemma2, Baichuan, DeepSeek, Yi, Solar).
  • Sep 2024: Released dropped-model checkpoints in this Hugging Face collection.
  • Jun 2024: Released arXiv preprint and code.

⚙️ Installation

conda create -n llm-drop python=3.10 -y
conda activate llm-drop

git clone https://github.com/CASE-Lab-UMD/LLM-Drop.git
cd LLM-Drop

# Core dropping pipeline
pip install -e .

# Quantization dependencies (optional)
cd src/llmtuner/compression/quantization/AutoAWQ
pip install -e .

cd AutoAWQ_kernels
pip install -e .

cd ../../AutoGPTQ
pip install -vvv --no-build-isolation -e .

cd ../../../../../..

📦 Repository Layout

  • src/compress.py: main entry for dropping/compression workflow.
  • scripts/dropping/*.sh: example scripts for block/layer dropping.
  • scripts/benchmark/benchmark_lm_eval.sh: LM-Eval benchmark script.
  • scripts/benchmark/benchmark_speed.sh: speed benchmark wrapper.
  • src/benchmark_speed.py: speed benchmarking implementation.
  • scripts/quantization/*.sh: AWQ/GPTQ quantization examples.

🧰 Prepare Models

  1. Download a base model from Hugging Face (for example mistralai/Mistral-7B-v0.1).
  2. Add auto_map to the model's config.json so Transformers can load the custom dropped-model classes.
  3. Set the drop lists in config.json:
     • Drop attention layers:
       "drop_mlp_list": [],
       "drop_attn_list": [25, 26, 24, 22]
     • Drop MLP layers:
       "drop_mlp_list": [26, 27, 25, 24],
       "drop_attn_list": []
     • Drop full blocks:
       "drop_mlp_list": [26, 25, 24, 27],
       "drop_attn_list": [26, 25, 24, 27]

Example auto_map for Mistral:

"auto_map": {
  "AutoConfig": "configuration_dropped_mistral.MistralConfig",
  "AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM"
}

See model files under src/llmtuner/compression/prune/models.
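Step 3 can also be scripted instead of edited by hand. A minimal sketch (the `config_path` below is a hypothetical local path, and `set_drop_lists` is an illustrative helper, not part of this repository):

```python
import json

# Hypothetical path to a downloaded checkpoint's config; adjust to your setup.
config_path = "Mistral-7B-v0.1/config.json"

def set_drop_lists(config: dict, attn_list: list, mlp_list: list) -> dict:
    """Write the drop lists from step 3 into a loaded config.json dict."""
    config["drop_attn_list"] = attn_list
    config["drop_mlp_list"] = mlp_list
    return config

# Example: drop four attention layers and no MLP layers, matching the first
# snippet above. Uncomment the json.dump line to persist the change.
config = {"model_type": "mistral", "num_hidden_layers": 32}
config = set_drop_lists(config, attn_list=[25, 26, 24, 22], mlp_list=[])
# json.dump(config, open(config_path, "w"), indent=2)
print(config["drop_attn_list"])  # [25, 26, 24, 22]
```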

🚀 Run Dropping

# Block Drop
bash scripts/dropping/block_drop.sh

# Layer Drop
bash scripts/dropping/layer_drop.sh

# Joint Layer Drop
bash scripts/dropping/layer_drop_joint.sh

These scripts estimate module importance, select layers/blocks to drop, and generate updated model configs/checkpoints.
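The selection step can be illustrated with a toy similarity-based criterion: a block whose output barely differs from its input contributes little and is a dropping candidate. This is a synthetic sketch of the idea, not the scripts' actual code, and the exact metric used by the pipeline may differ:

```python
import numpy as np

def block_importance(x_in: np.ndarray, x_out: np.ndarray) -> float:
    """Importance as 1 - mean cosine similarity between a block's input and
    output hidden states: near-identity blocks score low and are dropped
    first (an illustrative similarity-based criterion)."""
    cos = np.sum(x_in * x_out, axis=-1) / (
        np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
    )
    return float(1.0 - cos.mean())

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))  # toy hidden states (tokens x dim)
outputs = [
    hidden + 0.01 * rng.normal(size=hidden.shape),  # near-identity block
    hidden + 1.00 * rng.normal(size=hidden.shape),  # block that changes a lot
]
scores = [block_importance(hidden, out) for out in outputs]
drop_order = np.argsort(scores)  # least important (most redundant) first
print(drop_order)  # the near-identity block ranks first
```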

📊 Benchmark

🧪 1) Task Performance

bash scripts/benchmark/benchmark_lm_eval.sh

⚡ 2) Inference Speed

bash scripts/benchmark/benchmark_speed.sh

Before running, edit placeholders in scripts/benchmark/benchmark_speed.sh:

  • model_path
  • save_file
  • model_type
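For intuition, what this benchmark reports is tokens generated per wall-clock second. A stand-in sketch with a dummy generator (not the code in src/benchmark_speed.py):

```python
import time

def measure_throughput(generate_fn, n_tokens: int) -> float:
    """Tokens/sec for a generation callable: total tokens divided by
    elapsed wall-clock time."""
    start = time.perf_counter()
    generate_fn(n_tokens)
    return n_tokens / (time.perf_counter() - start)

# Stand-in "model": sleeps 1 ms per token, so ~1000 tokens/sec at best.
tput = measure_throughput(lambda n: time.sleep(0.001 * n), n_tokens=100)
print(f"{tput:.0f} tokens/sec")
```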

🧊 3) Quantization

bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh

Before running, edit placeholders in those scripts (model_path, quant_path) and ensure CUDA-compatible package versions.
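For intuition about the storage format AWQ/GPTQ target, here is a minimal sketch of symmetric group-wise 4-bit weight quantization on toy data. The real algorithms choose scales and rounding far more carefully; this only illustrates the shared-scale-per-group layout:

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128, bits: int = 4):
    """Symmetric group-wise quantization: each group of `group_size` weights
    shares one scale, and values round to `bits`-bit signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from integers and per-group scales."""
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1024).astype(np.float32)  # toy weight row
q, s = quantize_groupwise(w, group_size=128)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs error: {err:.5f}")  # small relative to the weight scale
```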

📄 Citation

@article{he2026uncovering,
    title={Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping},
    author={Shwai He and Guoheng Sun and Zheyu Shen and Ang Li},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2026},
    url={https://openreview.net/forum?id=1I7PCbOPfe}
}

📬 Contact

  • Shwai He: shwaihe@umd.edu
  • Guoheng Sun: ghsun@umd.edu
