[TMLR] Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping


Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li

📰 News • ⚙️ Installation • 📦 Layout • 🧰 Models • 📊 Benchmark • 📄 Citation

This is the official implementation for the paper Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping (TMLR).

📖 Introduction

This project studies architectural redundancy in Transformer-based LLMs and provides practical pipelines for:

  • Block Drop
  • Layer Drop (Attention/MLP)
  • Joint Layer Drop
  • Post-training quantization (AWQ/GPTQ)

The dropping pipeline is built on LLaMA-Factory. Quantization support is built on AutoAWQ and AutoGPTQ.

![Overview of the layer-dropping pipeline](Layer-Drop.svg)

📰 News

  • Feb 2026: This paper is published in Transactions on Machine Learning Research (TMLR).
  • May 2025: 🏆 Awarded the Qualcomm Innovation Fellowship (QIF) North America for the proposal “Less Attention, Much Faster: Toward a Future of Efficiency-Optimized Transformer Architectures.”
  • Nov 2024: Added support for more model families (Gemma2, Baichuan, DeepSeek, Yi, Solar).
  • Sep 2024: Released dropped-model checkpoints in this Hugging Face collection.
  • Jun 2024: Released arXiv preprint and code.

⚙️ Installation

conda create -n llm-drop python=3.10 -y
conda activate llm-drop

git clone https://github.com/CASE-Lab-UMD/LLM-Drop.git
cd LLM-Drop

# Core dropping pipeline
pip install -e .

# Quantization dependencies (optional)
cd src/llmtuner/compression/quantization/AutoAWQ
pip install -e .

cd AutoAWQ_kernels
pip install -e .

cd ../../AutoGPTQ
pip install -vvv --no-build-isolation -e .

cd ../../../../../..

📦 Repository Layout

  • src/compress.py: main entry for dropping/compression workflow.
  • scripts/dropping/*.sh: example scripts for block/layer dropping.
  • scripts/benchmark/benchmark_lm_eval.sh: LM-Eval benchmark script.
  • scripts/benchmark/benchmark_speed.sh: speed benchmark wrapper.
  • src/benchmark_speed.py: speed benchmarking implementation.
  • scripts/quantization/*.sh: AWQ/GPTQ quantization examples.

🧰 Prepare Models

  1. Download a base model from Hugging Face (for example mistralai/Mistral-7B-v0.1).
  2. Add auto_map to the model's config.json so Transformers can load the custom dropped-model classes.
  3. Set the drop lists in config.json:
     • Drop attention layers:
       "drop_mlp_list": [],
       "drop_attn_list": [25, 26, 24, 22]
     • Drop MLP layers:
       "drop_mlp_list": [26, 27, 25, 24],
       "drop_attn_list": []
     • Drop full blocks:
       "drop_mlp_list": [26, 25, 24, 27],
       "drop_attn_list": [26, 25, 24, 27]

Example auto_map for Mistral:

"auto_map": {
  "AutoConfig": "configuration_dropped_mistral.MistralConfig",
  "AutoModelForCausalLM": "modeling_dropped_mistral.MistralForCausalLM"
}

See model files under src/llmtuner/compression/prune/models.
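Step 3 can also be scripted instead of edited by hand. A minimal sketch (the `config_path` below is a hypothetical local path, and `set_drop_lists` is an illustrative helper, not part of this repository):

```python
import json

# Hypothetical path to a downloaded checkpoint's config; adjust to your setup.
config_path = "Mistral-7B-v0.1/config.json"

def set_drop_lists(config: dict, attn_list: list, mlp_list: list) -> dict:
    """Write the drop lists from step 3 into a loaded config.json dict."""
    config["drop_attn_list"] = attn_list
    config["drop_mlp_list"] = mlp_list
    return config

# Example: drop four attention layers and no MLP layers, matching the first
# snippet above. Uncomment the json.dump line to persist the change.
config = {"model_type": "mistral", "num_hidden_layers": 32}
config = set_drop_lists(config, attn_list=[25, 26, 24, 22], mlp_list=[])
# json.dump(config, open(config_path, "w"), indent=2)
print(config["drop_attn_list"])  # [25, 26, 24, 22]
```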

🚀 Run Dropping

# Block Drop
bash scripts/dropping/block_drop.sh

# Layer Drop
bash scripts/dropping/layer_drop.sh

# Joint Layer Drop
bash scripts/dropping/layer_drop_joint.sh

These scripts estimate module importance, select layers/blocks to drop, and generate updated model configs/checkpoints.
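The selection step can be illustrated with a toy similarity-based criterion: a block whose output barely differs from its input contributes little and is a dropping candidate. This is a synthetic sketch of the idea, not the scripts' actual code, and the exact metric used by the pipeline may differ:

```python
import numpy as np

def block_importance(x_in: np.ndarray, x_out: np.ndarray) -> float:
    """Importance as 1 - mean cosine similarity between a block's input and
    output hidden states: near-identity blocks score low and are dropped
    first (an illustrative similarity-based criterion)."""
    cos = np.sum(x_in * x_out, axis=-1) / (
        np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
    )
    return float(1.0 - cos.mean())

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))  # toy hidden states (tokens x dim)
outputs = [
    hidden + 0.01 * rng.normal(size=hidden.shape),  # near-identity block
    hidden + 1.00 * rng.normal(size=hidden.shape),  # block that changes a lot
]
scores = [block_importance(hidden, out) for out in outputs]
drop_order = np.argsort(scores)  # least important (most redundant) first
print(drop_order)  # the near-identity block ranks first
```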

📊 Benchmark

🧪 1) Task Performance

bash scripts/benchmark/benchmark_lm_eval.sh

⚡ 2) Inference Speed

bash scripts/benchmark/benchmark_speed.sh

Before running, edit placeholders in scripts/benchmark/benchmark_speed.sh:

  • model_path
  • save_file
  • model_type
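For intuition, what this benchmark reports is tokens generated per wall-clock second. A stand-in sketch with a dummy generator (not the code in src/benchmark_speed.py):

```python
import time

def measure_throughput(generate_fn, n_tokens: int) -> float:
    """Tokens/sec for a generation callable: total tokens divided by
    elapsed wall-clock time."""
    start = time.perf_counter()
    generate_fn(n_tokens)
    return n_tokens / (time.perf_counter() - start)

# Stand-in "model": sleeps 1 ms per token, so ~1000 tokens/sec at best.
tput = measure_throughput(lambda n: time.sleep(0.001 * n), n_tokens=100)
print(f"{tput:.0f} tokens/sec")
```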

🧊 3) Quantization

bash scripts/quantization/awq.sh
bash scripts/quantization/gptq.sh

Before running, edit placeholders in those scripts (model_path, quant_path) and ensure CUDA-compatible package versions.
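For intuition about the storage format AWQ/GPTQ target, here is a minimal sketch of symmetric group-wise 4-bit weight quantization on toy data. The real algorithms choose scales and rounding far more carefully; this only illustrates the shared-scale-per-group layout:

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128, bits: int = 4):
    """Symmetric group-wise quantization: each group of `group_size` weights
    shares one scale, and values round to `bits`-bit signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from integers and per-group scales."""
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=1024).astype(np.float32)  # toy weight row
q, s = quantize_groupwise(w, group_size=128)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs error: {err:.5f}")  # small relative to the weight scale
```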

📄 Citation

@article{he2026uncovering,
    title={Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping},
    author={Shwai He and Guoheng Sun and Zheyu Shen and Ang Li},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2026},
    url={https://openreview.net/forum?id=1I7PCbOPfe}
}

📬 Contact

  • Shwai He: shwaihe@umd.edu
  • Guoheng Sun: ghsun@umd.edu
