CascadeExit: Adaptive Early-Exit Speculative Decoding for LLM Inference Acceleration

CascadeExit is a post-hoc method for accelerating LLM inference through confidence-calibrated early exit at intermediate transformer layers. It trains lightweight SwiGLU adapter modules at selected layers using knowledge distillation, and uses learned confidence estimators to route each token through a cascade of exits, from the cheapest computation to the most expensive.

Key Results

Important Limitation: The cascade approach achieves a 1.76x speedup, but at a significant quality cost. Quality verification shows only a 20% match rate between cascade outputs and full-model outputs, meaning 80% of generated sequences differ from what the full model would produce. The speedup comes from routing most tokens (65.1%) through the shallowest exit (Layer 8), which has only 41.4% top-1 accuracy and 7.2x the perplexity of the full model (74.18 vs. 10.26). Users should carefully evaluate whether this quality trade-off is acceptable for their use case.

Configuration	Speedup	Parameter Overhead
CascadeExit (L8/L16/L22)	1.76x	0.51% (16.5M params)
Best Speculative (L22, K=3)	0.84x	Same adapters
Standard Decoding	1.00x	Baseline

Token Routing Distribution (Cascade)

Exit Layer	Depth	Tokens Routed	Top-1 Accuracy	Perplexity
Layer 8	29%	65.1%	41.4%	74.18
Layer 16	57%	24.5%	54.4%	32.25
Layer 22	79%	7.5%	67.4%	18.06
Full Model	100%	2.9%	100%	10.26
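
The routing shares in the table can be turned into a rough per-token cost estimate. The sketch below is illustrative only: it assumes that cost scales linearly with the number of transformer layers executed and that the base model has 28 layers (consistent with the Depth column, but not stated in this README). It ignores the cost of the adapters and confidence estimators.

```python
# Illustrative check of the average compute cost implied by the routing table.
# Assumptions (not stated in the README): cost is proportional to the number of
# transformer layers executed, and the base model has 28 layers.
NUM_LAYERS = 28  # assumed depth of Llama-3.2-3B

# exit layer -> share of tokens that stop there (from the table above)
routing = {8: 0.651, 16: 0.245, 22: 0.075, NUM_LAYERS: 0.029}

def expected_cost(routing, num_layers):
    """Average fraction of a full forward pass spent per token."""
    return sum(share * layer / num_layers for layer, share in routing.items())

cost = expected_cost(routing, NUM_LAYERS)
print(f"average cost: {cost:.3f}x of a full forward pass")
print(f"ideal speedup under this cost model: {1 / cost:.2f}x")
```

Under these assumptions the estimate comes out at about 0.414x, matching the average compute cost quoted in the Approach section. The ideal speedup this simple model implies is higher than the measured 1.76x, presumably because of overheads the model leaves out.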

Approach

Post-hoc adapter training: SwiGLU exit adapters at layers 8, 16, 22 of Llama-3.2-3B, trained via knowledge distillation from the full model (frozen base, 3 epochs on WikiText-103)
Learned confidence calibration: Dedicated binary confidence estimator per exit layer predicting whether the early prediction matches the full model
Cascade exit strategy: Tokens route from shallowest to deepest exit, with the full model as fallback. Average compute cost: 0.414x of full forward pass
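
The cascade exit strategy described above can be sketched as a simple loop. This is a minimal illustration, not the notebook's implementation: `cascade_step`, `run_layers`, `make_exits`, and the 0.9 threshold are hypothetical names and values, and toy stand-ins replace the real model so the example runs on its own.

```python
NUM_LAYERS = 28   # assumed depth of the base model
THRESHOLD = 0.9   # hypothetical confidence cut-off, not taken from the README

def cascade_step(hidden, run_layers, exits, full_head,
                 num_layers=NUM_LAYERS, threshold=THRESHOLD):
    """Decode one token; return (token, layer at which decoding stopped).

    exits is a list of (layer, prediction head, confidence estimator) tuples,
    ordered from shallowest to deepest.
    """
    current = 0
    for layer, head, estimator in exits:              # shallowest exit first
        hidden = run_layers(hidden, current, layer)   # compute only the new layers
        current = layer
        if estimator(hidden) >= threshold:            # confident enough: exit early
            return head(hidden), layer
    hidden = run_layers(hidden, current, num_layers)  # fallback: finish the full model
    return full_head(hidden), num_layers

# --- toy stand-ins so the sketch runs without a real model ---
def run_layers(hidden, start, stop):
    return hidden + (stop - start)  # the "hidden state" just counts layers executed

def make_exits(confidences):
    """Build exits at layers 8, 16, 22 with fixed confidence scores."""
    return [(layer, int, lambda h, c=c: c)
            for layer, c in zip((8, 16, 22), confidences)]

# High confidence at the first exit: decoding stops at layer 8.
print(cascade_step(0, run_layers, make_exits([0.95, 0.5, 0.5]), int))
# No exit is confident: the token falls through to the full model.
print(cascade_step(0, run_layers, make_exits([0.2, 0.3, 0.1]), int))
```

Because each exit continues from the hidden state of the previous one, a token that falls through pays only for the extra layers plus the confidence checks it failed along the way.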

Finding: Self-speculative decoding without KV cache sharing is slower than standard decoding (0.5x to 0.84x), so under this constraint the cascade is the only configuration tested that delivers a speedup.
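
A back-of-the-envelope cost model suggests why drafting from a deep exit struggles to pay off. The sketch below is not the notebook's benchmark and is not expected to reproduce the measured numbers. It uses the standard expected-tokens-per-cycle formula for speculative decoding, and it assumes that each draft token is accepted independently, that a draft pass costs its depth fraction of a full pass, and that verification costs one full pass because nothing is reused from the draft. The function names and the choice of 67.4% as an acceptance rate are illustrative.

```python
def expected_tokens_per_cycle(accept_rate, k):
    """Expected tokens emitted per draft-and-verify cycle with k draft tokens,
    assuming each draft token is accepted independently with prob. accept_rate."""
    if accept_rate >= 1.0:
        return k + 1
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)

def speculative_speedup(accept_rate, k, draft_cost, verify_cost=1.0):
    """Tokens per unit of compute, relative to standard decoding (1 token per pass).

    draft_cost and verify_cost are in units of one full forward pass.
    """
    cycle_cost = k * draft_cost + verify_cost
    return expected_tokens_per_cycle(accept_rate, k) / cycle_cost

# Illustrative inputs: draft from layer 22 of an assumed 28-layer model, K=3,
# using the Layer 22 top-1 accuracy as a stand-in for the acceptance rate.
estimate = speculative_speedup(accept_rate=0.674, k=3, draft_cost=22 / 28)
print(f"estimated speedup: {estimate:.2f}x")
```

With a draft exit this deep, each cycle costs more than three full passes under these assumptions, so the acceptance rate has to be very high before the estimate rises above 1x.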

Project Structure

CascadeExit-Research/
├── CascadeExit_Speculative_Decoding.ipynb  # Full research pipeline
├── CascadeExit_Paper.md                    # Research paper
├── checkpoints/                            # Trained adapters + confidence estimators
├── results/                                # Evaluation metrics (JSON)
├── logs/                                   # Training logs
└── FINAL_SUMMARY.json                      # Complete execution summary

Usage

The full pipeline is in the Jupyter notebook. Requires an NVIDIA GPU with sufficient VRAM and HuggingFace access to Llama-3.2-3B.

pip install torch transformers accelerate datasets
jupyter notebook CascadeExit_Speculative_Decoding.ipynb

Hardware: Developed and tested on NVIDIA A100-SXM4-80GB. Total compute: 7.03 hours.
