This repository contains the official implementation of our work “Capacity-Aware Inference”, which investigates test-time load balancing in Mixture of Experts (MoE) and proposes efficient inference algorithms to alleviate the straggler effect.
The MoE architecture scales large language models by activating only a sparse subset of experts per input, improving efficiency without sacrificing model capacity.
However, during inference under expert parallelism, MoE models suffer from load imbalance — some experts process far more tokens than others. As a result, the faster, underloaded experts must wait for the slowest, overloaded ones to finish, leading to a global delay, which we term the Straggler Effect.
To address this issue, we introduce two complementary inference strategies:
- Capacity-Aware Token Drop — Enforces expert capacity limits by dropping excess tokens from overloaded experts, reducing load imbalance with negligible performance loss (e.g., a 30% speedup with only 0.9% degradation on OLMoE); see the sketch after this list.
- Capacity-Aware Expanded Drop — Further improves utilization by expanding token routing to nearby low-load experts before applying local capacity constraints, yielding more balanced expert workloads and faster inference; sketched after the summary table below.
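
As a rough illustration of Token Drop, here is a minimal PyTorch sketch. It keeps only the first `capacity` assignments per expert and zeroes out the routing weights of the rest; the names (`capacity_aware_token_drop`, `topk_ids`, `topk_weights`, `capacity_factor`) are illustrative, and the official code may prioritize tokens by routing score rather than arrival order.

```python
import torch
import torch.nn.functional as F

def capacity_aware_token_drop(topk_ids, topk_weights, num_experts, capacity_factor=1.0):
    """Zero out routing weights for tokens that exceed each expert's capacity.

    topk_ids:     (num_tokens, k) expert indices chosen by the router
    topk_weights: (num_tokens, k) corresponding routing weights
    """
    num_tokens, k = topk_ids.shape
    # Capacity = average load per expert, scaled by the capacity factor.
    capacity = int(capacity_factor * num_tokens * k / num_experts)

    flat_ids = topk_ids.reshape(-1)               # (num_tokens * k,)
    one_hot = F.one_hot(flat_ids, num_experts)    # (num_tokens * k, num_experts)
    # 1-indexed position of each (token, expert) assignment in its expert's queue.
    position = one_hot.cumsum(dim=0).gather(1, flat_ids.unsqueeze(1)).squeeze(1)
    # Keep only assignments that arrive before the expert is full.
    keep = (position <= capacity).reshape(num_tokens, k)
    return topk_weights * keep  # dropped tokens contribute nothing to that expert
```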
Extensive experiments on both language and multimodal MoE models validate our approach, showing substantial improvements in expert utilization, throughput, and model performance.
For example, applying Expanded Drop to Mixtral-8×7B-Instruct achieves a 1.85× inference speedup with a 0.2% average performance gain.
| Strategy | Description |
| --- | --- |
| Token Drop | Tokens exceeding expert capacity are dropped to mitigate the straggler effect. |
| Expanded Drop | Tokens are allowed to expand to additional low-load experts before dropping. |
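
For intuition, Expanded Drop is sketched below in a simplified top-1 form: each token greedily takes its highest-scoring expert that still has spare capacity, and is dropped only if all of its ranked experts are full. The routine and its names (`expanded_drop_route`, `capacity`) are illustrative; the actual method operates under top-k routing.

```python
import torch

def expanded_drop_route(router_logits, capacity):
    """Each token tries experts in descending score order and takes the first
    one with spare capacity; if every expert it ranks is full, it is dropped."""
    num_tokens, num_experts = router_logits.shape
    ranked = router_logits.argsort(dim=-1, descending=True)  # experts by preference
    load = torch.zeros(num_experts, dtype=torch.long)
    assignment = torch.full((num_tokens,), -1, dtype=torch.long)  # -1 = dropped
    for t in range(num_tokens):
        for e in ranked[t].tolist():
            if load[e] < capacity:
                assignment[t] = e   # expanded to the best expert with spare room
                load[e] += 1
                break
    return assignment
```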
To install dependencies:

```bash
pip install -r requirements.txt
```

We provide minimal working examples based on Hugging Face Transformers modeling files.
For system-level integration and large-scale deployment, please refer to the Megatron-LM framework.
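
A hypothetical loading sketch is shown below. The OLMoE model ID is real, but the `capacity_factor` attribute is an assumed knob; consult the provided modeling files for the actual interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load OLMoE with the repo's capacity-aware modeling file.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMoE-1B-7B-0924",
    trust_remote_code=True,  # pick up the modified MoE forward pass
)
model.config.capacity_factor = 1.0  # assumed knob: per-expert capacity limit
```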
Evaluation can be conducted using:
- lm-evaluation-harness for language benchmarks
- VLMEvalKit for multimodal benchmarks
We modify their inference logic to incorporate capacity-aware routing under varying capacity factors.
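
For example, a language benchmark can be run through lm-evaluation-harness's Python API as sketched below; how the capacity factor is applied depends on our modified inference logic, and the tasks and batch size shown are illustrative.

```python
import lm_eval

# Evaluate the capacity-aware model on standard language benchmarks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=allenai/OLMoE-1B-7B-0924,trust_remote_code=True",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"])
```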
If you find this work useful, please cite:
```bibtex
@misc{he2025capacityawareinferencemitigatingstraggler,
      title={Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts},
      author={Shwai He and Weilin Cai and Jiayi Huang and Ang Li},
      year={2025},
      eprint={2503.05066},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.05066},
}
```