📉 Even SOTA models like DeepSeek-R1 exhibit substantial performance degradation under stress testing.
Table: Performance of DeepSeek-R1 under traditional single-question testing (single) and multi-question stress testing (stress).

| Mode | GSM8K | MATH500 | AMC23 | AIME24 | AIME25 | GPQA Diamond | LiveCodeBench(v5) |
|---|---|---|---|---|---|---|---|
| Single | 96.20 | 97.00 | 93.75 | 81.66 | 68.75 | 70.20 | 63.44 |
| Stress | 96.16 | 92.09 | 81.80 | 52.49 | 37.17 | 64.63 | 40.83 |
📊 REST enhances the discriminative power of existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations.
Table: Performance of different LRMs on MATH500 under single-question testing (single) and multi-question stress testing (stress).

| Mode | DS-R1-1.5B | L1-Qwen-1.5B-Max | DS-R1-7B | AReaL-boba-RL-7B | OpenR1-Qwen-7B | Nemotron-Nano-8B | DS-R1-32B | DeepSeek-R1 |
|---|---|---|---|---|---|---|---|---|
| Single | 83.40 | 83.40 | 93.00 | 95.00 | 92.20 | 94.40 | 94.60 | 97.00 |
| Stress | 42.47 | 73.23 | 66.75 | 60.77 | 81.64 | 86.04 | 88.97 | 92.09 |
💡 "Overthinking" is a ritical factor contributing to the performance degradation and "Long2short" technique can help.
Figure: The effect of Long2Short training. Long2Short training mitigates the performance degradation under high stress levels (number of questions per input).
(Panels: 1.5B Models on MATH500 | 7B Models on MATH500 | 7B Models on AMC23)
✅ Under stress testing, capable LRMs employ concise reasoning for earlier questions.
Figure: The reasoning token count for questions at different positions on AIME24 under stress testing.
(Panels: DS-R1-Distill-Qwen-7B | Nemotron-nano-7B | DeepSeek-R1)
After installation, run the following scripts to reproduce our evaluation results. To evaluate API-based models, please specify "OPENAI_API_BASE" and "OPENAI_API_KEY" in these scripts.
```bash
bash sh/eval_math.sh
# Code data will be automatically downloaded from OpenCompass
bash sh/eval_code.sh
```
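If the scripts read these settings from the environment (an assumption; otherwise edit the variables directly inside the scripts), you can export placeholder credentials once before running them:

```bash
# Placeholder credentials for an OpenAI-compatible endpoint; replace with your own.
export OPENAI_API_BASE="https://api.openai.com/v1"
export OPENAI_API_KEY="sk-your-key-here"
```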
To evaluate GPQA, we use gemma-3-27b-it to extract the answer for each question, because LRMs often fail to put each answer within "\boxed{}". We use SGLang to deploy gemma-3-27b-it; you can install it in a separate environment:
```bash
# Install sglang==0.4.4.post3 in another environment.
conda create -n sglang044 python=3.10 -y  # pin a Python version so pip/uv install into this env
conda activate sglang044
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]==0.4.4.post3"
```
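For reference, `sh/serve_gemma3.sh` presumably wraps the standard SGLang launch command. A minimal sketch of serving gemma-3-27b-it manually might look like the following; the Hugging Face model id, tensor-parallel size, and port are illustrative assumptions, not repository defaults:

```bash
# Sketch only: model path, TP size, and port are assumptions; adjust to your hardware.
conda activate sglang044
python -m sglang.launch_server \
  --model-path google/gemma-3-27b-it \
  --tp 4 \
  --port 30000
```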
Set "VERIFYER_MODEL_NAME", "VERIFYER_API_BASE", "VERIFYER_API_KEY" in "sh/eval_gpqa.sh" and run inference and evaluation separately.
```bash
bash sh/eval_gpqa.sh infer
bash sh/serve_gemma3.sh &
bash sh/eval_gpqa.sh eval
```
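As an illustration, the verifier settings could point at a locally served gemma-3-27b-it. The values below are assumptions matching the serving sketch above (SGLang exposes an OpenAI-compatible endpoint, and a dummy key is typically sufficient for a local server):

```bash
# Assumed values for a local SGLang server; adjust to your deployment.
VERIFYER_MODEL_NAME="google/gemma-3-27b-it"
VERIFYER_API_BASE="http://127.0.0.1:30000/v1"
VERIFYER_API_KEY="EMPTY"
```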
To evaluate your own model, set "MODEL_NAME" (a valid Hugging Face model name), "TP_SIZE", and "TEMPERATURE" in "eval_custom_model.sh", then run:
```bash
bash sh/eval_huggingface_model.sh
```
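For example, the variables inside the script might be set as follows; the model name, TP size, and temperature are illustrative choices, not repository defaults:

```bash
# Illustrative values only; edit the corresponding lines in the evaluation script.
MODEL_NAME="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # any valid Hugging Face model id
TP_SIZE=2         # tensor-parallel size, depending on available GPUs
TEMPERATURE=0.6   # sampling temperature
```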
We thank OpenCompass for its open-source code.
Please cite our paper if you refer to our code, results, or paper.
```bibtex
@misc{pan2025REST,
  title={REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once},
  author={Zhuoshi Pan and Qizhi Pei and Yu Li and Qiyao Sun and Zinan Tang and H. Vicky Zhao and Conghui He and Lijun Wu},
  year={2025},
  eprint={2507.10541},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.10541},
}
```