
📉 REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

[Paper on arXiv](https://arxiv.org/abs/2507.10541)

Overview

📝 Key Takeaways

📉 Even SOTA models like DeepSeek-R1 exhibit substantial performance degradation under stress testing.

Table: Performance of DeepSeek-R1 under traditional single-question testing (Single) and multi-question stress testing (Stress).

| Mode | GSM8K | MATH500 | AMC23 | AIME24 | AIME25 | GPQA Diamond | LiveCodeBench (v5) |
|------|-------|---------|-------|--------|--------|--------------|--------------------|
| Single | 96.20 | 97.00 | 93.75 | 81.66 | 68.75 | 70.20 | 63.44 |
| Stress | 96.16 | 92.09 | 81.80 | 52.49 | 37.17 | 64.63 | 40.83 |

📊 REST enhances the discriminative power of existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations.

Table: Performance of different LRMs on MATH500 under multi-question stress testing (Stress), compared with single-question testing (Single).

| Mode | DS-R1-1.5B | L1-Qwen-1.5B-Max | DS-R1-7B | AReaL-boba-RL-7B | OpenR1-Qwen-7B | Nemotron-Nano-8B | DS-R1-32B | DeepSeek-R1 |
|------|------------|------------------|----------|------------------|----------------|------------------|-----------|-------------|
| Single | 83.40 | 83.40 | 93.00 | 95.00 | 92.20 | 94.40 | 94.60 | 97.00 |
| Stress | 42.47 | 73.23 | 66.75 | 60.77 | 81.64 | 86.04 | 88.97 | 92.09 |

💡 "Overthinking" is a ritical factor contributing to the performance degradation and "Long2short" technique can help.

Figure: The effect of Long2Short training, which mitigates the performance degradation under high stress levels (number of questions per input).

*(Panels: 1.5B Models on MATH500; 7B Models on MATH500; 7B Models on AMC23.)*

Under stress testing, capable LRMs employ concise reasoning for earlier questions.

Figure: The reasoning token count for questions at different positions on AIME24 under stress testing.

*(Panels: DS-R1-Distill-Qwen-7B; Nemotron-nano-7B; DeepSeek-R1.)*
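
To make the stress-testing setup concrete, below is a minimal sketch of how a multi-question prompt can be assembled; the exact template and instructions used by our evaluation scripts may differ.

```bash
# Minimal sketch (illustrative only): pack several questions into a single
# stress prompt. The real prompt template in the evaluation scripts may differ.
QUESTIONS=(
  "What is 15 * 13?"
  "Solve x^2 - 5x + 6 = 0."
  "How many primes are less than 20?"
)
PROMPT="Answer all of the following questions. Put each final answer in \\boxed{}."
i=1
for q in "${QUESTIONS[@]}"; do
  PROMPT+=$'\n\n'"Question ${i}: ${q}"
  i=$((i + 1))
done
printf '%s\n' "$PROMPT"
```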

🚀 Quick Start

REST mainly requires three packages; to install them, simply run `bash sh/install.sh`.

After installation, run the following scripts to reproduce our evaluation results. To evaluate API-based models, specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in these scripts.

```bash
bash sh/eval_math.sh
# Code data will be automatically downloaded from OpenCompass
bash sh/eval_code.sh
```
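
For API-based models, the two variables might look like this (placeholder values; set them inside the scripts):

```bash
# Placeholder values; any OpenAI-compatible endpoint works.
OPENAI_API_BASE="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."
```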

To evaluate GPQA, we use gemma-3-27b-it to extract the answer for each question, because LRMs often fail to put each answer within `\boxed{}`. We use SGLang to deploy gemma-3-27b-it; you can install it in a separate environment.

```bash
# Install sglang==0.4.4.post3 in another environment.
conda create -n sglang044 -y
conda activate sglang044
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]==0.4.4.post3"
```

Set "VERIFYER_MODEL_NAME", "VERIFYER_API_BASE", "VERIFYER_API_KEY" in "sh/eval_gpqa.sh" and run inference and evaluation separately.

```bash
bash sh/eval_gpqa.sh infer
bash sh/serve_gemma3.sh &
bash sh/eval_gpqa.sh eval
```

To evaluate your own model, set `MODEL_NAME` (a valid Hugging Face model name), `TP_SIZE`, and `TEMPERATURE` in `eval_custom_model.sh`.
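
For example (hypothetical settings):

```bash
# Hypothetical settings in sh/eval_custom_model.sh.
MODEL_NAME="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # any valid Hugging Face model id
TP_SIZE=1        # tensor-parallel size
TEMPERATURE=0.6  # sampling temperature
```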

```bash
bash sh/eval_huggingface_model.sh
```

Thanks to OpenCompass for their open-source code.

Citation

Please cite our paper if you use our code, results, or paper.

```bibtex
@misc{pan2025REST,
    title={REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once},
    author={Zhuoshi Pan and Qizhi Pei and Yu Li and Qiyao Sun and Zinan Tang and H. Vicky Zhao and Conghui He and Lijun Wu},
    year={2025},
    eprint={2507.10541},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2507.10541},
}
```
