
GSM-Infinite: How Do Your LLMs Behave over Infinitely
Increasing Context Length and Reasoning Complexity?

GSM-Infinite is a reasoning benchmark that is completely synthetic, with no LLMs in the loop, and can generate problems whose context length and reasoning complexity scale without bound. Inspired by Physics of Language Models, Part 2.1, we abstract grade-school-level math problems into computational graphs, then use graph manipulation and graph-to-language mapping to generate LLM-readable (and human-readable) problems.

Limitations of Existing Long-context Benchmarks

Figure: RAG can robustly solve most of today's popular long-context benchmarks.
In this paper, we first point out the insufficiencies of existing long-context LLM evaluations, highlighting:
  1. Lack of reasoning complexity: most tasks rely on text retrieval, summarization, or QA.
  2. Lack of context length: some tasks are inherently short-context but are inflated to long context by injecting semantically irrelevant noise.
  3. Lack of scalability: admittedly, tasks with high reasoning complexity and high information density exist, but they require substantial human effort to gather, deduplicate, and verify. The resulting lack of scale in quantity makes them hard to adopt widely in the community.
The first two points are studied further in the figure above. These are not tasks that only long-context LLMs can do: we show that RAG systems are robust and perform on par with long-context LLMs. Given how efficiently RAG systems can be built and run, RAG is the more favorable choice in practice for these tasks. Therefore, we set out to solve the following problem.

Problem Statement: How can we develop a benchmark that contains sufficient problems at every fine-grained level of reasoning difficulty, from easy retrieval tasks to infinitely hard challenges, while providing infinitely customizable context length with high information density?

GSM-Infinite

We present GSM-Infinite, a benchmark whose test examples are completely synthesized and can therefore be scaled up infinitely in both context length and reasoning complexity.

Importantly, the generated context is high in information density, as shown by the study in the figure below.

Figure: RAG methods' performance on GSM-Infinite.
(a) and (b) show that the retriever, all-mpnet-base-v2, cannot distinguish the close noise we generate from the essential blocks, as it comfortably can for the variable-tracking (vt) task in RULER. (c) and (d) show that the retriever's performance is much lower than that of long-context LLMs on both the Medium and Hard subsets of GSM-Infinite, indicating that these tasks are solvable only by long-context LLMs.

Leaderboards

Here we provide both the Zero Noise and the Long-context leaderboards. Since the leaderboards are updated from time to time, please check out our Hugging Face Space for the latest models and results.

First, we evaluated 18 models on GSM-Infinite Zero Noise. Their performance is as follows.

| Models | Symbolic | Medium | Hard | 1st<50% op on Hard | 1st<10% op on Hard | Avg. Acc op≤30 on Hard | Average↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 | 7280.0 | 9750.85 | 8573.8 | 100 | >130 | 0.9427 | 8534.88 |
| GPT-o3-mini | 6690.0 | 8335.66 | 5769.96 | 70 | 110 | 0.9423 | 6931.88 |
| GPT-o1-mini | 5060.0 | 6054.91 | 3738.43 | 50 | 90 | 0.8397 | 4951.11 |
| DeepSeek-V3 | 4310.0 | 4100.81 | 2407.86 | 24 | 55 | 0.6669 | 3606.22 |
| QwQ-32B-preview | 3530.0 | 3205.75 | 1846.19 | 21 | 50 | 0.5403 | 2860.65 |
| Gemini-1.5-Pro-002 | 2547.0 | 3659.59 | 2318.28 | 26 | 45 | 0.6924 | 2841.62 |
| Claude-3.5-Sonnet | 2161.0 | 3281.8 | 2115.79 | 26 | 40 | 0.6758 | 2519.53 |
| Mistral-Large | 2332.5 | 2879.92 | 2310.49 | 25 | 40 | 0.6645 | 2507.64 |
| Qwen2.5-72B-Instruct | 2048.0 | 2496.81 | 2016.38 | 21 | 40 | 0.5433 | 2187.06 |
| GPT-4o | 2379.0 | 2457.37 | 1451.54 | 18 | 30 | 0.5064 | 2095.97 |
| Gemini-1.5-Flash-002 | 1970.0 | 1478.75 | 1274.25 | 19 | 30 | 0.4460 | 1574.33 |
| Llama3.1-70B-Instruct | 1769.0 | 1650.25 | 1205.25 | 10 | 30 | 0.4314 | 1541.50 |
| MiniMax-Text-01 | 1618.5 | 1712.64 | 1178.51 | 14 | 30 | 0.4213 | 1503.22 |
| GPT-4o-mini | 1389.0 | 1406.5 | 913.89 | 12 | 22 | 0.3094 | 1236.46 |
| Claude-3.5-Haiku | 897.0 | 1053.16 | 784.34 | 10 | 22 | 0.2910 | 911.50 |
| Qwen2.5-7B-Instruct | 786.95 | 886.75 | 618.5 | 7 | 16 | 0.2257 | 764.07 |
| Llama3.1-8B-Instruct | 462.0 | 786.5 | 606.5 | 6 | 17 | 0.2212 | 618.30 |
| Jamba-1.5-Large | 856.0 | 485.13 | 466.4 | 6 | 26 | 0.1828 | 602.51 |

Second, we evaluated 11 models on the GSM-Infinite long-context tasks.

| Model | 8K | 16K | 32K | Average↑ |
| --- | --- | --- | --- | --- |
| gemini-1.5-pro-002 | 1182.43 | 896.31 | 812.96 | 963.9 |
| qwen-2.5-72b-instruct | 927.33 | 681.53 | 563.65 | 724.17 |
| mistral-large-2411 | 914.49 | 563.73 | 319.21 | 599.14 |
| deepseek-v3 | 935.10 | 477.02 | 313.66 | 575.2 |
| gemini-1.5-flash-002 | 673.88 | 476.72 | 377.38 | 509.3 |
| llama-3.1-70b-instruct | 479.00 | 394.50 | 355.5 | 409.67 |
| minimax-text-01 | 481.32 | 359.56 | 325.95 | 388.94 |
| gpt-4o-mini | 401.00 | 337.81 | 275.63 | 338.15 |
| qwen-2.5-7b-instruct | 248.00 | 211.50 | 196.17 | 218.56 |
| llama-3.1-8b-instruct | 183.67 | 149.50 | 109.45 | 147.54 |

We present a detailed description of the data generation process and of evaluation findings that uniquely benefit from the design of GSM-Infinite. Please make sure to check out our paper.

Overview of the Code Organization

As described in the paper, GSM-Infinite has three subtasks: Symbolic, Medium, and Hard. The classification is mainly based on semantic hierarchy; see the paper for details. Below is an overview of how the files and folders are organized.

The main components of the code are the data generation and model evaluation scripts. Since there are some subtle differences between the Symbolic and Realistic pipelines, we separate them into two folders.
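
For orientation, here is a rough sketch of the layout, showing only the pieces referenced in this README (the repository contains additional files, and the exact nesting may differ):

    symbolic/                      # Symbolic subtask
      config.sh                    # sampling and evaluation settings
      run.sh                       # samples and/or evaluates predictions
      data/generate_symbolic.sh    # regenerates the Symbolic dataset
      datasets/                    # sampled predictions land here
      results/                     # evaluation results land here
    realistic/                     # Medium and Hard subtasks
      config.sh
      run.sh                       # samples predictions, then evaluates via eval_realistic.py
      eval_realistic.py
      data/test_generate3.sh       # regenerates the Realistic data
      datasets/
      results/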

Environment Installation

pip install -r requirements.txt 

If you want to serve models locally, please install a serving platform of your choice (vLLM, SGLang, etc.).
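
For illustration only (this command is not part of the repo's scripts), serving an open-source model behind an OpenAI-compatible endpoint with vLLM might look like the following; the exact flags depend on your vLLM version and hardware, and SGLang offers a similar launcher:

    # Hypothetical example: expose an OpenAI-compatible API on port 8000 with vLLM.
    # Adjust the model name, port, and flags to your setup.
    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Once the server is up, point the OpenAI-compatible base URL in your config.sh at http://localhost:8000/v1 and use any placeholder API key.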

Generation and Evaluation of Symbolic Dataset

We provide a `run.sh` script to sample from and evaluate on the Symbolic dataset. Below is a quick walkthrough:
  1. Navigate to the Symbolic directory

    cd symbolic

In this repo, we recommend running evaluations through an API-calling mechanism. Even for open-source models, we advise either deploying them locally via vLLM/SGLang or using API providers such as DeepInfra.

  2. Edit config.sh (a minimal sketch of this file follows the walkthrough)

    • Set run_sampling to true if you want to sample new predictions from your model. Set to false to skip sampling.
      run_sampling=true      # Set to true to sample from the model
    • Set run_evaluation to true if you want to evaluate existing predictions (this requires an evaluation model, typically a smaller LLM, specified in EVAL_OPENAI_* variables). Set to false to skip evaluation.
      run_evaluation=true    # Set to true to evaluate existing predictions
    • Configure the sampling model details (if run_sampling=true):
      • backend_type: 'openai', 'gemini', or 'anthropic'
      • SAMPLER_OPENAI_BASE_URL and SAMPLER_OPENAI_API_KEY (or GEMINI_API_KEY or ANTHROPIC_API_KEY)
      • model_name, dataset_base (if you want to use custom datasets)
      • num_samples, temperature, max_tokens, etc.
    • Configure the evaluation model details (if run_evaluation=true):
      • EVAL_OPENAI_BASE_URL and EVAL_OPENAI_API_KEY (for an OpenAI-compatible evaluation model)
  3. Run the script

    bash -x run.sh
  4. Check your output

    • New predictions (if sampled) will be saved in the datasets folder.
    • Evaluation results (if generated) will be in the results folder.
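
For reference, a minimal config.sh might look like the sketch below. This is only an illustrative assembly of the variables described above; the shipped config.sh may contain additional options, and every value shown (endpoints, model name, sampling parameters) is a placeholder:

    # Illustrative sketch of config.sh -- variable names from the steps above, values are placeholders
    run_sampling=true                      # sample new predictions from the model
    run_evaluation=true                    # evaluate existing predictions

    # Sampling model (used when run_sampling=true)
    backend_type='openai'                  # 'openai', 'gemini', or 'anthropic'
    SAMPLER_OPENAI_BASE_URL='http://localhost:8000/v1'
    SAMPLER_OPENAI_API_KEY='EMPTY'
    model_name='Qwen2.5-7B-Instruct'
    num_samples=1
    temperature=0.0
    max_tokens=4096

    # Evaluation model (used when run_evaluation=true), any OpenAI-compatible endpoint
    EVAL_OPENAI_BASE_URL='https://api.openai.com/v1'
    EVAL_OPENAI_API_KEY='sk-your-key'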

If you want to generate the data yourself, look into the data folder and inspect generate_symbolic.sh. Fill in your dataset settings (name, number of operations, context length), then run:

bash -x generate_symbolic.sh 
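
The authoritative variable names live in generate_symbolic.sh itself; as a purely hypothetical illustration, the settings to fill in have roughly this shape (the names below are made up for the example):

    # Hypothetical settings -- consult generate_symbolic.sh for the real variable names
    dataset_name='symbolic_custom'   # name of the generated dataset
    ops='2 4 8 16'                   # reasoning complexity (number of operations)
    lengths='0 8k 16k 32k'           # target context lengths (0 = zero noise)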

Generation and Evaluation of Realistic Dataset

The Realistic dataset (Medium and Hard subsets) uses a similar process:

  1. Navigate to the Realistic directory

    cd realistic
  2. Edit config.sh (a minimal sketch follows this walkthrough)

    • Fill in your API keys, backend type, model name, etc.

    • Adjust lengths and dataset_suffixes to control which subsets and context lengths to process.

    • Configure the model details

      • backend_type: 'openai', 'gemini', or 'anthropic'
      • OPENAI_BASE_URL and OPENAI_API_KEY (or GEMINI_API_KEY or ANTHROPIC_API_KEY)
      • model_name, dataset_base (if you want to use custom datasets)
      • num_samples, temperature, max_tokens, etc.
  3. Run the script

    bash -x run.sh

    This script samples predictions and then automatically evaluates them using eval_realistic.py. Note that there is no separate run_evaluation flag here; evaluation always follows sampling.

  4. Check your output

    • New predictions will be saved in the datasets folder.
    • Evaluation results will be in the results folder.
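
The Realistic config.sh mirrors the Symbolic one, with unprefixed OPENAI_* variables plus the lengths and dataset_suffixes controls. A minimal sketch under the same caveats (all values are placeholders, and the exact value format for lengths and dataset_suffixes follows the shipped file):

    # Illustrative sketch of the Realistic config.sh -- values are placeholders
    backend_type='openai'              # 'openai', 'gemini', or 'anthropic'
    OPENAI_BASE_URL='http://localhost:8000/v1'
    OPENAI_API_KEY='EMPTY'
    model_name='Qwen2.5-7B-Instruct'
    lengths='8k 16k 32k'               # context lengths to process
    dataset_suffixes='medium hard'     # which subsets to run
    num_samples=1
    temperature=0.0
    max_tokens=4096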

If you want to generate the data yourself, look into the data folder and inspect test_generate3.sh. Fill in your dataset settings (number of operations, context length), then run:

bash -x test_generate3.sh 

Citation

If you find our codebase useful, please consider citing our work with the following BibTeX entry.
@misc{zhou2025gsminfinitellmsbehaveinfinitely,
  title={GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?},
  author={Yang Zhou and Hongyi Liu and Zhuoming Chen and Yuandong Tian and Beidi Chen},
  year={2025},
  eprint={2502.05252},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.05252},
}