🧐 About | 🚀 Quick Start | 📝 Citation | 🙏 Acknowledgements
## 🧐 About

This code measures the prediction noise, data noise, and total noise of many LLMs/agents on many evals. These measurements allow us to estimate the statistical significance of any result on these evals. Using paired analysis, we find that the model predictions are responsible for more noise than the data, so by reducing the prediction noise we can usually detect effect sizes 2 times smaller for unrelated models and 6+ times smaller for related models.
The reference noise measurements for many evals are here, with links to interactive figures such as noise vs. accuracy and the prediction heatmaps.
For results based on one prediction per example, use the total standard error shown under `SE(A-B)`. To gauge the potential of noise reduction, the remaining data standard error is shown under `SE_x(A-B)`.
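On paired data, `SE(A-B)` is the standard error of the per-example score differences. Below is a minimal sketch of the standard paired computation, not the repo's estimators; the accuracy arrays are made up for illustration:

```python
import numpy as np

# Hypothetical per-example accuracies for models A and B on the same examples.
acc_a = np.array([1, 0, 1, 1, 0, 1, 1, 0], dtype=float)
acc_b = np.array([1, 0, 0, 1, 1, 1, 0, 0], dtype=float)

d = acc_a - acc_b                        # per-example differences
n = len(d)
se_paired = d.std(ddof=1) / np.sqrt(n)   # paired SE(A-B)

# The unpaired SE ignores the per-example correlation and is usually
# larger, which is why paired analysis can detect smaller effects.
se_unpaired = np.sqrt((acc_a.var(ddof=1) + acc_b.var(ddof=1)) / n)
```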
- Since LLMs generate independent and diverse samples, we can draw multiple samples per question to measure the noise components.
- `estimators.py` contains the Paired and Unpaired estimators for this; a simplified sketch follows this list.
- Measuring the noise of many pairs of models reveals some clear patterns.
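As a simplified illustration of how multiple samples per question separate the two components, here is a sketch based on the law of total variance, not the actual `estimators.py` implementation:

```python
import numpy as np

def noise_components(samples):
    """Split eval noise into data and prediction components.

    samples: 0/1 correctness array of shape (n_questions, k),
    with k >= 2 independent samples per question.
    """
    n, k = samples.shape
    per_q_mean = samples.mean(axis=1)
    # prediction noise: within-question variance across samples
    pred_var = samples.var(axis=1, ddof=1).mean()
    # data noise: variance of per-question means, minus the prediction
    # noise still present in each k-sample mean
    data_var = max(per_q_mean.var(ddof=1) - pred_var / k, 0.0)
    # standard error of the overall accuracy using all n * k samples
    se_total = np.sqrt(data_var / n + pred_var / (n * k))
    return data_var, pred_var, se_total
```

With `k = 1` the two components cannot be separated, which is why multiple samples per question are needed.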
The original Eval-Arena docs are at `doc/eval-arena-readme.md`.
## 🚀 Quick Start

To generate the static summaries and figures, install the requirements and set `OUTPATH`:

```bash
python -u run_arena.py data="data/vllm_evals/highk_temp0.7.jsonl" \
    out_dir=${OUTPATH}/highk_temp0.7 \
    max_diff=0.2 recompute=True
```

To view the results:

```bash
cd ${OUTPATH}/highk_temp0.7
python -m http.server
```

The question-level metrics are stored in this format:
{"benchmark_id":"humaneval", "model":"code-llama-multi-34b", "example_id":"HumanEval4", "pass1":1, "correct":2, "count":2}
{"benchmark_id":"CRUXEval-input", "model":"phind", "example_id":"CRUXEval-input0", "pass1":0.8, "correct":4, "count":5}benchmark_id, model and example_id should together be unique. pass1 is the ratio of correct results out of count attempts.
Data contributions are welcome via pull requests to data.
The datasets used to produce the results are in this release. The corresponding runs are in `submit_all.sh`.
This data is visualized as heatmaps, accessible through the "data" link of each eval from the main table. All raw data files and figures can be accessed through the "raw" link, which points to each benchmark's raw_index.html.
## 📝 Citation

```bibtex
@misc{wang2025allthenoises,
      title={Measuring all the noises of LLM Evals},
      author={Sida Wang},
      year={2025},
      eprint={2512.21326},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.21326},
}
```

## 🙏 Acknowledgements

I thank Sean O'Brien, Lovish Madaan, Dieuwke Hupkes, Alex Gu, Jiawei Liu, Yuhang Lai, Linyuan Gong, and Sten Sootla for making question-level data available for analysis. I am extremely grateful to Evan Miller, Nicolas Usunier, Zach Rait, Yuxiang Wei, Jannik Kossen, and Ari Holtzman for valuable discussions and feedback, and to Pedro Rodriguez, Ofir Press, Naman Jain, Baptiste Rozière, Gabriel Synnaeve, Dawn Song, and Zijian Wang for their advice and support. The all-pairs approach is inspired by Chatbot Arena, and the clarity of Miller (2024) greatly helped.