
# astro-bench

Can AI agents build better cosmic ray classifiers than physicists?

This is a benchmark for AI agents and humans building ML classifiers on real astrophysics data from the KASCADE experiment — a 200×200 m detector array in Karlsruhe, Germany, that measured cosmic-ray air showers for roughly 25 years.

Two tasks, two leaderboards. Read the full challenge description: `challenge.md`

## Leaderboard: Mass Composition (5-class)

Classify cosmic ray primaries into proton, helium, carbon, silicon, iron.

| Rank | Accuracy ↑ | Author | Agent? | Architecture | Link |
|------|------------|--------|--------|--------------|------|
| 1 | ~51% | Kuznetsov, Petrov et al. | No | CNN (LeNet-5), QGS-only | JINST 2024 |
| 2 | 50.86% | Claude Haiku 4.5 | Yes | CNN+MLP hybrid (622K params) | train.py |
| 3 | 49.9% | Claude Opus 4.6 (supervised) | Yes | MLP (512×2, ELU+BN) | train.py |
| 4 | 29.5% | baseline | — | RandomForest (5 features) | this repo |

## Leaderboard: Gamma/Hadron Separation (binary)

Distinguish gamma rays from hadronic cosmic rays. Key metric: hadronic survival rate at 75% gamma efficiency (lower is better). Published suppression of 10²–10³ was measured at ~70% gamma efficiency (ICRC 2021).

| Rank | Survival ↓ (@ 75% γ eff) | Author | Agent? | Architecture | Link |
|------|--------------------------|--------|--------|--------------|------|
| 1 | 3.2×10⁻⁴ | Claude Haiku 4.5 | Yes | Ensemble: Attention CNN + ResNet + ViT | train.py |
| 2 | 6.4×10⁻⁴ | Claude Haiku 4.5 | Yes | MLP ensemble (BCELoss + classification) | train.py |
| 3 | 3.2×10⁻³ | Claude Haiku 4.5 | Yes | DNN + physics ensemble | train.py |
| 4 | 5.1×10⁻³ | Claude Opus 4.6 (supervised) | Yes | MLP (512×2, class weights) | train.py |
| 5 | 7.3×10⁻³ | Claude Haiku 4.5 | Yes | MLP (517→512→512, class weights) | train.py |
| ref | 10⁻²–10⁻³ | Kostunin et al. | No | RF regressor | ICRC 2021 |
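For intuition, the survival-rate metric can be computed from per-event gamma scores roughly like this (a minimal sketch; the function name and the convention that higher scores mean more gamma-like are assumptions, not part of this repo — `verify.py` is the authoritative scorer):

```python
import numpy as np

def hadron_survival(gamma_scores, hadron_scores, gamma_eff=0.75):
    """Fraction of hadrons surviving the cut that keeps `gamma_eff` of gammas.

    Assumes higher score = more gamma-like; an event passes if score >= threshold.
    """
    # Score threshold above which `gamma_eff` of the true gammas remain
    thr = np.quantile(gamma_scores, 1.0 - gamma_eff)
    return float(np.mean(hadron_scores >= thr))
```

A lower return value means stronger hadron suppression at the fixed gamma efficiency, which is why the leaderboard sorts ascending.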

## Quick Start

```sh
uv sync
uv run python download_data.py                                      # ~8.6 GB from S3
uv run python verify.py submissions/X/predictions.npz               # composition
uv run python verify.py --task gamma submissions/X/predictions.npz  # gamma
```

## How It Works

1. `download_data.py` downloads pre-split, memory-mappable `.npy` files
2. You (or your agent) build a classifier — any tools/frameworks, no constraints
3. Produce `predictions.npz` and run `verify.py` to score
4. Submit via Issue or PR
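In practice, the loading and packaging steps above might look like the sketch below. The file names and the `predictions` archive key are assumptions for illustration — `challenge.md` defines the actual data format and submission layout:

```python
import numpy as np

# Create a tiny stand-in for one of the downloaded .npy files
# (real files come from download_data.py; this name is a hypothetical placeholder)
np.save("demo_features.npy", np.random.rand(1000, 16).astype(np.float32))

# Memory-map the array so the full ~8.6 GB dataset never needs to fit in RAM
X = np.load("demo_features.npy", mmap_mode="r")

# ... train any classifier, then predict class indices for the test split ...
preds = np.zeros(X.shape[0], dtype=np.int64)  # placeholder predictions

# Package for verify.py (the `predictions` key name is an assumption)
np.savez("predictions.npz", predictions=preds)
```

Memory-mapping keeps startup cheap: slices are read from disk on demand, so even modest machines can iterate over the full dataset.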

See `challenge.md` for data format, physics background, and submission details.

## What Makes This Different

Most ML benchmarks ask "what's the best model?" We also ask: which AI agent builds the best model, how does it approach the problem, and what does it cost?

The leaderboard tracks both what was achieved and how — making this a benchmark for AI agents as autonomous ML researchers.

## Context

Related work:

## License

MIT
