Can AI agents build better cosmic ray classifiers than physicists?
This is a benchmark for AI agents and humans building ML classifiers on real astrophysics data from the KASCADE experiment — a 200×200m detector array in Karlsruhe, Germany that measured cosmic ray air showers for ~25 years.
Two tasks, two leaderboards. Read the full challenge description: challenge.md
**Task 1: Composition.** Classify cosmic ray primaries into five classes: proton, helium, carbon, silicon, and iron.
| Rank | Accuracy ↑ | Author | Agent? | Architecture | Link |
|---|---|---|---|---|---|
| 1 | ~51% | Kuznetsov, Petrov et al. | No | CNN (LeNet-5), QGS-only | JINST 2024 |
| 2 | 50.86% | Claude Haiku 4.5 | Yes | CNN+MLP hybrid (622K params) | train.py |
| 3 | 49.9% | Claude Opus 4.6 (supervised) | Yes | MLP (512×2, ELU+BN) | train.py |
| 4 | 29.5% | baseline | — | RandomForest (5 features) | this repo |
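For a sense of what the baseline row involves, a shower-level RandomForest takes only a few lines. This is a minimal sketch on synthetic stand-in data; the real 5 features and file layout come from `download_data.py` and challenge.md, not from this snippet.

```python
# Sketch of a 5-feature RandomForest composition baseline.
# The data here is random noise standing in for real shower features;
# real runs would load the .npy arrays fetched by download_data.py.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # 5 shower-level features (illustrative)
y = rng.integers(0, 5, size=1000)  # 5 classes: p, He, C, Si, Fe
X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"accuracy: {acc:.3f}")  # near chance (0.2) on random labels
```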
**Task 2: Gamma search.** Distinguish gamma rays from hadronic cosmic rays. Key metric: the hadronic survival rate at 75% gamma efficiency (lower is better). For reference, a published suppression factor of 10²–10³ was measured at ~70% gamma efficiency (ICRC 2021).
| Rank | Survival ↓ (@ 75% γ eff) | Author | Agent? | Architecture | Link |
|---|---|---|---|---|---|
| 1 | 3.2×10⁻⁴ | Claude Haiku 4.5 | Yes | Ensemble: Attention CNN + ResNet + ViT | train.py |
| 2 | 6.4×10⁻⁴ | Claude Haiku 4.5 | Yes | MLP ensemble (BCELoss + classification) | train.py |
| 3 | 3.2×10⁻³ | Claude Haiku 4.5 | Yes | DNN + physics ensemble | train.py |
| 4 | 5.1×10⁻³ | Claude Opus 4.6 (supervised) | Yes | MLP (512×2, class weights) | train.py |
| 5 | 7.3×10⁻³ | Claude Haiku 4.5 | Yes | MLP (517→512→512, class weights) | train.py |
| ref | 10⁻² – 10⁻³ | Kostunin et al. | No | RF regressor | ICRC 2021 |
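The survival metric in this table can be computed by thresholding classifier scores at the value that keeps 75% of true gammas, then counting the hadrons that pass. The sketch below uses synthetic scores; the exact scoring logic lives in `verify.py` and may differ in detail.

```python
# Hadronic survival rate at fixed gamma efficiency (illustrative sketch).
import numpy as np

def survival_at_gamma_eff(scores, is_gamma, eff=0.75):
    """Fraction of hadrons passing the cut that keeps `eff` of gammas."""
    # Threshold such that `eff` of true gammas score above it.
    thr = np.quantile(scores[is_gamma], 1.0 - eff)
    return float(np.mean(scores[~is_gamma] >= thr))

# Synthetic scores: gammas cluster high, hadrons low.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(2.0, 1.0, 1_000),
                         rng.normal(0.0, 1.0, 100_000)])
is_gamma = np.zeros(scores.size, dtype=bool)
is_gamma[:1_000] = True

surv = survival_at_gamma_eff(scores, is_gamma)
print(f"survival @ 75% gamma eff: {surv:.4f}")
```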
```bash
uv sync
uv run python download_data.py                                      # ~8.6 GB from S3
uv run python verify.py submissions/X/predictions.npz               # composition
uv run python verify.py --task gamma submissions/X/predictions.npz  # gamma
```

- `download_data.py` downloads pre-split, memory-mappable `.npy` files
- You (or your agent) build a classifier — any tools/frameworks, no constraints
- Produce `predictions.npz` and run `verify.py` to score
- Submit via Issue or PR
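A submission is just a saved NumPy archive. A minimal sketch, assuming `verify.py` reads an array keyed `predictions` (check challenge.md for the exact key, shape, and dtype expected):

```python
# Writing a submission file with np.savez. The key name "predictions"
# and the all-zeros (all-proton) placeholder are assumptions for
# illustration only.
import numpy as np
import os, tempfile

out = os.path.join(tempfile.gettempdir(), "predictions.npz")
preds = np.zeros(10, dtype=np.int64)       # placeholder class labels
np.savez(out, predictions=preds)

loaded = np.load(out)["predictions"]       # round-trip sanity check
print(loaded.shape, loaded.dtype)
```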
See challenge.md for data format, physics background, and submission details.
Most ML benchmarks ask "what's the best model?" We also ask: which AI agent builds the best model, how does it approach the problem, and what does it cost?
The leaderboard tracks both what was achieved and how — making this a benchmark for AI agents as autonomous ML researchers.
Related work:
- AI Agents for Ground-Based Gamma Astronomy (Kostunin, Sotnikov et al., 2025)
- New insights from old cosmic rays (Kostunin, Plokhikh et al., ICRC 2021) — foundational analysis: RF composition + gamma search
- Methods of ML for cosmic rays mass composition (Kuznetsov, Petrov, Plokhikh, Sotnikov, JINST 2024) — CNN/MLP/RF comparison
- Energy spectra of elemental groups of cosmic rays (Kuznetsov, Petrov, Plokhikh, Sotnikov, JCAP 2024) — mass spectra results
- autoresearch (Karpathy, 2026) — autonomous AI agents doing ML research overnight
- Addition Under Pressure (Papailiopoulos, 2026) — comparing agent research paths
License: MIT