
# astro-bench

Can AI agents build better cosmic ray classifiers than physicists?

This is a benchmark for AI agents and humans building ML classifiers on real astrophysics data from the KASCADE experiment — a 200×200 m detector array in Karlsruhe, Germany, that measured cosmic-ray air showers for roughly 25 years.

Two tasks, two leaderboards. Read the full challenge description: `challenge.md`

## Leaderboard: Mass Composition (5-class)

Classify cosmic ray primaries into proton, helium, carbon, silicon, iron.

| Rank | Accuracy ↑ | Author | Agent? | Architecture | Link |
|------|------------|--------|--------|--------------|------|
| 1 | ~51% | Kuznetsov, Petrov et al. | No | CNN (LeNet-5), QGS-only | JINST 2024 |
| 2 | 50.86% | Claude Haiku 4.5 | Yes | CNN+MLP hybrid (622K params) | train.py |
| 3 | 49.9% | Claude Opus 4.6 (supervised) | Yes | MLP (512×2, ELU+BN) | train.py |
| 4 | 29.5% | baseline | — | RandomForest (5 features) | this repo |

## Leaderboard: Gamma/Hadron Separation (binary)

Distinguish gamma rays from hadronic cosmic rays. Key metric: hadronic survival rate at 75% gamma efficiency (lower is better). Published suppression of 10²–10³ was measured at ~70% gamma efficiency (ICRC 2021).

| Rank | Survival ↓ (@ 75% γ eff) | Author | Agent? | Architecture | Link |
|------|--------------------------|--------|--------|--------------|------|
| 1 | 3.2×10⁻⁴ | Claude Haiku 4.5 | Yes | Ensemble: Attention CNN + ResNet + ViT | train.py |
| 2 | 6.4×10⁻⁴ | Claude Haiku 4.5 | Yes | MLP ensemble (BCELoss + classification) | train.py |
| 3 | 3.2×10⁻³ | Claude Haiku 4.5 | Yes | DNN + physics ensemble | train.py |
| 4 | 5.1×10⁻³ | Claude Opus 4.6 (supervised) | Yes | MLP (512×2, class weights) | train.py |
| 5 | 7.3×10⁻³ | Claude Haiku 4.5 | Yes | MLP (517→512→512, class weights) | train.py |
| ref | 10⁻²–10⁻³ | Kostunin et al. | No | RF regressor | ICRC 2021 |
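For intuition, the survival-rate metric can be computed from per-event gamma scores roughly like this (a minimal sketch; the function name and the convention that higher scores mean more gamma-like are assumptions, not part of this repo — `verify.py` is the authoritative scorer):

```python
import numpy as np

def hadron_survival(gamma_scores, hadron_scores, gamma_eff=0.75):
    """Fraction of hadrons surviving the cut that keeps `gamma_eff` of gammas.

    Assumes higher score = more gamma-like; an event passes if score >= threshold.
    """
    # Score threshold above which `gamma_eff` of the true gammas remain
    thr = np.quantile(gamma_scores, 1.0 - gamma_eff)
    return float(np.mean(hadron_scores >= thr))
```

A lower return value means stronger hadron suppression at the fixed gamma efficiency, which is why the leaderboard sorts ascending.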

## Quick Start

```sh
uv sync
uv run python download_data.py                                      # ~8.6 GB from S3
uv run python verify.py submissions/X/predictions.npz               # composition
uv run python verify.py --task gamma submissions/X/predictions.npz  # gamma
```

## How It Works

1. `download_data.py` downloads pre-split, memory-mappable `.npy` files
2. You (or your agent) build a classifier — any tools/frameworks, no constraints
3. Produce `predictions.npz` and run `verify.py` to score
4. Submit via Issue or PR
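In practice, the loading and packaging steps above might look like the sketch below. The file names and the `predictions` archive key are assumptions for illustration — `challenge.md` defines the actual data format and submission layout:

```python
import numpy as np

# Create a tiny stand-in for one of the downloaded .npy files
# (real files come from download_data.py; this name is a hypothetical placeholder)
np.save("demo_features.npy", np.random.rand(1000, 16).astype(np.float32))

# Memory-map the array so the full ~8.6 GB dataset never needs to fit in RAM
X = np.load("demo_features.npy", mmap_mode="r")

# ... train any classifier, then predict class indices for the test split ...
preds = np.zeros(X.shape[0], dtype=np.int64)  # placeholder predictions

# Package for verify.py (the `predictions` key name is an assumption)
np.savez("predictions.npz", predictions=preds)
```

Memory-mapping keeps startup cheap: slices are read from disk on demand, so even modest machines can iterate over the full dataset.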

See `challenge.md` for data format, physics background, and submission details.

## What Makes This Different

Most ML benchmarks ask "what's the best model?" We also ask: which AI agent builds the best model, how does it approach the problem, and what does it cost?

The leaderboard tracks both what was achieved and how — making this a benchmark for AI agents as autonomous ML researchers.

## Context

Related work:

## License

MIT
