This repository contains the code and benchmark for the ICML 2025 paper: "Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents" (arXiv:2501.18411). Visit our project website at gravitybench.github.io for a discussion of the results.
Gravity-Bench is a benchmark that evaluates AI agents on gravitational physics discovery: agents are given tools to observe a two-body gravitational system and to write code that analyzes those observations and solves the assigned problem.
The tasks include difficult problems rarely seen in textbooks, such as determining how gravity has been modified or inferring the drag coefficient that has been added to the system.
These tasks challenge AI agents to exercise skills that mirror real-world science, including iterative reasoning, planning, and generalization.
- **Physics-based Environment**: Uses high-precision simulations capturing realistic two-body orbital dynamics. Simulations are configured in `scripts/scenarios_config.py` and are simulated in `generalscenarios/Binary.py`. Observational data is stored in `scenarios/sims/` (for agent use), and detailed simulation data for verification is stored in `scenarios/detailed_sims/`.
- **Partial Observability**: Agents can operate in two modes:
  - **Full-Observation (full-obs)**: Access the complete simulation data at once via a `pandas` dataframe (e.g., via `scripts/run_agent.py`).
  - **Budgeted Observation (budget-obs)**: The agent must collect observations incrementally under a defined observation budget, usually 100 observations (e.g., via `scripts/run_agent.py --row-wise`). This mode emphasizes strategic observation planning.
- **Tasks**: Each task is assigned to multiple simulation variations, configured in `scripts/scenarios_config.json`. See Appendix B of our paper for a detailed description of the benchmark problems.
- **Expert Reference Solutions & Ground Truth**:
  - **Expert Reference Solutions**: Implemented in each task's `.py` file (e.g., `scenarios/mass_star1.py`), these solutions emulate an expert's approach using only the observable data (full-obs, or budget-obs with uniform sampling). They serve as a baseline for agent performance and are generated using `scripts/run_expert_solution.py`.
  - The same `true_answer` methods can also return exact values derived from the simulation inputs or the simulation's internal calculations (when `verification=True` and `return_empirical=False`). This is used to verify the correctness of the environment and of the empirical solutions.
- **Agent**: The benchmark is designed for iterative, tool-augmented agents. The provided baseline agent (`agents/tabular_agent.py`) uses:
  - An `Observe` tool for taking measurements at specific times (`agents/tools/observe_tool.py`).
  - A Python REPL tool for data analysis (`agents/tools/python_repl_tool.py`).
    - Warning: The Python REPL tool executes arbitrary code without containerization and should be used with caution.
  - A `submit_answer` tool for providing the final numeric or boolean answer (`agents/tools/submit_answer_tool.py`).

  The agent is executed using `scripts/run_agent.py` or `scripts/run_agent_range_of_budgets.py` (for varying observation budgets). The agent prompts are detailed in Appendix D of our paper. A schematic sketch of this observe-analyze-submit loop is shown after this list.
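The following is a minimal, self-contained sketch of that loop. It is purely illustrative: the stub `observe`, `analyze`, and `submit_answer` functions, the synthetic signal, and the column names are assumptions made for this example and do not reproduce the actual tool interfaces in `agents/tools/`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Observe tool: return noisy x-positions of star 1
# at the requested times. In the benchmark, observations come from the
# pre-computed simulations under scenarios/sims/ instead.
def observe(times_s: np.ndarray) -> pd.DataFrame:
    true_period_s = 3.0e7
    x = np.cos(2 * np.pi * times_s / true_period_s)
    x += np.random.normal(0.0, 1e-3, times_s.size)
    return pd.DataFrame({"time": times_s, "star1_x": x})

# Stand-in for the analysis an agent would run in the Python REPL tool:
# estimate the orbital period from the dominant Fourier frequency of star1_x.
def analyze(df: pd.DataFrame) -> float:
    dt = float(df["time"].iloc[1] - df["time"].iloc[0])
    power = np.abs(np.fft.rfft(df["star1_x"] - df["star1_x"].mean()))
    freqs = np.fft.rfftfreq(len(df), d=dt)
    dominant = np.argmax(power[1:]) + 1   # skip the zero-frequency bin
    return 1.0 / freqs[dominant]

# Stand-in for the submit_answer tool.
def submit_answer(value: float) -> None:
    print(f"Submitted answer: {value:.3e} s")

# Budget-obs style run: spend a budget of 100 observations (here sampled
# uniformly in time, a deliberately simple strategy), then answer.
budget = 100
observations = observe(np.linspace(0.0, 1.0e8, budget))
submit_answer(analyze(observations))
```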
- **Install Dependencies**:

  `uv` is a fast Python package manager that automatically handles virtual environments.

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh  # Or: pip install uv

  # Install dependencies (automatically creates a virtual environment)
  uv sync

  # Activate the environment (optional - uv run will auto-activate)
  source .venv/bin/activate
  ```

  Alternatively, install with conda or a plain virtual environment:

  ```bash
  # Install dependencies using conda
  conda env create -f environment.yml
  conda activate gravitybench
  ```

  ```bash
  # Create and activate a virtual environment
  python -m venv .venv
  source .venv/bin/activate

  # Install dependencies
  pip install -r requirements.txt
  ```
  **Note for Windows Users**: The benchmark code relies on the `rebound` physics simulation library, which does not support Windows. Windows users should use our standalone `quick_start.py` script instead, which works with our Huggingface dataset and does not require `rebound`. See the "Running your own models and agents" section below for details.
- **Configure API Keys**:

  Create a `.env` file with your API keys, following the template in `.env.example`.
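  As an illustration, a `.env` file might contain entries like the ones below. The exact variable names expected by the benchmark are listed in `.env.example`; the names here are common examples, not a definitive list.

  ```
  # Illustrative only - copy .env.example and fill in the keys you need
  OPENAI_API_KEY=sk-...
  ANTHROPIC_API_KEY=sk-ant-...
  ```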
- **Run a Scenario (Full Observations)**: This example runs the agent on the `modified_gravity_power_law` scenario task using the `gpt-4.1` model with full observation access.

  ```bash
  python scripts/run_agent.py --scenarios modified_gravity_power_law --model gpt-4.1
  ```

  Outputs (a JSON file and an HTML report) will be saved in a new subdirectory under `outputs/`, named like `outputs/<model>_<timestamp>/`.
- **Run a Scenario (Budgeted Observations)**: This example runs the agent on the `max_velocity_star1` scenario task, again with `gpt-4.1`, with a budget of 100 total observations, requested up to 10 at a time.

  ```bash
  python scripts/run_agent.py \
    --scenarios max_velocity_star1 \
    --model gpt-4.1 \
    --row-wise \
    --max-observations-total 100 \
    --max-observations-per-request 10
  ```
- **Explore Results**: After each run, navigate to the `outputs/` (or `outputs_range_of_N/`) directory. Inside the run-specific subfolder (e.g., `outputs/gpt-4o_<timestamp>/`), you will find:
  - A `<model>_<timestamp>.json` file containing detailed run data.
  - A `<model>_<timestamp>.html` file, which is an interactive report. Open this in your browser to see a summary and step-by-step logs for each scenario attempt.
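If you prefer to inspect runs programmatically rather than through the HTML report, a minimal sketch like the following lists the run JSON files under `outputs/` and prints their top-level keys. It makes no assumptions about the internal structure of the files beyond their location:

```python
import json
from pathlib import Path

# List every run JSON under outputs/ and print its top-level keys.
# The exact schema is defined by the benchmark's run scripts; this sketch
# only reports whatever keys happen to be present.
for run_file in sorted(Path("outputs").glob("*/*.json")):
    with open(run_file) as f:
        data = json.load(f)
    summary = list(data.keys()) if isinstance(data, dict) else f"list of {len(data)} entries"
    print(f"{run_file}: {summary}")
```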
**Running your own models and agents**

To run your own model, modify `agents/tabular_agent.py`. To run your own agent framework, we recommend adapting `quick_start.py`, which contains a minimal all-in-one example of a simple agent framework built on our Huggingface dataset. A rough sketch of that pattern is shown below.
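The sketch below illustrates the general shape of such a framework: load pre-computed observations, let an agent choose a limited number of rows under a budget, and score the final answer. The dataset loading, column names, and the trivial "agent" are placeholders invented for this example, not the actual schema; `quick_start.py` in the repository is the authoritative example.

```python
import numpy as np
import pandas as pd

# Placeholder for loading a benchmark task. In practice, quick_start.py pulls
# the observation tables and task metadata from our Huggingface dataset;
# here we fabricate a tiny table so the sketch runs standalone.
def load_task() -> tuple[pd.DataFrame, float]:
    times = np.linspace(0.0, 1.0e8, 10_000)
    true_period = 2.5e7
    obs = pd.DataFrame({"time": times,
                        "star1_x": np.cos(2 * np.pi * times / true_period)})
    return obs, true_period

def run_simple_agent(obs: pd.DataFrame, budget: int = 100) -> float:
    # A deliberately trivial "agent": uniformly sample `budget` rows
    # (budget-obs style), then estimate the orbital period from the spacing
    # of successive local maxima of star1_x.
    sample = obs.iloc[np.linspace(0, len(obs) - 1, budget, dtype=int)]
    peaks = sample[(sample["star1_x"] > sample["star1_x"].shift(1)) &
                   (sample["star1_x"] > sample["star1_x"].shift(-1))]
    return float(np.diff(peaks["time"]).mean())

obs, truth = load_task()
answer = run_simple_agent(obs)
print(f"estimated period: {answer:.3e} s, true period: {truth:.3e} s, "
      f"relative error: {abs(answer - truth) / truth:.2%}")
```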
To reproduce the main results presented in the paper (arXiv:2501.18411), follow these general steps:
- **Run Experiments with Different Agents and Settings**: Refer to Section 4 of the paper for details on the models and configurations tested.
  - Full-Observation Mode:

    ```bash
    python scripts/run_agent.py --simulate-all --model gpt-4o-mini-2024-07-18
    ```

  - Budgeted-Observation Mode (e.g., 100 observations):

    ```bash
    python scripts/run_agent.py --simulate-all --model gpt-4o-mini-2024-07-18 --row-wise --max-observations-total 100
    ```

  - Varying Observation Budgets (for specific scenarios/models, see Figure 2 in the paper):

    ```bash
    python scripts/run_agent_range_of_budgets.py --model gpt-4o-2024-11-20 --scenarios max_velocity_star1 periastron --variation "9.6 M, 3.1 M" --variation "3.1 M, 0.18 M, Elliptical, Single Orbit"
    ```
- **Generate Expert Baseline Data**: This script calculates the performance of the expert-defined empirical solutions.

  ```bash
  python scripts/run_expert_solution.py
  ```

  This will create `outputs/expert_baseline_results.csv`.
- **Aggregate Raw Experiment Results**: After running the agent experiments, combine their JSON outputs:

  ```bash
  python outputs/combine_results.py
  ```

  This creates `outputs/combined_results.csv` and `outputs/chat_histories.csv`.

  For results from `run_agent_range_of_budgets.py`:

  ```bash
  python outputs_range_of_N/aggregate.py
  ```

  This creates `outputs_range_of_N/aggregated_results.csv`.
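  For a quick programmatic look at the aggregated results, a minimal sketch like the following prints the row count and columns without assuming a particular schema (the `model` column name checked below is an assumption):

  ```python
  import pandas as pd

  # Inspect the combined results without assuming a particular schema.
  combined = pd.read_csv("outputs/combined_results.csv")
  print(f"{len(combined)} rows")
  print("columns:", list(combined.columns))

  # If a model-identifying column exists (name assumed here), show a coarse
  # per-model row count; adapt to the actual column names in the CSV.
  if "model" in combined.columns:
      print(combined.groupby("model").size())
  ```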
- **Generate Figures and Tables**: Run the analysis scripts located in the `analysis/` directory. These scripts use the aggregated CSV files.

  ```bash
  python analysis/table1_scores.py                    # For Table 1
  python analysis/table2_massassumption.py            # For Table 2 in the Appendix
  python analysis/fig_casestudy.py                    # For Figure 3 (Traces) - note: this script uses specific hardcoded observation data
  python analysis/plot_100_obs_human_performance.py   # For Figure 4 (Thresholds)
  python outputs_range_of_N/fig_maxvelocity_and_periastron.py  # For Figure 2 (Varying Budgets)
  python outputs_range_of_N/fig_periastron.py         # For Figure 5 (Periastron Case Study)
  ```

  The generated figures will typically be saved in `analysis/plots/` or `outputs_range_of_N/plots/`, and LaTeX tables in `analysis/tables/`.
If you use Gravity-Bench in your research, please cite our paper:
```bibtex
@misc{koblischke2025gravitybenchv1,
      title={Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents},
      author={Nolan Koblischke and Hyunseok Jang and Kristen Menou and Mohamad Ali-Dib},
      year={2025},
      eprint={2501.18411},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.

