Testing LLMs' ability to operate on 3D atomic structures.
"Forget the messy details, I just need a model that can play Lego with atoms." ⚛️🤖
- Installation
- Usage of the Bench
- Usage of the Gym (under construction)
- Contributing
- License
- Citation
## Installation

```bash
pip install -e .
```

## Usage of the Bench

To run the benchmark with your own model, implement the model in `src/models/` and add the corresponding parameters in `config/models.yaml`. Currently, `openai_model`, `azure_openai_model`, `huggingface_model`, and `vllm_model` are implemented.
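The exact interface is defined by the existing classes in `src/models/`; the snippet below is only a minimal sketch of what a custom wrapper might look like, assuming a single text-in/text-out generation method. The class name `MyCustomModel`, the method name `generate`, the constructor arguments, and the endpoint are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of a custom model wrapper; adapt the class and method
# names to match the base class actually defined in src/models/.
import requests


class MyCustomModel:
    """Thin text-in/text-out wrapper around a local inference endpoint."""

    def __init__(self, endpoint: str, model_name: str, temperature: float = 0.0):
        self.endpoint = endpoint          # e.g. an OpenAI-compatible completions URL
        self.model_name = model_name
        self.temperature = temperature

    def generate(self, prompt: str) -> str:
        # Send one prompt and return the raw text completion.
        response = requests.post(
            self.endpoint,
            json={
                "model": self.model_name,
                "prompt": prompt,
                "temperature": self.temperature,
            },
            timeout=300,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["text"]
```

The matching entry in `config/models.yaml` would then carry whatever constructor arguments your wrapper needs (endpoint, model name, sampling settings), mirroring the entries of the models that are already implemented.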
To run a benchmark:

```bash
python ./src/run_benchmark.py -t [benchmark_type] -m [model_name] -a [action_name] -b [batch_size] -n [num_batch]
```

Arguments:
| Argument | Description |
|---|---|
| `benchmark_type` | Benchmark to run. See Available Benchmarks below. |
| `model_name` | Model to test (e.g., `deepseek_chat`). |
| `action_name` | Action to test (see Available Actions below). Only for AtomWorld and PointWorld. |
| `batch_size` | Number of parallel LLM calls (default: 50). |
| `num_batch` | Number of batches to test (default: all data). |
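Each invocation covers one benchmark/model/action combination, so sweeping over several actions is simply a loop around the CLI. The driver sketch below shows one way to do that; the chosen actions, model, and batch settings are illustrative, not recommended values.

```python
# Sketch: run run_benchmark.py over several AtomWorld actions in sequence.
# Action names come from the "Available Actions" list below; flag values are illustrative.
import subprocess

ACTIONS = ["move_atom_action", "swap_atoms_action", "add_atom_action"]

for action in ACTIONS:
    cmd = [
        "python", "./src/run_benchmark.py",
        "-t", "atomworld",
        "-m", "deepseek_chat",
        "-a", action,
        "-b", "10",
        "-n", "1",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```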
### Available Benchmarks

- `atomworld`: AtomWorld
- `pointworld`: PointWorld
- `cifgen`: CIFGen
- `cifrepair`: CIFRepair
For the StructProp task, see below.
### Available Actions

AtomWorld (a conceptual sketch of this kind of structure edit follows the two lists):
- add_atom_action
- change_atom_action
- delete_around_atom_action
- delete_below_atom_action
- insert_between_atoms_action
- move_around_atom_action
- move_atom_action
- move_selected_atoms_action
- move_towards_atom_action
- remove_atom_action
- rotate_around_atom_action
- swap_atoms_action
PointWorld:
- move
- move_towards
- insert_between
- rotate_around
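For intuition, each action corresponds to a small geometric edit of a crystal structure or point set. The snippet below illustrates roughly what move/add/remove edits look like using pymatgen; it is an illustration only, not the benchmark's own action implementation.

```python
# Illustration of the kind of structure edits the actions describe, using pymatgen.
# This is not the benchmark's implementation, just a conceptual example.
from pymatgen.core import Lattice, Structure

# A simple cubic NaCl-like cell.
structure = Structure(
    Lattice.cubic(4.0),
    ["Na", "Cl"],
    [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
)

# "move_atom"-style edit: translate site 0 by a fractional vector.
structure.translate_sites([0], [0.1, 0.0, 0.0], frac_coords=True)

# "add_atom"-style edit: append a new site.
structure.append("K", [0.25, 0.25, 0.25])

# "remove_atom"-style edit: delete the site that was just added.
structure.remove_sites([len(structure) - 1])

print(structure)
```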
To get CIFs from an LLM for StructProp:

```bash
python ./src/struct_prop_bench/inferring.py -m [model_name] -p [property] -b [batch_size] -n [num_batch]
```

Then run your own calculation pipelines. Save the results in a format similar to `./results/StructPropBench/dft_statistics.csv` so that `./src/scripts/analyze_structprop_results.py` can compute the final metrics, or modify the analysis script to fit your own results.
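The authoritative layout is whatever `dft_statistics.csv` contains in your checkout; the pandas sketch below only shows the general pattern of collecting per-structure results into a CSV, and every column name in it is hypothetical and should be replaced with the columns of the reference file.

```python
# Sketch: collect per-structure calculation results into a CSV for the analysis script.
# Column names here are hypothetical; mirror ./results/StructPropBench/dft_statistics.csv.
import pandas as pd

records = [
    {"structure_id": "sample_0001", "property": "band_gap", "predicted": 1.21, "converged": True},
    {"structure_id": "sample_0002", "property": "band_gap", "predicted": 0.00, "converged": False},
]

df = pd.DataFrame(records)
df.to_csv("./results/StructPropBench/my_model_statistics.csv", index=False)
```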
Results are saved in `./results/[BenchmarkType]/[ModelName]/[ActionName]/[Timestamp]/`. `evaluation_results.csv` contains the correct results, `evaluation_wrongs.csv` contains the incorrect ones, and `metrics.json` contains a summary of the metrics.
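The layout above is enough to post-process a run programmatically. The sketch below assumes `metrics.json` is a flat JSON object (its actual keys depend on the benchmark) and that the two CSVs can be read directly with pandas; the timestamp folder name is illustrative.

```python
# Sketch: load the outputs of one benchmark run for post-processing.
# The example run directory and the keys inside metrics.json are assumptions;
# inspect your own results folder for the actual layout.
import json
from pathlib import Path

import pandas as pd

run_dir = Path("./results/atomworld/deepseek_chat/move_atom_action/2025-01-01_00-00-00")

metrics = json.loads((run_dir / "metrics.json").read_text())
correct = pd.read_csv(run_dir / "evaluation_results.csv")
wrong = pd.read_csv(run_dir / "evaluation_wrongs.csv")

print(f"Metrics summary: {metrics}")
print(f"{len(correct)} correct vs {len(wrong)} incorrect samples")
```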
You can generate a `max_dist` histogram automatically after a benchmark run by adding the `--plot` flag to `run_benchmark.py`. Plotting is supported for the `atomworld`, `pointworld`, and `cifgen` benchmarks. The plot is saved to the same results folder as `evaluation_results.csv` and does not open an interactive window by default.
Examples:

```bash
python ./src/run_benchmark.py -t atomworld -m deepseek_chat -a move_atom_action -b 10 -n 1 --plot
python ./src/run_benchmark.py -t cifgen -m deepseek_chat -b 10 -n 1 --plot
```
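If you need the histogram outside the runner (for example from an older run without `--plot`), it can be rebuilt from `evaluation_results.csv`. The sketch below assumes the CSV exposes a `max_dist` column; verify the column name against your own output before using it.

```python
# Sketch: rebuild the max_dist histogram from a saved evaluation_results.csv.
# Assumes the CSV has a "max_dist" column; check your own results file first.
import matplotlib

matplotlib.use("Agg")  # no interactive window, mirroring the runner's default
import matplotlib.pyplot as plt
import pandas as pd

results_csv = "./results/atomworld/deepseek_chat/move_atom_action/2025-01-01_00-00-00/evaluation_results.csv"
df = pd.read_csv(results_csv)

plt.hist(df["max_dist"], bins=50)
plt.xlabel("max_dist")
plt.ylabel("count")
plt.savefig(results_csv.replace("evaluation_results.csv", "max_dist_hist.png"), dpi=150)
```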
The actions and the data_generator are currently being refactored, and the pipeline will be updated soon. If you want to construct your own data, follow the steps below:
- (Optional) Download random structures; the input CIFs we used are available in `./src/data/input_cifs.zip` (a rough sketch of what this download involves is shown after these steps):
  ```bash
  python src/scripts/download_random_mp_data.py --api_key [YOUR_API_KEY] --out_path [path] --min_natoms [min_atoms] --max_natoms [max_atoms] --num_entries [total_entries]
  ```
- Generate data:
  ```bash
  python src/atom_world/data_generator.py
  ```
- Convert to h5:
  ```bash
  python src/scripts/convert_cifs_to_h5.py
  ```
- Put the generated `[action_name].csv` and `[action_name].h5` files in `./src/data/`. Then you can run the benchmark with your own data.
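For the optional download step, the repository already provides `download_random_mp_data.py`. Purely as an illustration of what such a download involves, the sketch below fetches a handful of small structures directly with the `mp-api` client and writes them as CIFs; the package is an assumed dependency, and the parameter and field names should be checked against your installed mp-api version.

```python
# Sketch (not the repository's script): fetch a few small structures from the
# Materials Project and write them out as CIFs. Assumes `mp-api` and `pymatgen`
# are installed; verify parameter names against your mp-api version.
from pathlib import Path

from mp_api.client import MPRester

API_KEY = "YOUR_API_KEY"
out_dir = Path("./my_input_cifs")
out_dir.mkdir(parents=True, exist_ok=True)

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(
        num_sites=(4, 20),                      # analogous to --min_natoms / --max_natoms
        fields=["material_id", "structure"],
    )

for doc in docs[:100]:                          # analogous to --num_entries
    doc.structure.to(filename=str(out_dir / f"{doc.material_id}.cif"))
```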
## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Citation

```bibtex
@misc{lv2025atomworldbenchmarkevaluatingspatial,
      title={AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials},
      author={Taoyuze Lv and Alexander Chen and Fengyu Xie and Chu Wu and Jeffrey Meng and Dongzhan Zhou and Bram Hoex and Zhicheng Zhong and Tong Xie},
      year={2025},
      eprint={2510.04704},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2510.04704},
}
```
