BALROG is a novel benchmark evaluating agentic LLM and VLM capabilities on long-horizon interactive tasks using reinforcement learning environments. Check out how current models fare on our leaderboard. You can read more about BALROG in our paper.
- Comprehensive evaluation of agentic abilities
- Support for both language and vision-language models
- Integration with popular AI APIs and local deployment
- Easy integration for custom agents, new environments and new models
We recommend using conda for the installation:
conda create -n balrog python=3.10 -y
conda activate balrog
git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install
We also provide Docker images. Please see the relevant README.
We support running LLMs/VLMs locally using vLLM. You can spin up a vLLM server and evaluate your agent on BALROG in the following way:
pip install vllm numpy==1.23
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8080
python eval.py \
agent.type=naive \
agent.max_image_history=0 \
agent.max_history=16 \
eval.num_workers=32 \
client.client_name=vllm \
client.model_id=meta-llama/Llama-3.2-1B-Instruct \
client.base_url=http://0.0.0.0:8080/v1
Check out vLLM for more options on how to serve your models fast and efficiently.
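Before launching a long evaluation, it can help to confirm that the vLLM server is reachable and is serving the model you expect. The sketch below is not part of BALROG; it assumes only that `vllm serve` exposes the OpenAI-compatible `/v1/models` endpoint at the `client.base_url` used above. The helper names (`models_url`, `parse_model_ids`, `served_models`) are hypothetical.

```python
import json
import urllib.request
from typing import List


def models_url(base_url: str) -> str:
    """Return the model-listing endpoint for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> List[str]:
    """Extract the model ids from a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def served_models(base_url: str, timeout: float = 5.0) -> List[str]:
    """Ask the server which models it serves.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `served_models("http://0.0.0.0:8080/v1")` after `vllm serve` has finished loading should return a list that contains the value you pass as `client.model_id`; if it does not, the evaluation requests are likely to fail.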
We provide out-of-the-box clients for the OpenAI, Anthropic and Google Gemini APIs. First set up your API key:
export OPENAI_API_KEY=<KEY>
export ANTHROPIC_API_KEY=<KEY>
export GEMINI_API_KEY=<KEY>
Then run the evaluation with:
python eval.py \
agent.type=naive \
agent.max_image_history=0 \
eval.num_workers=64 \
client.client_name=openai \
client.model_id=gpt-4o-mini-2024-07-18
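If you launch several evaluations from a script, the command above can be assembled programmatically. The following is a minimal sketch rather than a BALROG utility: `build_eval_command` and `missing_api_key` are hypothetical helpers, and the `anthropic` and `gemini` client names in the mapping are assumptions (only `openai` and `vllm` appear as `client.client_name` values in this README).

```python
import os
import shlex
from typing import Dict, List, Mapping, Optional

# Environment variable expected for each hosted client.
# "anthropic" and "gemini" are assumed client names, not confirmed above.
API_KEY_ENV: Dict[str, str] = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def build_eval_command(overrides: Mapping[str, object]) -> List[str]:
    """Turn key=value overrides into an argv list for eval.py."""
    return ["python", "eval.py"] + [f"{key}={value}" for key, value in overrides.items()]


def missing_api_key(client_name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset API key variable for this client, or None."""
    var = API_KEY_ENV.get(client_name)
    if var is not None and not env.get(var):
        return var
    return None


overrides = {
    "agent.type": "naive",
    "agent.max_image_history": 0,
    "eval.num_workers": 64,
    "client.client_name": "openai",
    "client.model_id": "gpt-4o-mini-2024-07-18",
}
command = build_eval_command(overrides)

# Print the command for inspection; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Checking `missing_api_key(overrides["client.client_name"])` before running the command lets a script stop early with a clear message instead of failing on the first API request.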
- Evaluation Guide - Detailed instructions for various evaluation scenarios
- Agent Development - Tutorial on creating custom agents
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use BALROG in any of your work, please cite:
@article{paglieri2024balrog,
title={Benchmarking Agentic LLM and VLM Reasoning On Games},
author={Paglieri, Davide and Cupia{\l}, Bart{\l}omiej and Coward, Sam and Piterbarg, Ulyana and Wo{\l}czyk, Maciej and Khan, Akbir and Pignatelli, Eduardo and Kuci{\'n}ski, {\L}ukasz and Pinto, Lerrel and Fergus, Rob and Foerster, Jakob Nicolaus and Parker-Holder, Jack and Rockt{\"a}schel, Tim},
journal={arXiv preprint arXiv:2411.13543},
year={2024}
}