The ML Research Benchmark Baseline Agent is an agentic system designed to serve as a baseline for a range of AI and machine learning tasks. It provides a reference point for comparing and evaluating agents on machine learning research and development tasks.
- 📎 ML Research Benchmark Paper
- 🤖 ML Research Agent
- ✅ ML Research Tasks
- 📈 ML Research Evaluation
- 📓 Full Documentation
- Supports multiple AI/ML tasks
- Compatible with different LLM providers (OpenAI, Anthropic)
- Dockerized for easy deployment and reproducibility
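Since the agent supports multiple LLM providers, the corresponding API key must be available in the environment. The sketch below is a hypothetical helper, not the repository's actual configuration code; it assumes the standard `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables used by the OpenAI and Anthropic SDKs.

```python
import os

# Standard environment variables for the OpenAI and Anthropic SDKs.
# The repository's real .env handling may differ; this is illustrative.
PROVIDER_ENV_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def check_provider_key(provider: str) -> bool:
    """Return True if an API key for the chosen provider is set."""
    env_var = PROVIDER_ENV_KEYS.get(provider)
    if env_var is None:
        raise ValueError(f"Unsupported provider: {provider!r}")
    return bool(os.environ.get(env_var))
```

A check like this, run before launching a task, fails fast instead of erroring mid-run when the first model call is made.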
The baseline agent can perform the following tasks:
- LLM Efficiency
- Baby Language Model (LM)
- Mini Pile
- LLM Merging
- Edge LLM Compression
- Edge LLM Training
- Math Reasoning (Autoformalization, Autoinformalization, Autotheorem Generation)
Mini versions of several tasks are also available for quick testing and development.
Please find the full list of tasks along with their prompts and descriptions here: ML-Research-Agent-Tasks
The ML Research Benchmark Baseline Agent comes equipped with a variety of tools to assist in different AI and machine learning tasks:
- Bash Tool: Executes bash commands and scripts.
- Code Tool: Manages code operations including writing, inserting, replacing, and deleting code.
- GitHub Tool: Interacts with GitHub repositories to get README files, list files, and retrieve file contents.
- Semantic Scholar Tool: Searches for academic papers, retrieves paper details and citations, and downloads papers.
- Python Tool: Executes Python code.
- Return Function Tool: Handles task completion.
- Scratchpad Tool: Provides a scratchpad for experiment note-taking and temporary storage.
- Thought Tool: Allows the agent to process and record thoughts.
- Long-Term Memory Tool: Manages long-term memory storage and retrieval.
These tools can be used individually or in combination to tackle a wide range of AI research and benchmark tasks. The agent can seamlessly switch between tools as needed for complex operations.
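To illustrate how such tools can be combined, here is a minimal, hypothetical sketch of a tool registry with a dispatch function, using simplified stand-ins for the Scratchpad and Long-Term Memory tools. The class and method names are assumptions for illustration and do not reflect the repository's actual tool interfaces.

```python
class Scratchpad:
    """Minimal scratchpad: append-only notes for an experiment."""
    def __init__(self):
        self.notes = []

    def write(self, entry: str) -> str:
        self.notes.append(entry)
        return f"noted: {entry}"

class LongTermMemory:
    """Minimal key-value long-term memory store."""
    def __init__(self):
        self.store = {}

    def save(self, key: str, value: str) -> str:
        self.store[key] = value
        return f"saved: {key}"

    def recall(self, key: str) -> str:
        return self.store.get(key, "")

def run_tool(registry: dict, name: str, method: str, *args) -> str:
    """Dispatch a call to a named tool, as an agent loop might."""
    tool = registry[name]
    return getattr(tool, method)(*args)

# The agent records an observation, then persists a result for later runs.
registry = {"scratchpad": Scratchpad(), "memory": LongTermMemory()}
run_tool(registry, "scratchpad", "write", "lr=3e-4 diverges")
run_tool(registry, "memory", "save", "best_lr", "1e-4")
```

A registry of this shape lets the agent switch between tools by name, which is what makes chaining them across a multi-step task straightforward.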
- Python 3.x
- Docker (for containerized execution)
- Clone this repository:

  git clone https://github.com/AlgorithmicResearchGroup/ML-Research-Agent.git
  cd ML-Research-Agent

- Install dependencies:

  pip install -r requirements.txt
To run the agent without Docker, use the following command:
python3 run.py --task_name llm_efficiency --benchmark full_benchmark --provider openai
bash run.sh <image_name> <benchmark> <provider> <gpu_ids> <task_name> <time_limit> <huggingface_token> <env_file_path>
Example:
bash run.sh ghcr.io/algorithmicresearchgroup/ml-research-agent full_benchmark \
openai \
0 \
math_reasoning \
24h \
<huggingface_token> \
/home/ubuntu/.env
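If you are launching runs from Python (for example, to sweep over tasks), the invocation above can be assembled programmatically. The helper below is a hypothetical sketch that only builds the argument list in the positional order documented above; actually executing it (e.g. via `subprocess.run`) requires the repository, Docker, and a GPU to be set up.

```python
def build_run_command(image_name, benchmark, provider, gpu_ids,
                      task_name, time_limit, huggingface_token,
                      env_file_path):
    """Assemble the `bash run.sh ...` argument list in the
    positional order expected by run.sh."""
    return ["bash", "run.sh", image_name, benchmark, provider,
            str(gpu_ids), task_name, time_limit, huggingface_token,
            env_file_path]

cmd = build_run_command(
    image_name="ghcr.io/algorithmicresearchgroup/ml-research-agent",
    benchmark="full_benchmark",
    provider="openai",
    gpu_ids=0,
    task_name="math_reasoning",
    time_limit="24h",
    huggingface_token="hf_xxx",   # placeholder, not a real token
    env_file_path="/home/ubuntu/.env",
)
```

Keeping the argument order in one place avoids the easy mistake of transposing positional parameters when scripting many runs.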
For a full list of available tasks and their corresponding Docker run commands, please refer to the tasks repository: ML-Research-Agent-Tasks
Contributions to improve the baseline agent or add new tasks are welcome. Please submit a pull request or open an issue to discuss proposed changes.
AGPL-3.0
For questions or support, please contact Algorithmic Research Group at matt@algorithmicresearchgroup.com