
Repo-level benchmark for real-world Code Agents: from repo understanding → env setup → incremental dev/bug-fixing → task delivery, with cost-aware α metric.



🚀 GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

📰 News

  • 2025.08.28 🎉 We open-sourced RepoMaster — an AI agent that leverages GitHub repos to solve complex real-world tasks.
  • 2025.08.26 🎉 We open-sourced GitTaskBench — a repo-level benchmark & tooling suite for real-world tasks.
  • 2025.08.10 🎉 We open-sourced SE-Agent — a self-evolution trajectory framework for multi-step reasoning.

🔗 Ecosystem: RepoMaster · GitTaskBench · SE-Agent · Team Homepage

🧭 Motivation and Goal

The ultimate vision for AI agents is to enable users to accomplish real-world tasks simply by describing their needs in natural language—leaving all planning and execution to the agent, which delivers the final results autonomously.


⚠️ While existing benchmarks evaluate various agent capabilities, few focus on tasks that reflect genuine real-world practicality, especially those requiring comprehensive understanding and use of full-scale project repositories.

👋 To address this gap, we introduce GitTaskBench. Our benchmark focuses on tasks whose complexity and practical value demand leveraging repository-level code, mirroring how developers solve real problems using existing GitHub projects.


Overview of GitTaskBench. 7 example real-life tasks from different modalities and their evaluations are shown.

🔍 We carefully selected 54 representative tasks with real-world economic value and, for each task, identified a corresponding GitHub repository that meets strict selection criteria. (The repository for each task is fixed to ensure benchmark completeness, since some agent frameworks do not support searching for appropriate repositories.) This setup allows us to systematically evaluate LLM agents' ability to use open-source repositories to solve complex, realistic problems.

👉 By doing so, GitTaskBench offers a more authentic and comprehensive assessment of agent performance in practical, repository-driven environments.

🚀 How to Run

⚡ If you only want to know how to use GitTaskBench, start here.

0. Directory structure

└── QuantaAlpha/GitTaskBench/

├── README.md
├── setup.py
├── requirements.txt
├── Task_Success_Criteria.xlsx   # per-task success criteria
├── code_base/                   # all used repositories
│   ├── AnimeGANv3/
│   └── ...
├── queries/                     # all task definitions
│   ├── AnimeGANv3_01/
│   │   └── query.json
│   ├── AnimeGANv3_02/
│   │   └── query.json
│   └── ...
├── run_auto_prompt/             # generate all prompts
│   ├── new_run_setup.py
│   └── get_new_run_prompt.sh
├── Aider/                       # agent framework
│   └── ... 
├── SWE_agent/                   # agent framework
│   └── ... 
├── OpenHands/                   # agent framework
│   └── ...
├── config/                      # task evaluation configs
│   ├── AnimeGANv3_01/
│   │   └── task_info.yaml
│   ├── AnimeGANv3_02/
│   │   └── task_info.yaml
│   ├── AnimeGANv3_03/
│   └── ...
├── groundtruth/                 # ground truth
│   ├── Trafilatura_02/
│   │   └── gt.md
│   └── Trafilatura_03/...
├── output_for_show/             #  agent's outputs
│   ├── AnimeGANv3_01/
│   │   └── output.png
│   └── AnimeGANv3_02/...
├── gittaskbench/                # evaluation settings
│   ├── __init__.py
│   └── ...
├── test_scripts/                # test scripts
│   ├── AnimeGANv3_01/
│   │   └── test_script.py
│   ├── AnimeGANv3_02/
│   │   └── test_script.py
│   └──...
├── test_results_for_show/       # analysis results
│   ├── AnimeGANv3_02/
│   │   └── results.jsonl
│   └──...
└── test_reports/                # summary report
    ├── evaluation_report_openhands_gpt4o_100iters.txt
    ├── evaluation_report_openhands_gpt4o_70iters.txt
    ├── evaluation_report_openhands_gpt4o_30iters.txt
    └── ...
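
Each task pairs a query definition (queries/&lt;taskid&gt;/query.json) with an evaluation config (config/&lt;taskid&gt;/task_info.yaml). Below is a minimal sketch for inspecting one task programmatically; no schema fields are assumed, so print the loaded objects to see the actual structure:

import json
import yaml  # requires PyYAML

task_id = "AnimeGANv3_01"

# Task definition; print it to discover the actual schema
with open(f"queries/{task_id}/query.json") as f:
    print(json.load(f))

# Evaluation config for the same task
with open(f"config/{task_id}/task_info.yaml") as f:
    print(yaml.safe_load(f))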

1. Set Up ⚙️

GitTaskBench offers easy-to-use shell commands to ensure reproducible evaluations. To build GitTaskBench from source, follow the steps below.

First, create a new conda environment:

conda create -n gittaskbench python=3.10 -y
conda activate gittaskbench

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 \
  --extra-index-url https://download.pytorch.org/whl/cu113

Then, you can install gittaskbench with pip:

git clone https://github.com/QuantaAlpha/GitTaskBench.git
cd GitTaskBench
# install in editable mode
pip install -e .

Alternatively, you can install only the dependencies:

# install dependencies only
pip install -r requirements.txt
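
As a quick sanity check (assuming you used pip install -e .), the package should import cleanly:

python -c "import gittaskbench"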

2. Quick Start 💡

  • Single Task Evaluation:

If you need to evaluate a single, specific task, you can use the following command. The example below shows how to evaluate the Trafilatura_01 task:

cd GitTaskBench
# The outputs are saved in the DEFAULT "./output" directory, for example: "./output/Trafilatura_01/output.txt"
gittaskbench grade --taskid Trafilatura_01

Running the command will produce an analysis report (.jsonl) at the DEFAULT path (./test_results/Trafilatura_01). See test_results_for_show/ for a sample.
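
If you want to post-process a report programmatically, each line of the .jsonl file is one JSON record. A minimal sketch; the record fields depend on the task's test script, so inspect test_results_for_show/ for a concrete sample:

import json

with open("test_results/Trafilatura_01/results.jsonl") as f:
    for line in f:
        print(json.loads(line))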

The complete commands can be found in the 🤖 Automation Evaluation section.

  • All Tasks Evaluation

When you need to evaluate all tasks, you can use the --all parameter. This command will automatically iterate through and execute the evaluation of all tasks:

gittaskbench grade --all

  • Test Results Analysis

After completing the evaluation, if you want to analyze and summarize the test results, you can use the statistics command. It analyzes and summarizes the evaluation results in the specified directory and outputs an analysis report (.txt):

gittaskbench eval

See test_reports/ for a sample.

👉 That’s it. With the above commands you can run and analyze agent performance on GitTaskBench.

📊 Benchmark Overview

GitTaskBench is a comprehensive benchmark designed to evaluate the capabilities of intelligent agents across multiple modalities and task complexities. It encompasses 54 tasks spanning 7 key domains.

Each domain features a curated set of tasks that reflect real-world applications and research challenges. These tasks assess an agent's autonomous ability to interpret complex instructions, process multi-modal inputs, perform reasoning, understand and explore the associated GitHub repositories, and deliver accurate, meaningful outputs.

The GitTaskBench data curation and processing pipeline is illustrated below.


Overview of the GitTaskBench data curation and processing pipeline.

✅ Task Distribution

  • Image Processing: Style Transfer, Image Coloring, Image Restoration, Scratch Detection, Image Enhancement, Background Processing, Watermark Embedding
  • Video Processing: Video Action Analysis, Style Transfer, Video Coloring
  • Speech Processing: Speech Recognition, Speech Separation, Speech Enhancement, Noise Reduction, Speech Analysis
  • Physiological Signals Processing: EDA (Electrodermal Activity) Data Analysis, ECG (Electrocardiogram) Data Analysis, EOG (Electrooculogram) Data Analysis
  • Security and Privacy: Data Simulation, Watermark Embedding, Watermark Extraction
  • Web Scraping: Web Content Extraction, Format Transformation
  • Office Document Processing: Excel Document Parsing, PDF Content Extraction, PDF Content Processing
Task Domains and Summary Statistics.

🛠️ Integrating with Agent Frameworks

We provide detailed configuration guidelines on how to integrate GitTaskBench with existing state-of-the-art general-purpose agent frameworks, including OpenHands, SWE-Agent and Aider. This enables users to seamlessly run batches of benchmark tasks within their agent pipelines.

In fact, the batch runner we provide—designed to enable efficient execution of multiple tasks—is not limited to GitTaskBench, and can be broadly applied to other benchmarks and agent-based task suites as well.
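
As an illustration of the pattern (a minimal sketch built on the documented CLI, not the shipped runner):

import subprocess

# Any subset of the 54 task IDs
task_ids = ["Trafilatura_01", "AnimeGANv3_01"]

for task_id in task_ids:
    # Grade each task; -v surfaces detailed error messages
    subprocess.run(["gittaskbench", "-v", "grade", "--taskid", task_id], check=False)

# Summarize all results into a single report
subprocess.run(["gittaskbench", "eval"], check=False)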

👉 Configuration details for each agent framework are provided in the corresponding directories (OpenHands/, SWE_agent/, Aider/).

🤖 Automation Evaluation

After finishing the Set Up ⚙️ preparation, you can explore the complete usage of gittaskbench for automated evaluation:

gittaskbench [-v] grade --taskid <taskid> [--output_dir <output_dir>] [--result <result>]

🔧 Options:

  • --taskid <taskid> : (Required in single task evaluation) The task identifier, e.g., Trafilatura_01.
  • -v : (Optional) Enable verbose output to display detailed error messages.
  • --output_dir : (Optional) The directory containing the agent's output files. If not specified, the default value is read from task_info.yaml.
  • --result : (Optional) The directory containing the agent's test results files. If not specified, the default value is read from task_info.yaml.

gittaskbench eval [--result <result>]

🔧 Options:

  • --result : (Optional) The directory containing the agent's test results files. If not specified, the default is the test_results directory in the repo.
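
For example (the paths below are illustrative defaults):

gittaskbench -v grade --taskid Trafilatura_01 --output_dir ./output/Trafilatura_01 --result ./test_results/Trafilatura_01
gittaskbench eval --result ./test_results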

🔬 Evaluation Results

GitTaskBench evaluates two key aspects:

  • Execution Completion Rate: measures whether the agent can leverage the repository to produce any valid output.

  • Task Pass Rate: assesses whether the output meets task-specific evaluation criteria.

Given the diversity of tasks, all evaluation metrics are predefined and tailored to each task, drawing on commonly accepted standards within the developer community. This ensures a comprehensive and fair assessment.
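
Both headline metrics are simple ratios over the task set. A minimal sketch, where the per-task (completed, passed) booleans stand in for each task's process and result checks:

def summarize(results):
    # results: list of (completed, passed) booleans, one pair per task
    n = len(results)
    ecr = sum(c for c, _ in results) / n  # Execution Completion Rate
    tpr = sum(p for _, p in results) / n  # Task Pass Rate
    return ecr, tpr

# Example: 2 of 3 tasks completed, 1 of 3 passed
print(summarize([(True, True), (True, False), (False, False)]))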


Performance Comparison of Different Frameworks and LLMs on GitTaskBench.

Domain-specific performance is shown below.


Performance Evaluation of GPT-4o, GPT-4.1, Claude 3.5, DeepSeek V3 across Different Task Domains.

📝 Application Cases

📄 Case 1: PDF Email Extraction

task = """
Extract all email addresses found in the given PDF and save them to a text file.
Input file: /path/to/document.pdf
Output requirement: Save as output.txt
"""

# evaluation metrics = """
# Process: True/False (Data Integrity Check)  
#     -- Confirms prediction file and ground truth file accessibility  
#     -- Validates file parsing success (no read errors)  

# Result: True/False (Performance Threshold)  
#     -- Calculates accuracy: (Correct Emails / Total Ground Truth) ×100%  
#     -- Applies pass criterion: Accuracy ≥98%   
# """
🎥 Case 2: Video Coloring

task = """
Colorize the provided black-and-white video to produce a fully colored version.
Input file: /path/to/black_and_white_video
Output requirement: Output video is named as "output"
"""

# evaluation metrics = """
# Process: True/False (Technical Validity Verification)  
#     -- Verifies input video file existence and non-empty status  
#     -- Checks format compatibility (.mp4/.avi/.mov/.mkv)  
#     -- Validates frame extraction capability  

# Result: True/False (Color Intensity Threshold)  
#     -- Samples 30 frames with standardized width (256px)  
#     -- Computes per-frame colorfulness via the Hasler-Süsstrunk metric  
#     -- Aggregates scores to calculate video-level average  
#     -- Pass/Fail determination (threshold: > 10.0)  
# """
🖼️ Case 3: Image Watermark Embedding

task = """
Embed a blind (invisible) watermark into the given PNG image.
Input file: /path/to/image.png
Output requirement: Save as output.png
"""

# evaluation metrics = """
#  Process: True/False (Input Validation)  
#     -- Verifies existence and non-empty status of original/watermarked images  
#     -- Checks image file integrity (readable formats via OpenCV)  

# Result: True/False (Watermark & Quality Compliance)  
#     -- Extracts watermark text using DWT-DCT decoding  
#     -- Matches extracted text against ground truth (100% match required)  
#     -- Computes PSNR between original and watermarked images (≥30.0 dB threshold)  
#     -- Final pass requires both watermark match AND PSNR compliance  
# """

✨ Key Features:

  • Multi-Modal Support: Encompasses vision, language, audio, time-series, and web-based data.
  • Diverse Task Types: Includes generation, recognition, enhancement, analysis, and simulation tasks.
  • Real-World Relevance: Tasks are derived from practical applications in media, healthcare, automation, and data science.
  • Scalability: Designed for future expansion with new tasks and evaluation metrics.

🤝 Contributing

We welcome community contributions! Please refer to the following guidelines:

Development Setup

git clone https://github.com/QuantaAlpha/GitTaskBench.git
cd GitTaskBench

To learn more about automated evaluation, please refer to the 🤖 Automation Evaluation section.

Contribution Types

  • 🐛 Bug fixes
  • ✨ New feature development
  • 📚 Documentation improvements
  • 🧪 Test case additions
  • 🔧 Repository and utility additions

Submission Process

  1. Fork the project and create a feature branch
  2. Write code and tests
  3. Ensure all tests pass
  4. Submit a pull request

🌐 About QuantaAlpha

  • QuantaAlpha was founded in April 2025 by a team of professors, postdocs, PhDs, and master's students from Tsinghua University, Peking University, CAS, CMU, HKUST, and more.

🌟 Our mission is to explore the "quantum" of intelligence and pioneer the "alpha" frontier of agent research — from CodeAgents to self-evolving intelligence, and further to financial and cross-domain specialized agents, we are committed to redefining the boundaries of AI.

✨ In 2025, we will continue to produce high-quality research in the following directions:

  • CodeAgent: End-to-end autonomous execution of real-world tasks
  • DeepResearch: Deep reasoning and retrieval-augmented intelligence
  • Agentic Reasoning / Agentic RL: Agent-based reasoning and reinforcement learning
  • Self-evolution and collaborative learning: Evolution and coordination of multi-agent systems

📢 We welcome students and researchers interested in these directions to join us!

🔗 Team Homepage: QuantaAlpha


📖 Citation

If you find GitTaskBench useful in your research, please cite our work:

@misc{ni2025gittaskbench,
      title={GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging}, 
      author={Ziyi Ni and Huacan Wang and Shuo Zhang and Shuo Lu and Ziyang He and Wang You and Zhenheng Tang and Yuntao Du and Bill Sun and Hongzhang Liu and Sen Hu and Ronghao Chen and Bo Li and Xin Li and Chen Hu and Binxing Jiao and Daxin Jiang and Pin Lyu},
      year={2025},
      eprint={2508.18993},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2508.18993}, 
}

⭐ If GitTaskBench helps you, please give us a star!

Made with ❤️ by the GitTaskBench Team
