- [2025/11/18] We are excited to release PatchEval, a benchmark for evaluating Large Language Models (LLMs) on real-world vulnerability repair.
PatchEval is a benchmark designed to systematically evaluate LLMs and Agents in the task of automated vulnerability repair. It includes 1,000 vulnerabilities sourced from CVEs reported between 2015 and 2025, covering 65 CWE categories across Go, JavaScript, and Python. A subset of 230 CVEs is paired with Dockerized sandbox environments that enable runtime patch validation through Proof-of-Concept (PoC) and unit testing.
- Operating System: Linux (tested on Ubuntu 20.04 and 18.04)
- CPU: ≥ 16 cores
- Disk Storage: ≥ 500 GB of free space
Note
PatchEval uses Docker containers as sandbox environments for patch evaluation and agent interaction. We recommend allocating at least 500 GB of disk space, as PatchEval ships 230 Docker images, each consuming approximately 2 GB of storage.
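If you want a quick sanity check before pulling the images, a short script like the one below (not part of PatchEval; the 500 GB threshold simply mirrors the recommendation above) reports the free space on the filesystem that backs your Docker data root:

```python
# Check free disk space before downloading the PatchEval Docker images.
# Docker's data root usually lives on the root filesystem (/var/lib/docker);
# point PATH_TO_CHECK elsewhere if your setup differs.
import shutil

PATH_TO_CHECK = "/"
free_gb = shutil.disk_usage(PATH_TO_CHECK).free / 1024**3
print(f"Free space on {PATH_TO_CHECK}: {free_gb:.0f} GB")
if free_gb < 500:
    print("Warning: less than 500 GB free; pulling all PatchEval images may fail.")
```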
To get started, first install Docker by following the official Docker Setup Guide. Once Docker is ready, you can build PatchEval from source using the steps below:
git clone https://github.com/XXX
# We recommend using Miniconda to manage Python environments, but you may use a native Python installation if preferred.
conda create -n patcheval python==3.12
conda activate patcheval
pip install -r requirements.txt
The repository is organized as follows:
./
├── docs
├── patcheval
│   ├── evaluation    # Scripts for evaluation
│   ├── datasets      # Experimental datasets
│   ├── exp_agent     # Scripts for agent-based experiments
│   ├── exp_llm       # Scripts for LLM-based experiments
│   └── log           # Experimental logs
├── scripts           # Scripts for running experiments and validating results
├── README.md
└── requirements.txt
PatchEval includes 230 Docker images, each of which contains five key files in `/workspace` for patch validation (a usage sketch follows the list):
- `llm.patch`: The ground-truth (official) patch.
- `fix-run.sh`: Executes the PoC to verify that the patched vulnerability is no longer exploitable.
- `vul-run.sh`: Executes the PoC to verify that the original (unpatched) vulnerability is exploitable.
- `unit_test.sh` (if present): Executes unit tests to validate functional correctness.
- `prepare.sh`: Resets all changes in the repository. Run this script before each evaluation.
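As a rough illustration of how these files fit together, the sketch below drives a single container through one validation round via `subprocess` and the Docker CLI. The image tag, patch path, repository location, and exit-code semantics are placeholder assumptions for illustration only; in practice, `patcheval/evaluation/run_evaluation.py` automates this workflow.

```python
# Hypothetical sketch of one validation round; run_evaluation.py is the supported way to do this.
import subprocess

IMAGE = "patcheval/cve-xxxx-xxxx"   # placeholder image tag (not PatchEval's actual naming)
PATCH = "/tmp/candidate.patch"      # a patch you generated

def sh(container: str, cmd: str) -> int:
    """Run a shell command inside the container and return its exit code."""
    return subprocess.run(["docker", "exec", container, "bash", "-lc", cmd]).returncode

# Start a long-lived container from the image.
cid = subprocess.run(["docker", "run", "-d", IMAGE, "sleep", "infinity"],
                     capture_output=True, text=True, check=True).stdout.strip()
try:
    subprocess.run(["docker", "cp", PATCH, f"{cid}:/workspace/candidate.patch"], check=True)
    sh(cid, "cd /workspace && bash prepare.sh")              # reset the repository
    sh(cid, "cd /workspace && bash vul-run.sh")              # PoC against the unpatched code
    sh(cid, "cd /workspace && git apply candidate.patch")    # apply the candidate patch (repo layout assumed)
    fix_rc = sh(cid, "cd /workspace && bash fix-run.sh")     # PoC after patching
    unit_rc = sh(cid, "cd /workspace && bash unit_test.sh")  # functional tests, if present
    print(f"fix-run exit code: {fix_rc}, unit_test exit code: {unit_rc}")
finally:
    subprocess.run(["docker", "rm", "-f", cid], check=False)
```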
The vulnerability dataset is located in `patcheval/datasets`.
The file `input.json` contains the CVE metadata required to run vulnerability repair experiments, such as `cve_id`, `cve_description`, `programing_language`, and `vul_func`.
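For a quick look at the metadata, a few lines of Python are enough; the snippet below assumes `input.json` holds a list of per-CVE records with the fields named above, so adjust the accesses if the actual layout differs.

```python
# Inspect the CVE metadata shipped with PatchEval.
# Assumes input.json is a JSON list of per-CVE records (verify against the file itself).
import json

with open("patcheval/datasets/input.json") as f:
    records = json.load(f)

print(f"{len(records)} CVE entries")
for rec in records[:3]:
    print(rec["cve_id"], rec["programing_language"], "-", rec["cve_description"][:80])
```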
To download Docker images for patch validation, follow these steps:
cd scripts
python download_images.py
- Patch Generation:
You can either follow the steps in Reproducibility or use your own approach to generate vulnerability patches.
After generating patches, convert them into the format below (see `patcheval/evaluation/example_patch.json`) for validation:
[
    {
        "cve": "CVE_ID",
        "fix_patch": "YOUR_PATCH"
    }
]
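If your patches live in memory as a mapping from CVE ID to diff text, a small helper like the hypothetical one below can write them in that shape (field names are taken from the example above):

```python
# Write generated patches in the same shape as patcheval/evaluation/example_patch.json.
import json

def dump_patches(patches: dict[str, str], out_path: str = "my_patches.json") -> None:
    """patches maps a CVE ID to the unified-diff text of the candidate fix."""
    entries = [{"cve": cve, "fix_patch": diff} for cve, diff in patches.items()]
    with open(out_path, "w") as f:
        json.dump(entries, f, indent=2)

# Example usage with a placeholder diff.
dump_patches({"CVE-2015-XXXX": "--- a/app.py\n+++ b/app.py\n@@ ...\n"})
```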
- Patch Validation:
python patcheval/evaluation/run_evaluation.py -h
# Usage:
#   `--output`: Path to store experimental results.
#   `--patch_file`: Path to the patch file.
#   `--max_workers`: Maximum number of parallel workers for evaluation. Default is `4`.
#   `--log_level`: Logging level. Default is `INFO`.
#   `--artifact_eval`: Use this mode only when evaluating results in the patcheval/log/llm directory. Default is `False`.
#
# Example:
#   cd patcheval/evaluation/
#   python run_evaluation.py \
#     --output example \
#     --patch_file ./example_patch.json \
#     --log_level DEBUG
To ensure full reproducibility of the results reported in the paper and to facilitate future research, we provide the evaluation logs of all experiments (including ablation studies) in `patcheval/log`.
- Agentless-based Vulnerability Repair
- Download CVE repositories
cd patcheval/exp_llm/projects
python clone.py
- Configure the LLM API
# patcheval/exp_llm/API-ENV.json
{
"model_name": {
"api_key": "", // your api key
"api_url": "", // base url
"model": "" // endpoint
},
}
- Run experiments: You can start evaluation using the default configuration below.
For ablation studies, you can switch between different prompt templates under `patcheval/exp_llm/prompt_templates` (a sweep sketch follows the command below).
cd patcheval
# running logs are saved in `patcheval/exp_llm/output/logs`
python -m exp_llm.main \
--epochs 1 \
--model your_model_name \
--template ./exp_llm/prompt_templates/Default.txt \
--input ./datasets/input.json \
--local_repo_path ./exp_llm/projects \
--max_workers 5
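For the prompt-template ablation, one option is to loop the same entry point over every template file. The sketch below is not a PatchEval script; it assumes the templates are the `.txt` files under `exp_llm/prompt_templates` and reuses the flags from the command above.

```python
# Hypothetical ablation driver: re-run exp_llm.main once per prompt template.
# Run from the patcheval/ directory; the model name is a placeholder.
import pathlib
import subprocess

for template in sorted(pathlib.Path("exp_llm/prompt_templates").glob("*.txt")):
    print(f"=== running with template {template.name} ===")
    subprocess.run([
        "python", "-m", "exp_llm.main",
        "--epochs", "1",
        "--model", "your_model_name",
        "--template", str(template),
        "--input", "./datasets/input.json",
        "--local_repo_path", "./exp_llm/projects",
        "--max_workers", "5",
    ], check=True)
```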
- (Optional) Artifact evaluation: We provide pre-computed evaluation logs of all experiments in `patcheval/log/llm`. To evaluate the artifact, run the following command:
cd patcheval/evaluation
python run_evaluation.py \
--output artifact_eval_gemini2_5 \
--patch_file ../log/llm/fixed_gemini2_5_Default.json \
--artifact_eval
- Agent-based Vulnerability Repair (SWE-Agent with doubao-1.6 as an example)
- Install SWE-Agent
# Setup Guide: https://swe-agent.com/latest/installation/source/
git clone https://github.com/SWE-agent/SWE-agent.git
cd SWE-agent/
git checkout 8089c8baa55be1b12a61767e9b8e52bb63443b40 && patch -p1 -f < ../sweagent_diff.patch
python -m pip install --upgrade pip && pip install --editable .
- Configure your LLM using the template file `configs/template_without_feedback.yaml`
cd ../
# Use your own model name to create an evaluation template
cp configs/template_without_feedback.yaml configs/template_doubao_without_feedback.yaml
Edit the file `configs/template_doubao_without_feedback.yaml` to add your LLM endpoint, API key, and base URL:
name: openai/endpoint (e.g., openai/ep-20251031xxxx)
api_base: your_api_base_url
api_key: your_api_key
api_version: only used in AzureOpenAI, e.g., `2024-03-01-preview`
Register your model in `exp_agent/sweagent/SWE-agent/sweagent/run/run.py`:
# Python
litellm.register_model({
    # If you don't know the exact model name (e.g., doubao-seed-1-6-251015), you can simply use a
    # placeholder like doubao and run the patch generation script below; the correct model name
    # will be displayed in the error messages.
    "openai/doubao-seed-1-6-251015": {
        "max_tokens": 8192,
        "max_input_tokens": 128000,
        "max_output_tokens": 8192,
        "input_cost_per_token": 0,
        "output_cost_per_token": 0,
        "litellm_provider": "openai",
        "mode": "chat",
        "supports_function_calling": True,
        "supports_tool_choice": True
    }
})
- Run patch generation and validation
# patch generation
bash shells/run_exp1.sh doubao
# patch evaluation
bash shells/run_eval.sh doubao_exp1
# results are saved in patcheval/exp_agent/sweagent/evaluation_output/results/doubao_exp1/summary.json
- (Optional) Artifact evaluation: We provide pre-computed evaluation logs of all experiments in `patcheval/log/agent`. To evaluate the artifact, run the following command:
cd patcheval/evaluation
python run_evaluation.py \
--output artifact_eval_swe_agent_gemini \
--patch_file ../log/agent/sweagent/gemini_exp1.jsonl
For other agents, refer to the corresponding folders (ClaudeCode, OpenHands) for more details.
We would love to hear from the broader Security, Machine Learning, and Software Engineering communities! Whether you want to report a bug, suggest an idea, or submit a pull request, just open an issue or PR, and we'll get back to you soon!
Contact us: Jun ZENG, Ming WEN
If you find PatchEval useful for your research and applications, feel free to give us a star ⭐ or cite us using:
@misc{wei2025patcheval,
title={PATCHEVAL: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities},
author={Zichao Wei and Jun Zeng and Ming Wen and Zeliang Yu and Kai Cheng and Yiding Zhu and Jingyi Guo and Shiqi Zhou and Le Yin and Xiaodong Su and Zhechao Ma},
year={2025},
eprint={2511.11019},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2511.11019},
}
This project is licensed under the Apache License 2.0. See the LICENSE file for more details.
We would like to thank Caiyong Lin, Guangyu Zhou, Sen Cheng, Xufeng Zhou, Ke Sun, Jinhuang Liang, Zhongfu Su, Pengfei Sun, Zequn Fang, and Yongheng Yang at ByteDance for their dedicated efforts in reviewing the quality of the dataset. We thank Zhengqin Luo, Zhi Liu, Zach Zhang, and Yuan Zhang for their valuable feedback and advice. We also thank Shengqiang Li for helping with the artifact evaluation.

