An agent-based evaluation framework for complex code generation.
Requirement-guided multi-dimensional context distillation
- Collects contextual information based on the stepwise evaluation plan.
Fine-grained scoring and summarization
- Generates evaluation scores and reports through negotiation among multiple judges.
- Provides structured evaluation reports (Markdown and PDF formats) with evaluation scores, environment configuration, task requirements, stepwise evaluation results, and overall evaluation results.
- Integrates a variety of external tools for code evaluation, including dynamic execution, static linting, unit testing, screenshot/interaction capture, and web browsing.
- Python 3.x
- Docker
- Git
- Clone the repository:
git clone https://github.com/Eshe0922/CodeVisionary.git
- Build the Docker image:
cd docker
docker build -t codevisionary.evaluate .
docker pull eshe1836316339/codevisionary:lint
docker tag eshe1836316339/codevisionary:lint codevisionary.lint
- Install the required dependencies:
pip install -r requirements.txt
npm install --save-dev prettier
apt-get install pandoc
apt-get install texlive-xetex
You can execute the `run.sh` script, which invokes `main.py` with the following arguments:
SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
python3 main.py \
--evaluation_path "${SCRIPT_DIR}/dataset/benchmark_test.jsonl" \
--write_path "${SCRIPT_DIR}/experiments/test" \
--pdf
Where:
- `--evaluation_path`: Path to the evaluation dataset in JSONL format. This file contains the questions and responses to be evaluated.
- `--write_path`: Directory where the evaluation results and outputs will be saved.
- `--pdf`: (Optional) If specified, the evaluation results will also be exported as a PDF report.
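For orientation, here is a minimal sketch of how this command-line interface could be parsed with Python's `argparse`. It only illustrates the flags listed above and is an assumption, not the repository's actual `main.py`.

```python
# Illustrative sketch of the expected CLI (assumption: not the actual main.py).
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Run a CodeVisionary evaluation.")
    parser.add_argument("--evaluation_path", required=True,
                        help="Path to the evaluation dataset in JSONL format.")
    parser.add_argument("--write_path", required=True,
                        help="Directory where evaluation results are saved.")
    parser.add_argument("--pdf", action="store_true",
                        help="Also export the evaluation report as a PDF.")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Evaluating {args.evaluation_path} -> {args.write_path} (pdf={args.pdf})")
```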
The evaluation dataset should be a JSON Lines file, where each line is a JSON object representing a single evaluation sample. Each object should have the following fields:
- `id`: (int) Unique identifier for the sample.
- `question`: (str) The coding or evaluation question.
- `response`: (str) The code or answer generated by the model.
- `model`: (str) The name or identifier of the model that generated the response.
Example:
{"id": 4, "question": "Find the maximum element in a list.", "response": "def find_max(lst):\n return max(lst)", "model": "gpt-4"}
- `agents/` - Agent implementations for code evaluation
- `dataset/` - Datasets used for code evaluation
- `docker/` - Docker-related configurations
- `experiments/` - Experiment results
- `tools/` - External tools designed for code evaluation
- `utils/` - Utility functions and helper classes
- `main.py` - Main entry point
- `run.sh` - Shell script for executing `main.py`
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
@misc{wang2025codevisionaryagentbasedframeworkevaluating,
title={CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation},
author={Xinchen Wang and Pengfei Gao and Chao Peng and Ruida Hu and Cuiyun Gao},
year={2025},
eprint={2504.13472},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2504.13472},
}
MIT