An agent-based evaluation framework for complex code generation.
Requirement-guided multi-dimensional context distillation
- Collects contextual information based on the stepwise evaluation plan.
Fine-grained scoring and summarization
- Generates evaluation scores and reports through negotiation among multiple judges.
- Provides structured evaluation reports (Markdown and PDF formats) with evaluation scores, environment configuration, task requirements, stepwise evaluation results, and overall evaluation results.
- Integrates a variety of external tools for code evaluation, including dynamic execution, static linting, unit testing, screenshot/interaction capture, and web browsing.
- Python 3.x
- Docker
- Git
- Clone the repository:
git clone https://github.com/Eshe0922/CodeVisionary.git
- Build the Docker image:
cd docker
docker build -t codevisionary.evaluate .
docker pull eshe1836316339/codevisionary:lint
docker tag eshe1836316339/codevisionary:lint codevisionary.lint
- Install the required dependencies:
pip install -r requirements.txt
npm install --save-dev prettier
apt-get install pandoc
apt-get install texlive-xetex
You can execute the `run.sh` script, which invokes `main.py` with the following arguments:
SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
python3 main.py \
--evaluation_path "${SCRIPT_DIR}/dataset/benchmark_test.jsonl" \
--write_path "${SCRIPT_DIR}/experiments/test" \
--pdf
Where:
- `--evaluation_path`: Path to the evaluation dataset in JSONL format. This file contains the questions and responses to be evaluated.
- `--write_path`: Directory where the evaluation results and outputs will be saved.
- `--pdf`: (Optional) If specified, the evaluation results will also be exported as a PDF report.
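For orientation, here is a minimal sketch of how this command-line interface could be parsed with Python's `argparse`. It only illustrates the flags listed above and is an assumption, not the repository's actual `main.py`.

```python
# Illustrative sketch of the expected CLI (assumption: not the actual main.py).
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Run a CodeVisionary evaluation.")
    parser.add_argument("--evaluation_path", required=True,
                        help="Path to the evaluation dataset in JSONL format.")
    parser.add_argument("--write_path", required=True,
                        help="Directory where evaluation results are saved.")
    parser.add_argument("--pdf", action="store_true",
                        help="Also export the evaluation report as a PDF.")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Evaluating {args.evaluation_path} -> {args.write_path} (pdf={args.pdf})")
```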
The evaluation dataset should be a JSON Lines file, where each line is a JSON object representing a single evaluation sample. Each object should have the following fields:
- `id`: (int) Unique identifier for the sample.
- `question`: (str) The coding or evaluation question.
- `response`: (str) The code or answer generated by the model.
- `model`: (str) The name or identifier of the model that generated the response.
Example:
{"id": 4, "question": "Find the maximum element in a list.", "response": "def find_max(lst):\n return max(lst)", "model": "gpt-4"}
- `agents/` - Agent implementations for code evaluation
- `dataset/` - Datasets used for code evaluation
- `docker/` - Docker-related configurations
- `experiments/` - Experiment results
- `tools/` - External tools designed for code evaluation
- `utils/` - Utility functions and helper classes
- `main.py` - Main entry point
- `run.sh` - Shell script for executing `main.py`
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
@misc{wang2025codevisionaryagentbasedframeworkevaluating,
title={CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation},
author={Xinchen Wang and Pengfei Gao and Chao Peng and Ruida Hu and Cuiyun Gao},
year={2025},
eprint={2504.13472},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2504.13472},
}
MIT