CodeVisionary

An agent-based evaluation framework for complex code generation (ASE'25).

Features

Two-Stage Framework

Requirement-guided multi-dimensional context distillation

  • Collects contextual information based on the stepwise evaluation plan.

Fine-grained scoring and summarization

  • Generates evaluation scores and reports through negotiation among multiple judges (see the sketch after this list).

Detailed Evaluation Report

  • Provides structured evaluation reports (in Markdown and PDF) with evaluation scores, environment configuration, task requirements, stepwise evaluation results, and the overall evaluation result.

Multi-Tool Integration

  • Integrates a variety of external tools for code evaluation, including dynamic execution, static linting, unit tests, screenshot/interaction, and web browsing.
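
To make the two-stage flow above concrete, here is a minimal, illustrative Python sketch. Every function in it is a hypothetical placeholder standing in for the corresponding agent or tool; it is not the CodeVisionary API.

from __future__ import annotations
from dataclasses import dataclass
from statistics import mean

@dataclass
class Sample:
    question: str
    response: str

def derive_plan(sample: Sample) -> list[str]:
    # Stage 1: a planning agent would derive a stepwise evaluation plan
    # from the task requirements; placeholder steps shown here.
    return ["check syntax", "run the code", "compare output with the requirements"]

def collect_context(step: str, sample: Sample) -> str:
    # Stage 1: contextual information (execution traces, lint output, ...)
    # would be gathered per step via the integrated tools; placeholder only.
    return f"context collected for step: {step}"

def judge_scores(context: list[str]) -> list[float]:
    # Stage 2: several judge agents would each score the sample; placeholder scores.
    return [0.8, 0.7, 0.9]

def negotiate(scores: list[float]) -> float:
    # Stage 2: judges negotiate a final fine-grained score; a simple mean
    # stands in for that negotiation here.
    return mean(scores)

sample = Sample("Find the maximum element in a list.",
                "def find_max(lst):\n    return max(lst)")
plan = derive_plan(sample)
context = [collect_context(step, sample) for step in plan]
print("Final score:", negotiate(judge_scores(context)))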

Prerequisites

  • Python 3.x
  • Docker
  • Git

Installation

  1. Clone the repository:
git clone https://github.com/Eshe0922/CodeVisionary.git
  2. Build the evaluation Docker image and pull the lint image:
cd docker
docker build -t codevisionary.evaluate .
docker pull eshe1836316339/codevisionary:lint
docker tag eshe1836316339/codevisionary:lint codevisionary.lint 
  3. Install the required dependencies (see the optional environment check below):
pip install -r requirements.txt
npm install --save-dev prettier
apt-get install pandoc
apt-get install texlive-xetex
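
Optionally, the following standalone Python snippet (not part of the repository) checks that the external commands used above are available on PATH; adjust the list to your setup:

import shutil
import sys

# External commands that the installation steps above rely on.
required = ["git", "docker", "python3", "npm", "pandoc", "xelatex"]

missing = [cmd for cmd in required if shutil.which(cmd) is None]
if missing:
    sys.exit("Missing tools: " + ", ".join(missing))
print("All required tools found on PATH.")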

Usage

You can execute the run.sh script, which invokes main.py with the following arguments:

SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
python3 main.py \
  --evaluation_path "${SCRIPT_DIR}/dataset/benchmark_test.jsonl" \
  --write_path "${SCRIPT_DIR}/experiments/test" \
  --pdf

Where:

  • --evaluation_path: Path to the evaluation dataset in JSONL format. This file contains the questions and responses to be evaluated.
  • --write_path: Directory where the evaluation results and outputs will be saved.
  • --pdf: (Optional) If specified, the evaluation results will also be exported as a PDF report.

The evaluation dataset should be a JSON Lines file, where each line is a JSON object representing a single evaluation sample. Each object should have the following fields:

  • id: (int) Unique identifier for the sample.
  • question: (str) The coding or evaluation question.
  • response: (str) The code or answer generated by the model.
  • model: (str) The name or identifier of the model that generated the response.

Example:

{"id": 4, "question": "Find the maximum element in a list.", "response": "def find_max(lst):\n    return max(lst)", "model": "gpt-4"}
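
For reference, here is a minimal sketch that writes such a JSONL file using only the Python standard library; the sample content and output path are illustrative:

import json

# Illustrative samples following the schema above (id, question, response, model).
samples = [
    {
        "id": 4,
        "question": "Find the maximum element in a list.",
        "response": "def find_max(lst):\n    return max(lst)",
        "model": "gpt-4",
    },
]

# Write one JSON object per line (JSON Lines format).
with open("dataset/benchmark_test.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")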

Project Structure

  • agents/ - Agent implementations for code evaluation
  • dataset/ - Datasets used for code evaluation
  • docker/ - Docker-related configurations
  • experiments/ - Experiment results
  • tools/ - External tools designed for code evaluation
  • utils/ - Utility functions and helper classes
  • main.py - Main entry point
  • run.sh - Shell script for executing main.py

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Citation

@misc{wang2025codevisionaryagentbasedframeworkevaluating,
      title={CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation}, 
      author={Xinchen Wang and Pengfei Gao and Chao Peng and Ruida Hu and Cuiyun Gao},
      year={2025},
      eprint={2504.13472},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2504.13472}, 
}

License

MIT

Acknowledgement

https://github.com/Aider-AI/aider
