This is the replication package accompanying our paper, Curating Review Comments for Improved Code Review Automation.
The datasets of this paper are available on Zenodo and Hugging Face 🤗.
Note
[11-02-2025] 🤗 CuREV is now available on Hugging Face! You can access it here.
[05-02-2025] 🔥 We release the first version of CuREV and the complete replication package of our paper.
[12-01-2025] 📢 Our paper has been accepted for MSR 2025! Read it on arXiv.
We propose a methodology to curate a code review dataset to enhance its quality and improve the performance of language models on code review downstream tasks, namely comment generation and code refinement.
Overview of our curation pipeline for code reviews to create the CuREV dataset (see our paper for more details).
The main contributions of this work are threefold: (1) a data-centric evaluation framework, (2) a curation pipeline to improve the quality of review comments, and (3) an evaluation of the curated dataset, compared to the original, on downstream tasks (i.e., comment generation and code refinement).
- Project structure
- Environment setup
- Data
- Models
- 1- A Data-Centric Evaluation Framework
- 2- CuREV: a Curated Dataset for Code Review
- 3-a. Comment Generation
- 3-b. Code Refinement
- Contributors
- Citation
The project is structured as follows.
.
├── code_refinement/      # Code refinement package
├── comment_generation/   # Comment generation package
├── quality_assessment/   # Empirical study package
├── data_curation/        # Dataset curation package
├── util/                 # Package for helpers and config
├── data/                 # Folder for datasets and results
├── models/               # Folder for large language models
└── requirements.txt      # Required Python libraries
There are two ways to set up the environment to run the project and use the scripts. Choose the method that best suits your workflow: a Python virtual environment or Docker.
This method involves setting up a Python virtual environment to isolate the dependencies required by the project. Follow the steps below to create the environment, install the necessary libraries, and verify the installation.
To facilitate usage and results replication, we include a requirements.txt file to install the required Python libraries.
Here are the instructions to create a virtual environment, activate it, and install dependencies using the provided requirements.txt file:
- Create a Virtual Environment
  Run the following command to create a virtual environment named venv:
  python3 -m venv venv
- Activate the Virtual Environment
  - On macOS/Linux:
    source venv/bin/activate
  - On Windows:
    .\venv\Scripts\activate
- Install Dependencies
  With the virtual environment activated, install the required Python libraries from requirements.txt:
  pip install -r requirements.txt
- Verify the Installation
  To confirm that all dependencies are installed correctly, run:
  pip list
- Deactivating the Environment
  When you're finished, you can deactivate the virtual environment with:
  deactivate
Alternatively, you can run the project within a Docker container, which makes it easy to replicate the setup in an isolated environment.
- Docker installed on your system. You can download it from here.
- Build the Docker image
  Open a terminal in the root directory of the project and run the following command to build the Docker image:
  docker build -t curev .
  This will create a Docker image named curev based on the provided Dockerfile.
- Create the Docker container
  After building the image, you can create a Docker container using:
  docker run -it --name curev-container curev
  The above command should be run once (the first time) to create a container from the built image. For subsequent runs, use the following commands to start and connect to the created container:
  docker start curev-container
  docker exec -it curev-container bash
  To stop the container:
  docker stop curev-container
  The project is now set up to run within a Docker container, and you can easily start, connect to, and stop it as needed.
- Verify the Installation
  To confirm that all dependencies are installed correctly, run:
  pip list
The original code review dataset is available on Zenodo.
To run the experiments, you need to download Code_Refinement.zip and place the dataset under the data/ folder.
You can use the utility method create_HFdataset in util.dataset to merge the downloaded JSONL files into a Hugging Face dataset, as sketched below.
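A minimal sketch of that merging step is shown below; the argument names and the save location are assumptions, so check util/dataset.py for the actual signature of create_HFdataset:

```python
# Illustrative sketch only: the argument names below are assumptions;
# check util/dataset.py for the actual signature of create_HFdataset.
from util.dataset import create_HFdataset

create_HFdataset(
    data_dir="data/Code_Refinement/",              # folder with the downloaded JSONL files
    output_path="data/Code_Refinement/CRdataset",  # dataset path expected by the scripts below
)
```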
We run Llama-3.1-70B on our local machines using ExLlamaV2 to generate accurate judgments based on our defined evaluation framework.
You can use the same model, or download a quantized version of any other model that is compatible with ExLlamaV2.
The downloaded model should be placed under the models/ folder.
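For example, a quantized checkpoint can be fetched from the Hugging Face Hub with huggingface_hub; the repository id below is an assumption (any ExLlamaV2-compatible quantization works), and the local directory matches the --model_dir used in the commands that follow:

```python
# Hypothetical example: download a quantized Llama-3.1-70B-Instruct build into models/.
# The repo_id is an assumption; substitute any ExLlamaV2-compatible quantized model.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",  # assumed repository id
    local_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",        # matches --model_dir below
)
```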
We propose an evaluation framework to categorize and assess the quality of code reviews. It consists of (1) a categorization scheme to classify the type, nature, and civility of code review comments, and (2) scoring criteria to assess the overall quality of code reviews based on their relevance, clarity, and conciseness. We apply our evaluation framework to the largest existing dataset of code reviews. Given the scale of the dataset, we utilize a large language model (LLM) as a judge to automatically annotate samples with thoroughly designed prompts to ensure reliable and consistent annotations.
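For illustration, each judged sample can be thought of as a record like the one sketched below; the field names, category values, and score scale are hypothetical and only reflect the framework's description (type, nature, civility, plus relevance, clarity, and conciseness scores), not the exact schema used in the code:

```python
# Hypothetical sketch of one LLM-judge annotation. Field names, category values,
# and the score scale are assumptions; see quality_assessment/ for the actual schema.
from dataclasses import dataclass

@dataclass
class ReviewJudgment:
    comment_id: str
    type: str         # category of the comment (e.g., refactoring, bug, documentation)
    nature: str       # e.g., prescriptive vs. descriptive feedback
    civility: str     # e.g., "civil" or "uncivil"
    relevance: int    # quality score (assumed integer scale)
    clarity: int      # quality score (assumed integer scale)
    conciseness: int  # quality score (assumed integer scale)
```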
The experiments conducted for this contribution are available under the quality_assessment/ folder.
To run the LLM judgments:
python quality_assessment/inference.py \
--model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" \
--dataset_path="data/Code_Refinement/CRdataset" \
--save_steps=5000
The full list of arguments is available in util/config.py.
The experiments conducted for this contribution are available under the data_curation/ folder.
To run the experiments for reformulating review comments:
python reformulate_reviews/inference.py \
--model_dir="models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" \
--dataset_path="data/Code_Refinement/CRdataset" \
--output_path="data/eval_results/reform_results.jsonl" \
--save_steps=5000
The full list of arguments is available in util/config.py.
The experiments conducted for this contribution are available under the comment_generation/ folder.
- To train a language model on comment generation on the original dataset:
python comment_generation/sft_init.py \
--model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
--dataset_path="data/Code_Refinement/CRdataset" \
--save_steps=200 \
--checkpoint_path="models/comment_generation/init_ckpts" \
--output_path="models/comment_generation/final_model"
- To train a language model on comment generation on the curated dataset:
python comment_generation/sft_cur.py \
--model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
--dataset_path="data/Code_Refinement/CRdataset_reform" \
--save_steps=200 \
--checkpoint_path="models/comment_generation/init_ckpts" \
--output_path="models/comment_generation/final_model"
- To run the inference on the initial or curated dataset:
python comment_generation/hf_inference-init.py
python comment_generation/hf_inference-cur.py
- To run the evaluation of both models:
python comment_generation/evaluation.py
- The full list of arguments is available in util/config.py.
The experiments conducted for this contribution are available under the code_refinement/ folder.
- To run the inference of a model for code refinement on the initial dataset:
python code_refinement/hf_inference-init.py \
--model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
--dataset_path="data/Code_Refinement/CRdataset" \
--save_steps=1000 \
--output_path="models/init_coderef_results.jsonl"
- To run the inference of a model for code refinement on the curated dataset:
python code_refinement/hf_inference-cur.py \
--model_name_or_path="deepseek-ai/deepseek-coder-6.7b-instruct" \
--dataset_path="data/Code_Refinement/CRdataset_reform" \
--save_steps=1000 \
--output_path="models/cur_coderef_results.jsonl"
- To run the evaluation of both models:
python code_refinement/evaluate.py
- The full list of arguments is available in util/config.py.
For questions, collaborations, or further discussion, please feel free to reach out to any of our contributors via email or open an issue on GitHub.
| Name | Contact | Github |
|---|---|---|
| Oussama Ben Sghaier | oussama.ben.sghaier@umontreal.ca | OussamaSghaier |
| Martin Weyssow | mweyssow@smu.edu.sg | martin-wey |
| Houari Sahraoui | sahraouh@iro.umontreal.ca | |
@misc{sghaier2025harnessinglargelanguagemodels,
title={Harnessing Large Language Models for Curated Code Reviews},
author={Oussama Ben Sghaier and Martin Weyssow and Houari Sahraoui},
year={2025},
eprint={2502.03425},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2502.03425},
}