Source code for our EMNLP 2025 paper: "CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion" [arXiv].
1. Install uv.
2. Install the dependencies and activate the virtual environment:
   uv sync
   source .venv/bin/activate
Before running the scripts, edit the configuration file config/config.toml.
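The config schema is not documented here, so the fragment below is a purely hypothetical sketch of what a retrieval-plus-generation pipeline config of this kind often contains; every key and value is an assumption, not the repo's actual schema — consult config/config.toml itself for the real fields.

```toml
# Hypothetical illustration only; the actual keys in config/config.toml may differ.
[dataset]
benchmark_path = "data/benchmark.json"  # assumed location of the benchmark file

[retriever]
name = "bm25"   # assumed retriever identifier
top_k = 10      # number of code blocks to retrieve per query

[generator]
model = "your-model-name"  # placeholder for the completion model
max_new_tokens = 128
```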
Then execute the Python scripts sequentially:
python scripts/build_query.py
- Generates query strings from the benchmark dataset.
python scripts/retrieve.py
- Retrieves top-k relevant code blocks using the configured retriever.
python scripts/rerank.py
- Reranks retrieved code blocks based on their estimated importance.
python scripts/build_prompt.py
- Constructs prompts from retrieved code blocks for the code completion generator.
python scripts/inference.py
- Feeds prompts to the generator model.
- You can replace this step with your own inference code.
- Input: a JSON file containing an array of prompt strings.
- Output: a JSON file containing an array of generated completions.
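If you substitute your own inference code, it only needs to respect the JSON-array-in, JSON-array-out contract described above. A minimal drop-in sketch (the file paths and the placeholder generate_completion function are assumptions — replace the latter with your actual model call):

```python
import json
import sys


def generate_completion(prompt: str) -> str:
    """Placeholder: swap in your own model call (API request, local LLM, etc.)."""
    return prompt + "  # completion"


def run_inference(input_path: str, output_path: str) -> None:
    # Input: a JSON file containing an array of prompt strings.
    with open(input_path) as f:
        prompts = json.load(f)

    completions = [generate_completion(p) for p in prompts]

    # Output: a JSON file containing an array of generated completions,
    # in the same order as the input prompts.
    with open(output_path, "w") as f:
        json.dump(completions, f)


if __name__ == "__main__":
    run_inference(sys.argv[1], sys.argv[2])
```

Keeping the one-completion-per-prompt ordering intact matters, since scripts/evaluation.py pairs each completion with its benchmark example by position.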
python scripts/evaluation.py
- Evaluates code completion performance using inference results.
If you find this work helpful, please consider citing our paper:
@inproceedings{coderag2025,
  title={CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion},
  author={Zhang, Sheng and Ding, Yifan and Lian, Shuquan and Song, Shun and Li, Hui},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025}
}
For questions, please open an issue or contact dingyf@stu.xmu.edu.cn.