This repository contains the main implementation of DyCodeEval, introduced in our ICML 2025 paper: “DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination.”
🏆 Leaderboard • 💻 Code • 🤗 Hugging Face Dataset • 🔮 Code Kaleidoscope Project
DyCodeEval proposes a novel dynamic benchmarking framework (dynamic evaluation dataset + dynamic metric) for evaluating code large language models (Code LLMs). It leverages a multi-agent cooperation strategy to rewrite existing benchmarks at evaluation time, producing programming problems that are:

1. Semantically equivalent,
2. Diverse, and
3. Non-deterministic.

This dynamic generation process helps mitigate data contamination and provides a more robust and faithful assessment of a model's reasoning capabilities.
- Core implementation of DyCodeEval to dynamically rewrite existing programming problems
- Scripts to reproduce all experiments and benchmarks presented in the paper
- Pre-generated benchmark variants for HumanEval and MBPP
- Support for fine-tuning open-source models: enable dynamic benchmark generation by fine-tuning open-source code models, removing the dependency on the Claude API.
- Simplified `DyPass@K` evaluation script: provide an easy-to-use, unified script for computing `DyPass@K`, replacing the current set of fragmented scripts.
We provide pre-generated HumanEval and MBPP datasets on Hugging Face. You can load a dataset using the following code. Here, `raw_data_name` specifies the Hugging Face dataset path, `split` selects either the Sonnet or Haiku pre-generated set, and `random_seed` controls the randomness of the selection.
```python
from utils import load_unique_dataset

raw_data_name = "CM/Dynamic_HumanEvalZero"  # Hugging Face dataset path
split = "Claude3.5_Sonnet"                  # or "Claude3.5_Haiku"
random_seed = 0                             # controls the randomness of the selection

# group_name selects which group of generated problem variants to sample from.
dataset = load_unique_dataset(raw_data_name, split, group_name, random_seed=random_seed)
```
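As a quick sanity check after loading, you can inspect the selected problems. The snippet below assumes `load_unique_dataset` returns a standard Hugging Face `datasets.Dataset`; the exact field names depend on the pre-generated split.

```python
# Sanity check: inspect the dynamically selected benchmark.
# Assumes load_unique_dataset returns a Hugging Face datasets.Dataset.
print(f"Loaded {len(dataset)} problems from {raw_data_name} ({split})")
print("Fields:", dataset.column_names)
print(dataset[0])  # peek at the first rewritten problem
```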
Install the necessary libraries with `pip install -r requirement.txt`.
We use `litellm` as our unified framework to invoke each LLM, so first follow its documentation to set up an account for each commercial LLM provider.
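To confirm your credentials work before launching the full generation pipeline, a minimal check along these lines can help. The environment variable and model identifier below are examples for Claude via Anthropic; adapt them to the providers you use.

```python
# Minimal sketch: verify that litellm can reach your commercial LLM account.
# The model identifier is only an example; use one your account has access to.
import os
import litellm

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # or export it in your shell

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)
```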
To generate new benchmark problems using an LLM agent, run:
```bash
python gen_problem.py --agent_id=0 --seed_data_id=0 --scenario_num=10 --context_num=10
```
- `agent_id`: specifies the LLM agent for generation (`0` for Claude 3.5 Haiku, `1` for Claude 3.5 Sonnet).
- `seed_data_id`: selects the seed dataset (`0` for HumanEval, `1` for MBPP).
- `scenario_num` and `context_num`: control the number of new data samples generated.
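For instance, to generate variants for both seed benchmarks with both Claude agents, you could drive `gen_problem.py` from a small wrapper like the sketch below; the flag values mirror the options listed above, and `scenario_num`/`context_num` should be adjusted to your generation budget.

```python
# Sketch: sweep gen_problem.py over both agents and both seed datasets.
# Flag values follow the options documented above.
import itertools
import subprocess

for agent_id, seed_data_id in itertools.product([0, 1], [0, 1]):
    subprocess.run(
        [
            "python", "gen_problem.py",
            f"--agent_id={agent_id}",          # 0: Claude 3.5 Haiku, 1: Claude 3.5 Sonnet
            f"--seed_data_id={seed_data_id}",  # 0: HumanEval, 1: MBPP
            "--scenario_num=10",
            "--context_num=10",
        ],
        check=True,
    )
```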
To compute the dynamic metric `DyPass@K`, first run `python gen_problem.py` to generate multiple versions of the benchmark problems. Next, run `python gen_code.py` to generate code for each problem. Finally, execute `python eval_pass_K.py` to compute the `DyPass@K` metric.
We will provide more streamlined, user-friendly scripts for this process in the near future.
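In the meantime, as a rough illustration of the idea behind the metric (not the paper's exact implementation in `eval_pass_K.py`), the sketch below computes the standard unbiased pass@k estimator for each dynamically generated benchmark variant and averages it across variants.

```python
# Illustrative sketch only: average the standard unbiased pass@k estimator
# over dynamically generated benchmark variants.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated for a problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def dy_pass_at_k(results: dict[str, list[tuple[int, int]]], k: int) -> float:
    """results maps each benchmark variant to (n, c) pairs, one per problem."""
    per_variant = [
        sum(pass_at_k(n, c, k) for n, c in problems) / len(problems)
        for problems in results.values()
    ]
    return sum(per_variant) / len(per_variant)
```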
- `src/`: Core implementation of the framework.
  - `src/code_llm/`: LLM inference framework, including abstractions for both vLLM and LiteLLM (supporting open-source and commercial code LLMs).
  - `src/data/`: Data structures for various code-related tasks.
  - `src/pt_opt/`: Implementations of different prompt templates.
  - `src/task_mutation/`: Algorithms for generating new programming problems.
- `eval_pass_K.py`: Computes pass@K.
- `gen_code.py`: Generates code for different benchmarks.
- `gen_problem.py`: Creates new benchmark problems with DyCodeEval.
- `make_dataset.py`: Prepares datasets and pushes them to Hugging Face.
- `overfit_train.py`: Overfits code LLMs to simulate data contamination.
- `pt.py`: Prompt templates for code generation.
- `utils.py`: Utility functions.
```bibtex
@inproceedings{chen2025dynamic,
  title     = {Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models under Data Contamination},
  author    = {Chen, Simin and Pusarla, Pranav and Ray, Baishakhi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
  publisher = {PMLR}
}
```