This repository contains the main implementation of DyCodeEval, introduced in our ICML 2025 paper: “DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination.”
🏆 Leaderboard • 💻 Code • 🤗 Hugging Face Dataset • 🔮 Code Kaleidoscope Project
DyCodeEval proposes a novel dynamic benchmarking framework (dynamic evaluation dataset + dynamic metric) for evaluating code large language models (Code LLMs). It leverages a multi-agent cooperation strategy to rewrite existing benchmarks at evaluation time, producing programming problems that are:

1. Semantically equivalent,
2. Diverse, and
3. Non-deterministic.

This dynamic generation process helps mitigate data contamination and provides a more robust and faithful assessment of a model's reasoning capabilities.
- Core implementation of DyCodeEval to dynamically rewrite existing programming problems
- Scripts to reproduce all experiments and benchmarks presented in the paper
- Pre-generated benchmark variants for HumanEval and MBPP
- Support for fine-tuning open-source models: enable dynamic benchmark generation by fine-tuning open-source code models, removing the dependency on the Claude API.
- Simplified `DyPass@K` evaluation script: provide an easy-to-use, unified script for computing `DyPass@K`, replacing the current set of fragmented scripts.
We provide pre-generated HumanEval and MBPP datasets on Hugging Face. You can load a dataset using the following code. Here, `raw_data_name` specifies the Hugging Face dataset path, `split` selects either the Sonnet or Haiku pre-generated set, and `random_seed` controls the randomness of the selection.
```python
from utils import load_unique_dataset

raw_data_name = "CM/Dynamic_HumanEvalZero"  # Hugging Face dataset path
split = "Claude3.5_Sonnet"                  # or "Claude3.5_Haiku"
random_seed = 0                             # controls the randomness of the selection

# group_name selects which group of generated problem variants to sample from.
dataset = load_unique_dataset(raw_data_name, split, group_name, random_seed=random_seed)
```
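As a quick sanity check after loading, you can inspect the selected problems. The snippet below assumes `load_unique_dataset` returns a standard Hugging Face `datasets.Dataset`; the exact field names depend on the pre-generated split.

```python
# Sanity check: inspect the dynamically selected benchmark.
# Assumes load_unique_dataset returns a Hugging Face datasets.Dataset.
print(f"Loaded {len(dataset)} problems from {raw_data_name} ({split})")
print("Fields:", dataset.column_names)
print(dataset[0])  # peek at the first rewritten problem
```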
Install the necessary libraries with `pip install -r requirement.txt`.
We use `litellm` as our unified framework to invoke each LLM, so first follow its documentation to set up an account for each commercial LLM provider.
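To confirm your credentials work before launching the full generation pipeline, a minimal check along these lines can help. The environment variable and model identifier below are examples for Claude via Anthropic; adapt them to the providers you use.

```python
# Minimal sketch: verify that litellm can reach your commercial LLM account.
# The model identifier is only an example; use one your account has access to.
import os
import litellm

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # or export it in your shell

response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)
```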
To generate new benchmark problems using an LLM agent, run:
```bash
python gen_problem.py --agent_id=0 --seed_data_id=0 --scenario_num=10 --context_num=10
```
- `agent_id`: specifies the LLM agent for generation (`0` for Claude 3.5 Haiku, `1` for Claude 3.5 Sonnet).
- `seed_data_id`: selects the seed dataset (`0` for HumanEval, `1` for MBPP).
- `scenario_num` and `context_num`: control the number of new data samples generated.
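For instance, to generate variants for both seed benchmarks with both Claude agents, you could drive `gen_problem.py` from a small wrapper like the sketch below; the flag values mirror the options listed above, and `scenario_num`/`context_num` should be adjusted to your generation budget.

```python
# Sketch: sweep gen_problem.py over both agents and both seed datasets.
# Flag values follow the options documented above.
import itertools
import subprocess

for agent_id, seed_data_id in itertools.product([0, 1], [0, 1]):
    subprocess.run(
        [
            "python", "gen_problem.py",
            f"--agent_id={agent_id}",          # 0: Claude 3.5 Haiku, 1: Claude 3.5 Sonnet
            f"--seed_data_id={seed_data_id}",  # 0: HumanEval, 1: MBPP
            "--scenario_num=10",
            "--context_num=10",
        ],
        check=True,
    )
```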
To compute the dynamic metric `DyPass@K`, first run `python gen_problem.py` to generate multiple versions of the benchmark problems. Next, run `python gen_code.py` to generate code for each problem. Finally, execute `python eval_pass_K.py` to compute the `DyPass@K` metric.
We will provide more streamlined, user-friendly scripts for this process in the near future.
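In the meantime, as a rough illustration of the idea behind the metric (not the paper's exact implementation in `eval_pass_K.py`), the sketch below computes the standard unbiased pass@k estimator for each dynamically generated benchmark variant and averages it across variants.

```python
# Illustrative sketch only: average the standard unbiased pass@k estimator
# over dynamically generated benchmark variants.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated for a problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def dy_pass_at_k(results: dict[str, list[tuple[int, int]]], k: int) -> float:
    """results maps each benchmark variant to (n, c) pairs, one per problem."""
    per_variant = [
        sum(pass_at_k(n, c, k) for n, c in problems) / len(problems)
        for problems in results.values()
    ]
    return sum(per_variant) / len(per_variant)
```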
- `src/`: Core implementation of the framework.
  - `src/code_llm/`: LLM inference framework, including abstractions for both vLLM and LiteLLM (supporting open-source and commercial code LLMs).
  - `src/data/`: Data structures for various code-related tasks.
  - `src/pt_opt/`: Implementations of different prompt templates.
  - `src/task_mutation/`: Algorithms for generating new programming problems.
- `eval_pass_K.py`: Computes pass@K.
- `gen_code.py`: Generates code for different benchmarks.
- `gen_problem.py`: Creates new benchmark problems with DyCodeEval.
- `make_dataset.py`: Prepares datasets and pushes them to Hugging Face.
- `overfit_train.py`: Overfits code LLMs to simulate data contamination.
- `pt.py`: Prompt templates for code generation.
- `utils.py`: Utility functions.
```bibtex
@inproceedings{chen2025dynamic,
  title     = {Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models under Data Contamination},
  author    = {Chen, Simin and Pusarla, Pranav and Ray, Baishakhi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year      = {2025},
  publisher = {PMLR}
}
```