This is the official code release accompanying our paper "Benchmarking Language Model Creativity: A Case Study on Code Generation". Our dataset under datasets/CodeForce/NeoCoder contains:
- NeoCoder dataset: 199 problems, each with a maximum of 5 temporally relevant constraints.
- Historical human solutions: 30 human solutions per problem and their technique detection results (by GPT-4).
- Human annotated test cases: our manually annotated test cases that fix certain parsing problems introduced during crawling.
Other supporting files: 500 crawled original Codeforces problems and crawled raw test cases under datasets/CodeForce/crawled.
steps/ // callable scripts corresponding to each step of denial prompting and creativity evaluation.
src/ // source code for models, evaluators, data collation, etc.
scripts/ // bash scripts for scaling up experiments.
- Set up the ZenRows API key for scraping:
echo "export ZENROWS_API_KEY='yourkey'" >> ~/.bashrc
- Set up the OpenAI API key for generation:
echo "export OPENAI_API_KEY='yourkey'" >> ~/.bashrc
- Create environment:
conda create --name creativity python=3.9
- Activate environment:
conda activate creativity
- Install dependencies:
pip install -r requirements.txt
If you only want to use our NeoCoder dataset to reproduce the results, run only the Inference and NeoGauge@T Calculation steps below.
Note that the NeoCoder.json file is originally saved automatically under the name format {model_name}_diff={diff}_sample={num_sample}_dp={dp_rounds}.json. For simplicity, we manually rename it to NeoCoder.json to match the dataset name in our paper.
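For illustration, a file generated with GPT-4 might be renamed as follows (all values in the original file name are hypothetical and depend on your run):
mv gpt-4_diff=800_sample=199_dp=5.json NeoCoder.json  # hypothetical file name; values depend on your run configuration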
- Crawl Codeforces problems:
python steps/crawl_codeforce_problem.py --raw-data-dir datasets/CodeForce/raw/CodeForce800spreadsheet.xlsx --save-dir --num-sample --difficulty
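For example, a hypothetical invocation (the flag values are placeholders, not necessarily the exact settings used in the paper):
python steps/crawl_codeforce_problem.py --raw-data-dir datasets/CodeForce/raw/CodeForce800spreadsheet.xlsx --save-dir datasets/CodeForce/crawled --num-sample 500 --difficulty 800  # hypothetical values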
- Crawl human solutions:
python steps/crawl_codeforce_solution.py --crawled-problem-path --save-dir --max-solution-num
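A hypothetical invocation (placeholder paths; 30 matches the per-problem solution count described above):
python steps/crawl_codeforce_solution.py --crawled-problem-path datasets/CodeForce/crawled/problems.json --save-dir datasets/CodeForce/crawled --max-solution-num 30  # hypothetical paths and values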
- Prepare Test Cases:
python steps/parse_test_case.py --data-path --output-dir
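A hypothetical invocation with placeholder paths:
python steps/parse_test_case.py --data-path datasets/CodeForce/crawled --output-dir datasets/CodeForce/crawled/test_cases  # hypothetical paths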
- Manually correct test cases to match inputs and outputs. We provide our annotated results in datasets/CodeForce/NeoCoder/test_cases_annotated.json.
- Generate NeoCoder dataset:
python steps/generate_dp.py --problem-set-dir --model-name --num-sample --dp-rounds --output-dir
In our experiments, we generate NeoCoder with GPT-4 using the following script:
bash scripts/generate_dp_dataset.sh
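To call the step directly instead, a hypothetical invocation (placeholder paths and values) would be:
python steps/generate_dp.py --problem-set-dir datasets/CodeForce/crawled --model-name gpt-4 --num-sample 199 --dp-rounds 5 --output-dir datasets/CodeForce/NeoCoder  # hypothetical paths and values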
- Inference on NeoCoder dataset:
python steps/inference_dp.py --dataset-path --model-name {HF_MODEL_NAME, OPENAI_MODEL_NAME} --dp-rounds --batch-size --output-dir
We provide a running example in scripts/inference_dp_dataset_llama3.slurm.
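Outside of Slurm, a hypothetical direct invocation (placeholder model name, paths, and values) would be:
python steps/inference_dp.py --dataset-path datasets/CodeForce/NeoCoder/NeoCoder.json --model-name meta-llama/Meta-Llama-3-8B-Instruct --dp-rounds 5 --batch-size 8 --output-dir outputs/inference  # hypothetical values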
- Evaluate correctness:
python steps/creativity_evaluation.py --task correctness --inference-result-path --test-case-path --save-folder --model-family
We provide a running example in scripts/correctness_evaluation.sh.
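A hypothetical direct invocation (placeholder paths; the --model-family value should match the model used for inference):
python steps/creativity_evaluation.py --task correctness --inference-result-path outputs/inference --test-case-path datasets/CodeForce/NeoCoder/test_cases_annotated.json --save-folder outputs/correctness --model-family llama3  # hypothetical paths and values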
- Detect Techniques:
python steps/creativity_evaluation.py --task detection --inference-result-path --human-solution-path
We provide a running example in scripts/detect_techniques.sh.
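A hypothetical direct invocation with placeholder paths:
python steps/creativity_evaluation.py --task detection --inference-result-path outputs/inference --human-solution-path datasets/CodeForce/NeoCoder  # hypothetical paths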
- Final NeoGauge@T Calculation:
python steps/creativity_evaluation.py --task creativity --inference-result-path --human-solution-path --save-folder
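A hypothetical invocation with placeholder paths, reusing the outputs of the previous steps:
python steps/creativity_evaluation.py --task creativity --inference-result-path outputs/inference --human-solution-path datasets/CodeForce/NeoCoder --save-folder outputs/neogauge  # hypothetical paths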
If you use this code, please cite the following paper:
@misc{lu2024benchmarkinglanguagemodelcreativity,
title={Benchmarking Language Model Creativity: A Case Study on Code Generation},
author={Yining Lu and Dixuan Wang and Tianjian Li and Dongwei Jiang and Daniel Khashabi},
year={2024},
eprint={2407.09007},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.09007},
}