🐲🔨 exLong: Generating Exceptional Behavior Tests with Large Language Models

exLong is a large language model instruction-tuned from CodeLlama that embeds reasoning about:

  • traces that lead to throw statements
  • conditional expressions that guard throw statements
  • non-exceptional behavior tests that execute similar traces

About

This repo hosts the code and data for the following ICSE 2025 paper:

Title: exLong: Generating Exceptional Behavior Tests with Large Language Models

Authors: Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

@inproceedings{ZhangETAL25exLong,
  author = {Zhang, Jiyang and Liu, Yu and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {exLong: Generating Exceptional Behavior Tests with Large Language Models},
  booktitle = {International Conference on Software Engineering},
  year = {2025},
}

Table of Contents

  1. Quick Start 🤗
  2. Set Up 🚀
  3. Experiments 👷
  4. Artifacts

Quick Start

  • The exLong dataset is on Hugging Face 🤗!
from datasets import load_dataset

with_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "with-EBT-name")
no_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "no-EBT-name")
  • The exLong model is on Hugging Face 🤗!
pip install transformers accelerate bitsandbytes peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

# Load the base model
base_model_name = "codellama/CodeLlama-7b-Instruct-hf"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Load the LoRA configuration
peft_model_id = "EngineeringSoftware/exLong"
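revision = "with-etest-name"  # set to "no-etest-name" for the model trained without EBT names
config = PeftConfig.from_pretrained(peft_model_id, revision=revision)

# Load the LoRA model (pass the same revision so the matching adapter weights are loaded)
model = PeftModel.from_pretrained(base_model, peft_model_id, revision=revision)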
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

prompt = """<s>[INST] <<SYS>>
You are a helpful programming assistant and an expert Java programmer. You are helping a user writing exceptional-behavior tests for their Java code.
<</SYS>>

Please complete an exceptional behavior test method in Java to test the method 'factorial' for the exception 'IllegalArgumentException'.
The method to be tested is defined as:
```java
public static long factorial(int n) {
    if (n < 0) {
        throw new IllegalArgumentException("Number must be non-negative.");
    }
    long result = 1;
    for (int i = 1; i <= n; i++) {
        result *= i;
    }
    return result;
}
```
Please only give the new exceptional-behavior test method to complete the following test class. Do NOT use extra libraries or define new helper methods. Return **only** the code in the completion:
```java
public class FactorialTest {
}
```
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate code
output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    temperature=0.2,      # Sampling temperature (lower is more deterministic)
    top_p=0.95,           # Top-p (nucleus) sampling
    do_sample=True        # Enable sampling
)

# Decode and print the generated code
generated_code = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Code:")
print(generated_code)
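
The decoded string above still contains the prompt, because `generate` returns the prompt tokens followed by the new ones. The following is a minimal sketch of two helpers that may be convenient here. `build_prompt` fills the Quick Start template for another method, and `extract_java` pulls the first fenced Java block out of a completion, falling back to the raw text. Both function names are hypothetical and not part of the exLong code base. The template wording is copied from the example above, with the closing fences assumed to be ordinary triple backticks.

```python
import re

FENCE = "`" * 3  # three backticks, built here so this example stays readable

SYSTEM = (
    "You are a helpful programming assistant and an expert Java programmer. "
    "You are helping a user writing exceptional-behavior tests for their Java code."
)


def build_prompt(method_name, exception, method_src, test_class_src):
    """Fill the Quick Start prompt template (hypothetical helper)."""
    return (
        f"<s>[INST] <<SYS>>\n{SYSTEM}\n<</SYS>>\n\n"
        "Please complete an exceptional behavior test method in Java to test "
        f"the method '{method_name}' for the exception '{exception}'.\n"
        "The method to be tested is defined as:\n"
        f"{FENCE}java\n{method_src.strip()}\n{FENCE}\n"
        "Please only give the new exceptional-behavior test method to complete "
        "the following test class. Do NOT use extra libraries or define new "
        "helper methods. Return **only** the code in the completion:\n"
        f"{FENCE}java\n{test_class_src.strip()}\n{FENCE}\n"
    )


def extract_java(completion):
    """Return the first fenced code block in `completion`, or the stripped text."""
    pattern = re.compile(FENCE + r"(?:java)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(completion)
    return (match.group(1) if match else completion).strip()
```

`extract_java` should be applied to the completion only, since the prompt itself contains fenced Java code. One way to drop the prompt is to decode just the new tokens, for example `tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)`.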

Set Up

Dependencies Set Up

  1. Create conda environment
conda create -n exlong python=3.9
conda activate exlong
pip install -r requirements.txt
  2. We used axolotl to fine-tune the CodeLlama model. To train your own model, install the extra dependencies:
# we used an older version of axolotl to train the models
git clone git@github.com:JiyangZhang/axolotl-exlong.git
cd axolotl-exlong/
conda activate exlong
pip install packaging
# set CUDA_HOME
export CUDA_HOME=/opt/apps/cuda/12.0/
pip3 install -e '.[flash-attn,deepspeed]'

Experiments Set Up

  1. Download raw dataset
mkdir -p _work/data/
mkdir -p _work/exp/
mkdir -p _work/setup/

wget -L https://utexas.box.com/shared/static/hfcp4za3j9vp8lh5u8iviadixuxu8080.gz -O raw-data.tar.gz
tar -xzf raw-data.tar.gz -C _work/data/
mv _work/data/etestgen-raw-data-12k _work/data/ne2e

wget -L https://utexas.box.com/shared/static/4m7mntp0ix18dkl1ikkspcmpuvybfs1f.gz -O ne2e-test.tar.gz
tar -xzf ne2e-test.tar.gz -C _work/data/

wget -L https://utexas.box.com/shared/static/y4e52k5x8vk8vcr59lg33gebcg2m1caw.gz -O rq2.tar.gz
tar -xzf rq2.tar.gz -C _work/data/

# netest-diversity
wget -L https://utexas.box.com/shared/static/j417e93j1rdvdqz2yobttygfhucfbkjm.gz -O netest-diversity.tar.gz
tar -xzf netest-diversity.tar.gz -C _work/data/

You should see _work/data/ne2e, _work/data/rq1-eval, _work/data/rq2 and _work/data/netest-diversity.
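
A quick way to confirm that the archives were extracted where the later steps expect them is sketched below. The helper name `missing_dirs` is hypothetical; the directory names are the four listed above.

```python
from pathlib import Path

# Directory names taken from the download step above.
EXPECTED_DATA_DIRS = ("ne2e", "rq1-eval", "rq2", "netest-diversity")


def missing_dirs(root, names=EXPECTED_DATA_DIRS):
    """Return the expected sub-directories of `root` that do not exist."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]
```

Calling `missing_dirs("_work/data")` from the repository root should return an empty list once all four archives are unpacked. The same helper can be pointed at `_work/setup` with a list of setup names after the data preparation step.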

  2. Prepare the datasets and place them in the _work/setup directory
  • exLong and exLong sample (Tables IV and V)
# exlong
inv -e data.setup-model-data --setup-name conditionnestack2e-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-no-name-ft

# exlong sample
inv -e data.setup-model-data --setup-name conditionnestack2e-all-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-all-no-name-ft

You should see _work/setup/conditionnestack2e-with-name-ft/, _work/setup/conditionnestack2e-no-name-ft/, _work/setup/conditionnestack2e-all-with-name-ft/, _work/setup/conditionnestack2e-all-no-name-ft/ directories.

  3. Construct prompts for exLong developer-view
  • exLong
inv -e data.process-codellama-data --setup-name conditionnestack2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionnestack2e-no-name-ft
  4. Construct prompts for exLong machine-view
mkdir -p _work/setup/conditionnestack2e-all-no-name-ft/eval/
cp -r _work/data/rq2/ _work/setup/conditionnestack2e-all-no-name-ft/eval/
python -m etestgen.codellama.realDataProcessor --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml process_test_data

You will see _work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl.
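
To sanity-check a generated `.jsonl` file before training or inference, a small reader such as the following sketch may help. It assumes only that the file is in JSON Lines format (one JSON value per line); the helper name `peek_jsonl` is hypothetical, and nothing is assumed about the record fields.

```python
import json


def peek_jsonl(path, limit=1):
    """Return (record_count, first `limit` records) for a JSON Lines file."""
    count, head = 0, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            count += 1
            if len(head) < limit:
                head.append(json.loads(line))
    return count, head
```

For example, `count, head = peek_jsonl("_work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl")` gives the number of prompts and the first record, whose structure can then be inspected interactively.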

Experiments

Training

  1. Training exLong with EBT name

Note: conditionnestack2e is the setup name for exLong

cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-with-name-7b.yaml

You will see checkpoints in directory _work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/

  2. Training exLong without EBT name
cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-no-name-7b.yaml
# script to run on TACC
sbatch axolotl-lora-codellama-7b-conditionnestack2e-no-name.sh

You will see checkpoints in directory _work/exp/conditionnestack2e-no-name-ft/lora-codellama-7b/

  3. Running exLong inference for developer-view
cd python/
# Run evaluation on the selected 434 examples in the test set
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-conditionnestack2e-with-name-ft.yaml run_gen --split real-test

You will see the model outputs in _work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/real-test-set-model-outputs.jsonl, in the same directory as the checkpoints.

  4. Running exLong inference for machine-view
cd python/
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml run_gen
# Evaluation1: all covered projects
python -m etestgen.llm.eval --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_runtime_metrics
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-runtime-metrics.json`
# Evaluation2: intersection projects
python -m etestgen.llm.eval --eval_set rq2 --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_subset_llm_results --subset_id_file ../results/tool-results/intersect-ids.json
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-intersect-runtime-metrics.json`

You will see the model generations in _work/exp/conditionnestack2e-all-no-name-ft/lora-codellama-7b/rq2-model-outputs.jsonl

Evaluation: compute metrics

Given the test cases generated by exLong, this step evaluates them with metrics such as BLEU, CodeBLEU, and test coverage.

  • Input/Output info

    • The dataset used for evaluation is expected at _work/{test_data}
    • Processed LLM predictions are expected at _work/exp/{setup}/{model_name}/test-results
    • Similarity metric results are written to _work/exp/{setup}/{model_name}/test-out/similarity_metrics_summary.json and results/model-results/{setup}-{exp}-{eval_set}-sim-metrics.json
    • Runtime metrics are written to results/model-results/{setup}-{exp}-{eval_set}-runtime-metrics.json, and per-example results to _work/exp/{setup}/{model_name}/test-results/metrics.jsonl
  • To run evaluation on similarity metrics

    • Run on an individual experiment

      python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_llm_sim
  • To run evaluation on runtime metrics

    • Run on an individual experiment

      python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_runtime_metrics
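
The path templates listed under Input/Output info can be expanded with a small helper like the sketch below, which may be handy when collecting results from several runs. The function name `result_paths` and its dictionary keys are hypothetical. How `{setup}`, `{model_name}`, `{exp}` and `{eval_set}` are derived from a config file is not specified here, so they are plain arguments.

```python
from pathlib import Path


def result_paths(setup, model_name, exp, eval_set,
                 work_dir="_work", results_dir="results"):
    """Expand the documented path templates for one experiment."""
    exp_dir = Path(work_dir) / "exp" / setup / model_name
    model_results = Path(results_dir) / "model-results"
    prefix = f"{setup}-{exp}-{eval_set}"
    return {
        # Input: processed LLM predictions
        "predictions": exp_dir / "test-results",
        # Outputs: similarity metrics
        "similarity_summary": exp_dir / "test-out" / "similarity_metrics_summary.json",
        "similarity_metrics": model_results / f"{prefix}-sim-metrics.json",
        # Outputs: runtime metrics (aggregate and per example)
        "runtime_metrics": model_results / f"{prefix}-runtime-metrics.json",
        "per_example_metrics": exp_dir / "test-results" / "metrics.jsonl",
    }
```

The returned values are `pathlib.Path` objects relative to the repository root, so `path.exists()` can be used to check whether an evaluation has already been run.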

Ablations on exLong's context

Diversity of the nEBTs

  1. Prepare Dataset
mkdir -p _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
mkdir -p _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cd python/
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml process_real_test_data
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml process_real_test_data

You will see processed data in _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/ and _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/.

  2. Running Inference
# [2nd row] use the same exLong ckpt but try prompting with the same nEBT multiple times
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml  run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/
# [3rd row] use the same exLong ckpt but try prompting with different nEBTs
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml  run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/

You will see model outputs in directory _work/exp/diversity-conditionnestack2e-sample-with-name-ft/lora-codellama-7b/ and _work/exp/diversity-conditionnestack2e-all-with-name-ft/lora-codellama-7b/

exLong without stack trace

  1. Prepare Dataset
inv -e data.setup-model-data --setup-name conditionne2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionne2e-with-name-ft
  2. Running Inference
python -m etestgen.codellama.CodeLLaMA --config_file python/configs/eval/codellama-7b-conditionne2e-with-name-ft.yaml run_gen

exLong without stack trace and guard expression

  1. Prepare Dataset
inv -e data.setup-model-data --setup-name ne2e-with-name-ft
inv -e data.process-codellama-data --setup-name ne2e-with-name-ft
  2. Running Inference
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-ne2e-with-name-ft.yaml run_gen

exLong without stack trace, guard expression, and nEBT

  1. Prepare Dataset
inv -e data.setup-model-data --setup-name mut2e-with-name-ft
inv -e data.process-codellama-data --setup-name mut2e-with-name-ft
  2. Running Inference
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-mut2e-with-name-ft.yaml run_gen

Artifacts:

Model Checkpoints:

Dataset:

  • repos.tar.gz: The repository list from which we collected the dataset.
  • raw-data.tar.gz: The raw data collected from the open-source repositories (extracts to etestgen-raw-data-12k/).
  • ne2e-test.tar.gz: The dataset collected for evaluation in developer-view (extracts to rq1-eval/).
  • machine-view.tar.gz: The dataset collected for evaluation in machine-view (extracts to rq2/).
  • netest-diversity.tar.gz: The dataset we use to study how different nEBTs affect the model's performance (Table VII) (extracts to netest-diversity/).
  • processed dataset: The processed dataset (prompts) to train the exLong models.
