exLong is a large language model, instruction-tuned from CodeLlama, that embeds reasoning about:
- traces that lead to throw statements
- conditional expressions that guard throw statements
- non-exceptional behavior tests that execute similar traces
This repo hosts the code and data for the following ICSE 2025 paper:
Title: exLong: Generating Exceptional Behavior Tests with Large Language Models
Authors: Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric
```bibtex
@inproceedings{ZhangETAL25exLong,
  author = {Zhang, Jiyang and Liu, Yu and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {exLong: Generating Exceptional Behavior Tests with Large Language Models},
  booktitle = {International Conference on Software Engineering},
  year = {2025},
}
```

We also include the implementation of the CLI tool described in our FSE 2025 Demo Paper:
Title: A Tool for Generating Exceptional Behavior Tests With Large Language Models
Authors: Linghan Zhong, Samuel Yuan, Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric
```bibtex
@inproceedings{ZhongETAL25exLongTool,
  author = {Zhong, Linghan and Yuan, Samuel and Zhang, Jiyang and Liu, Yu and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {A Tool for Generating Exceptional Behavior Tests With Large Language Models},
  booktitle = {ACM International Conference on the Foundations of Software Engineering Demonstrations},
  year = {2025},
}
```

- Quick Start 🤗
- Set Up 🚀
- Experiments 👷
- Artifacts ⭐
- CLI 💻
- The exLong dataset is on Hugging Face 🤗!
```python
from datasets import load_dataset

with_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "with-EBT-name")
no_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "no-EBT-name")
```
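To get a feel for the data, you can print the splits and the fields of one example (a minimal sketch; the split and field names are whatever the Hugging Face dataset defines):

```python
from datasets import load_dataset

# Load the dataset variant whose prompts include the EBT name
ds = load_dataset("EngineeringSoftware/exLong-dataset", "with-EBT-name")

# Show the available splits and the schema of one example per split
print(ds)
for split in ds:
    print(split, "->", list(ds[split][0].keys()))
```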
- The exLong model is on Hugging Face 🤗!

```bash
pip install transformers accelerate bitsandbytes peft
```

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
# Load the base model
base_model_name = "codellama/CodeLlama-7b-Instruct-hf"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
# Load the LoRA configuration
peft_model_id = "EngineeringSoftware/exLong"
config = PeftConfig.from_pretrained(peft_model_id, revision="with-etest-name") # set revision to "no-etest-name" for no EBT name
# Load the LoRA model
model = PeftModel.from_pretrained(base_model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = """<s>[INST] <<SYS>>
You are a helpful programming assistant and an expert Java programmer. You are helping a user writing exceptional-behavior tests for their Java code.
<</SYS>>
Please complete an exceptional behavior test method in Java to test the method 'factorial' for the exception 'IllegalArgumentException'.
The method to be tested is defined as:
```java
public static long factorial(int n) {
if (n < 0) {
throw new IllegalArgumentException("Number must be non-negative.");
}
long result = 1;
for (int i = 1; i <= n; i++) {
result *= i;
}
return result;
}
```
Please only give the new exceptional-behavior test method to complete the following test class. Do NOT use extra libraries or define new helper methods. Return **only** the code in the completion:
```java
public class FactorialTest {
}
```
"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Generate code
output = model.generate(
input_ids=input_ids,
max_new_tokens=100,
temperature=0.2, # Sampling temperature (lower is more deterministic)
top_p=0.95, # Top-p (nucleus) sampling
do_sample=True # Enable sampling
)
# Decode and print the generated code
generated_code = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Code:")
print(generated_code)
````
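The setup above installs bitsandbytes; if GPU memory is tight, you can load the base model in 4-bit before attaching the LoRA adapter (a minimal sketch, not the configuration used in the paper):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Quantize the base model to 4-bit to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the exLong LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "EngineeringSoftware/exLong")
```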
- Create conda environment

```bash
conda create -n exlong python=3.9
conda activate exlong
pip install -r requirements.txt
```

- We used axolotl to fine-tune the CodeLlama model. If you want to train your own model, install the extra dependencies:

```bash
# we used an older version of axolotl to train the models
git clone git@github.com:JiyangZhang/axolotl-exlong.git
cd axolotl-exlong/
conda activate exlong
pip install packaging
# set CUDA_HOME
export CUDA_HOME=/opt/apps/cuda/12.0/
pip3 install -e '.[flash-attn,deepspeed]'
```

- Download raw dataset

```bash
mkdir -p _work/data/
mkdir -p _work/exp/
mkdir -p _work/setup/
wget -L https://utexas.box.com/shared/static/hfcp4za3j9vp8lh5u8iviadixuxu8080.gz -O raw-data.tar.gz
tar -xzf raw-data.tar.gz -C _work/data/
mv _work/data/etestgen-raw-data-12k _work/data/ne2e
wget -L https://utexas.box.com/shared/static/4m7mntp0ix18dkl1ikkspcmpuvybfs1f.gz -O ne2e-test.tar.gz
tar -xzf ne2e-test.tar.gz -C _work/data/
wget -L https://utexas.box.com/shared/static/y4e52k5x8vk8vcr59lg33gebcg2m1caw.gz -O rq2.tar.gz
tar -xzf rq2.tar.gz -C _work/data/
# netest-diversity
wget -L https://utexas.box.com/shared/static/j417e93j1rdvdqz2yobttygfhucfbkjm.gz -O netest-diversity.tar.gz
tar -xzf netest-diversity.tar.gz -C _work/data/
```

You should see `_work/data/ne2e`, `_work/data/rq1-eval`, `_work/data/rq2`, and `_work/data/netest-diversity`.
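To confirm that the archives unpacked where the later steps expect them, a quick check (a minimal sketch):

```python
from pathlib import Path

# Directories the download step above should have created
for name in ["ne2e", "rq1-eval", "rq2", "netest-diversity"]:
    path = Path("_work/data") / name
    print(path, "exists" if path.exists() else "MISSING")
```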
- Prepare the datasets and put them in the `_work/setup` directory
- exLong & exLong sample (Tables IV & V)

```bash
# exlong
inv -e data.setup-model-data --setup-name conditionnestack2e-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-no-name-ft
# exlong sample
inv -e data.setup-model-data --setup-name conditionnestack2e-all-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-all-no-name-ft
```

You should see the `_work/setup/conditionnestack2e-with-name-ft/`, `_work/setup/conditionnestack2e-no-name-ft/`, `_work/setup/conditionnestack2e-all-with-name-ft/`, and `_work/setup/conditionnestack2e-all-no-name-ft/` directories.
- Construct prompts for exLong developer-view
- exLong
```bash
inv -e data.process-codellama-data --setup-name conditionnestack2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionnestack2e-no-name-ft
```

- Construct prompts for exLong machine-view

```bash
mkdir _work/setup/conditionnestack2e-all-no-name-ft/eval/ -p
cp -r _work/data/rq2/ _work/setup/conditionnestack2e-all-no-name-ft/eval/
python -m etestgen.codellama.realDataProcessor --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml process_test_data
```

You will see `_work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl`.
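To sanity-check the constructed prompts, you can read a few records from that JSONL file (a minimal sketch; the field names are whatever the processor emits):

```python
import json

# Path produced by the machine-view prompt-construction step above
path = "_work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl"

with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(f"record {i} fields:", list(record.keys()))
        if i == 2:  # inspect only the first three records
            break
```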
- Training exLong w. EBT name
Note: `conditionnestack2e` is the setup name for exLong.

```bash
cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-with-name-7b.yaml
```

You will see checkpoints in the directory `_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/`.
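Once training finishes, the local LoRA checkpoint can be loaded the same way as the released adapter (a minimal sketch, assuming the output directory above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "codellama/CodeLlama-7b-Instruct-hf"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Attach the freshly trained LoRA adapter from the training output directory
model = PeftModel.from_pretrained(
    base_model, "_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/"
)
```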
- Training exLong w.o. EBT name
```bash
cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-no-name-7b.yaml
# script to run on TACC
sbatch axolotl-lora-codellama-7b-conditionnestack2e-no-name.sh
```

You will see checkpoints in the directory `_work/exp/conditionnestack2e-no-name-ft/lora-codellama-7b/`.
- Running exLong inference for developer-view

```bash
cd python/
# Run evaluation on the selected 434 examples in the test set
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-conditionnestack2e-with-name-ft.yaml run_gen --split real-test
```

You will see model outputs in `_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/real-test-set-model-outputs.jsonl`.
- Running exLong inference for machine-view

```bash
cd python/
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml run_gen
# Evaluation 1: all covered projects
python -m etestgen.llm.eval --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_runtime_metrics
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-runtime-metrics.json`
# Evaluation 2: intersection projects
python -m etestgen.llm.eval --eval_set rq2 --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_subset_llm_results --subset_id_file ../results/tool-results/intersect-ids.json
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-intersect-runtime-metrics.json`
```

You will see model generations in `_work/exp/conditionnestack2e-all-no-name-ft/lora-codellama-7b/rq2-model-outputs.jsonl`.
Given the test cases generated by exLong, this step evaluates them with metrics such as BLEU, CodeBLEU, and test coverage.
- Input/Output info
  - The dataset used for evaluation is expected at `_work/{test_data}`
  - Processed LLM predictions are expected at `_work/exp/{setup}/{model_name}/test-results`
  - Similarity metrics results will be written to `_work/exp/{setup}/{model_name}/test-out/similarity_metrics_summary.json` and `results/model-results/{setup}-{exp}-{eval_set}-sim-metrics.json`
  - Runtime metrics will be written to `results/model-results/{setup}-{exp}-{eval_set}-runtime-metrics.json`, and individual results will be at `_work/exp/{setup}/{model_name}/test-results/metrics.jsonl`
- To run evaluation on similarity metrics for an individual experiment:

  ```bash
  python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_llm_sim
  ```

- To run evaluation on runtime metrics for an individual experiment:

  ```bash
  python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_runtime_metrics
  ```
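After an evaluation run, a quick way to inspect the summary numbers is to load the JSON files listed above (a minimal sketch; the path below substitutes example `{setup}` and `{model_name}` values, and the metric keys depend on the evaluation that was run):

```python
import json
from pathlib import Path

# Example summary path following the patterns above; adjust to your setup
sim_path = Path(
    "_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b"
    "/test-out/similarity_metrics_summary.json"
)

if sim_path.exists():
    summary = json.loads(sim_path.read_text())
    for metric, value in summary.items():
        print(f"{metric}: {value}")
```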
- Prepare Dataset
```bash
mkdir -p _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
mkdir -p _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cd python/
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml process_real_test_data
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml process_real_test_data
```

You will see the processed data in `_work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/` and `_work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/`.
- Running Inference
```bash
# [2nd row] use the same exLong ckpt but prompt with the same nEBT multiple times
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/
# [3rd row] use the same exLong ckpt but prompt with different nEBTs
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/
```

You will see model outputs in the directories `_work/exp/diversity-conditionnestack2e-sample-with-name-ft/lora-codellama-7b/` and `_work/exp/diversity-conditionnestack2e-all-with-name-ft/lora-codellama-7b/`.
- Prepare Dataset
```bash
inv -e data.setup-model-data --setup-name conditionne2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionne2e-with-name-ft
```

- Running Inference

```bash
python -m etestgen.codellama.CodeLLaMA --config_file python/configs/eval/codellama-7b-conditionne2e-with-name-ft.yaml run_gen
```

- Prepare Dataset

```bash
inv -e data.setup-model-data --setup-name ne2e-with-name-ft
inv -e data.process-codellama-data --setup-name ne2e-with-name-ft
```

- Running Inference

```bash
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-ne2e-with-name-ft.yaml run_gen
```

- Prepare Dataset

```bash
inv -e data.setup-model-data --setup-name mut2e-with-name-ft
inv -e data.process-codellama-data --setup-name mut2e-with-name-ft
```

- Running Inference

```bash
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-mut2e-with-name-ft.yaml run_gen
```

- exLong-with-name (7B and 13B): exLong models in Table IV, Table VI, and Table VIII.
- exLong-no-name (7B): exLong models in Table V.
- exLong-with-name w.o. stack trace (7B): exLong no stack trace model in Table VI.
- exLong-with-name w.o. stack trace & guard expr (7B): exLong no stack trace & no guard expr model in Table VI.
- exLong-with-name w.o. stack trace & guard expr & EBT (7B): exLong no stack trace & no guard expr & no EBT model in Table VI.
- exLong-with-name w.o. stack trace & guard expr & EBT (13B): exLong 13B no stack trace & no guard expr & no EBT model in Table VIII.
- repos.tar.gz: The list of repositories from which we collected the dataset.
- raw-data.tar.gz: The raw data collected from the open-source repositories (extracts to `etestgen-raw-data-12k/`).
- ne2e-test.tar.gz: The collected dataset for evaluation in developer-view (extracts to `rq1-eval/`).
- machine-view.tar.gz: The collected dataset for evaluation in machine-view (extracts to `rq2/`).
- netest-diversity.tar.gz: The collected dataset used to study how different nEBTs affect the model's performance (Table VII; extracts to `netest-diversity/`).
- processed dataset: The processed dataset (prompts) used to train the exLong models.
In the User View, exLong generates an EBT for a user-specified target throw statement. Use the following command:
```bash
python -m etestgen.cli user_view [OPTIONS]
```

- `--repo_path`: Local path or remote link to the git repository
- `--mut_file_path`: Path to the file containing the MUT (method under test)
- `--mut_line`: Line number of the beginning of the MUT's definition
- `--throw_file_path`: Path to the file containing the target throw statement
- `--throw_line`: Line number of the target throw statement
- `--test_context_path`: Path to the test file
- `--sha`: Commit SHA (default: latest commit on the main branch)
- `--test_name`: Name of the test method to be generated (default: none)
- `--quant`: Whether to use the quantized LLM (default: true)
- `--pick_best`: Whether to sample multiple candidate EBTs and select the best test based on runtime evaluation (default: false)
- `--output_file`: Output file path for the generated EBT; if not given, the EBTs are added to the test class in `test_context_path`
- `--regenerate_data`: Whether to collect the "stack trace", "guard condition", etc. (default: true)
```bash
python -m etestgen.cli user_view \
    --repo_path=./Wisp \
    --mut_file_path=Scheduler.java \
    --mut_line=180 \
    --quant=true \
    --throw_file_path=Scheduler.java \
    --throw_line=340 \
    --test_context_path=SchedulerTest.java \
    --sha="ce1d9f3cb1944115ad98b4428ea24b24ab3faf56" \
    --test_name=testSchedulerError \
    --pick_best=True \
    --output_file=./ExlongTest.java
```

This command will generate an exceptional-behavior test for the throw statement at line 340 in Scheduler.java, targeting the method that begins at line 180, and write the result to ExlongTest.java.
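To run the user view over several throw statements in one go, you can drive the CLI from a small script (a minimal sketch; the line numbers and test names below are hypothetical):

```python
import subprocess

# Hypothetical (mut_line, throw_line, test_name) targets in the same file
targets = [
    (180, 340, "testSchedulerError"),
    (200, 365, "testSchedulerShutdownError"),
]

for mut_line, throw_line, test_name in targets:
    # Invoke the user view once per target throw statement
    subprocess.run(
        [
            "python", "-m", "etestgen.cli", "user_view",
            "--repo_path=./Wisp",
            "--mut_file_path=Scheduler.java",
            f"--mut_line={mut_line}",
            "--throw_file_path=Scheduler.java",
            f"--throw_line={throw_line}",
            "--test_context_path=SchedulerTest.java",
            f"--test_name={test_name}",
            f"--output_file=./Exlong_{test_name}.java",
        ],
        check=True,
    )
```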
The Machine View generates EBTs for the entire codebase to cover all throw statements automatically. Use the following command:
```bash
python -m etestgen.cli machine_view [OPTIONS]
```

- `--repo_path` or `--repo_link`: Local path or remote link to the git repository
- `--test_context_path`: Path to the test file
- `--sha`: Commit SHA (default: latest commit on the main branch)
- `--pick_best`: Whether to sample multiple candidate EBTs for each throw statement and select the best test based on runtime evaluation (default: false)
- `--quant`: Whether to use the quantized LLM (default: true)
- `--timeout`: Time budget in seconds for the tool to finish processing (default: infinity)
- `--output_file`: Output file path for the generated EBTs; if not given, the EBTs are added to the test class in `test_context_path`
- `--regenerate_data`: Whether to collect the "stack trace", "guard condition", etc. (default: true)
```bash
python -m etestgen.cli machine_view \
    --repo_link="https://github.com/Coreoz/Wisp.git" \
    --sha="ce1d9f3cb1944115ad98b4428ea24b24ab3faf56" \
    --timeout=1000
```

This command will analyze the entire Wisp repository at the specified commit and generate EBTs for all throw statements within the given time budget of 1000 seconds.