exLong is a large language model, instruction-tuned from CodeLlama, that embeds reasoning about:
- traces that lead to throw statements
- conditional expressions that guard throw statements
- non-exceptional behavior tests that execute similar traces
This repo hosts the code and data for the following ICSE 2025 paper:
Title: exLong: Generating Exceptional Behavior Tests with Large Language Models
Authors: Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric
```bibtex
@inproceedings{ZhangETAL25exLong,
  author = {Zhang, Jiyang and Liu, Yu and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {exLong: Generating Exceptional Behavior Tests with Large Language Models},
  booktitle = {International Conference on Software Engineering},
  year = {2025},
}
```

We also include the implementation of the CLI tool described in our FSE 2025 Demo Paper:
Title: A Tool for Generating Exceptional Behavior Tests With Large Language Models
Authors: Linghan Zhong, Samuel Yuan, Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric
```bibtex
@inproceedings{ZhongETAL25exLongTool,
  author = {Zhong, Linghan and Yuan, Samuel and Zhang, Jiyang and Liu, Yu and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {A Tool for Generating Exceptional Behavior Tests With Large Language Models},
  booktitle = {ACM International Conference on the Foundations of Software Engineering Demonstrations},
  year = {2025},
}
```

- Quick Start 🤗
- Set Up 🚀
- Experiments 👷
- Artifacts ⭐
- CLI 💻
- The exLong dataset is on Hugging Face 🤗!
```python
from datasets import load_dataset

with_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "with-EBT-name")
no_name_ds = load_dataset("EngineeringSoftware/exLong-dataset", "no-EBT-name")
```
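To get a feel for the data, you can print the splits and the fields of one example (a minimal sketch; the split and field names are whatever the Hugging Face dataset defines):

```python
from datasets import load_dataset

# Load the dataset variant whose prompts include the EBT name
ds = load_dataset("EngineeringSoftware/exLong-dataset", "with-EBT-name")

# Show the available splits and the schema of one example per split
print(ds)
for split in ds:
    print(split, "->", list(ds[split][0].keys()))
```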
- The exLong model is on Hugging Face 🤗!

```bash
pip install transformers accelerate bitsandbytes peft
```

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
# Load the base model
base_model_name = "codellama/CodeLlama-7b-Instruct-hf"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
# Load the LoRA configuration
peft_model_id = "EngineeringSoftware/exLong"
config = PeftConfig.from_pretrained(peft_model_id, revision="with-etest-name") # set revision to "no-etest-name" for no EBT name
# Load the LoRA model
model = PeftModel.from_pretrained(base_model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
prompt = """<s>[INST] <<SYS>>
You are a helpful programming assistant and an expert Java programmer. You are helping a user writing exceptional-behavior tests for their Java code.
<</SYS>>
Please complete an exceptional behavior test method in Java to test the method 'factorial' for the exception 'IllegalArgumentException'.
The method to be tested is defined as:
```java
public static long factorial(int n) {
if (n < 0) {
throw new IllegalArgumentException("Number must be non-negative.");
}
long result = 1;
for (int i = 1; i <= n; i++) {
result *= i;
}
return result;
}
```
Please only give the new exceptional-behavior test method to complete the following test class. Do NOT use extra libraries or define new helper methods. Return **only** the code in the completion:
```java
public class FactorialTest {
}
```
"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Generate code
output = model.generate(
input_ids=input_ids,
max_new_tokens=100,
temperature=0.2, # Sampling temperature (lower is more deterministic)
top_p=0.95, # Top-p (nucleus) sampling
do_sample=True # Enable sampling
)
# Decode and print the generated code
generated_code = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Code:")
print(generated_code)
````
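The setup above installs bitsandbytes; if GPU memory is tight, you can load the base model in 4-bit before attaching the LoRA adapter (a minimal sketch, not the configuration used in the paper):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Quantize the base model to 4-bit to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the exLong LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "EngineeringSoftware/exLong")
```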
- Create conda environment

```bash
conda create -n exlong python=3.9
conda activate exlong
pip install -r requirements.txt
```

- We used axolotl to fine-tune the CodeLlama model. If you want to train your own model, install the extra dependencies:

```bash
# we used an older version of axolotl to train the models
git clone git@github.com:JiyangZhang/axolotl-exlong.git
cd axolotl-exlong/
conda activate exlong
pip install packaging
# set CUDA_HOME
export CUDA_HOME=/opt/apps/cuda/12.0/
pip3 install -e '.[flash-attn,deepspeed]'
```

- Download raw dataset

```bash
mkdir -p _work/data/
mkdir -p _work/exp/
mkdir -p _work/setup/
wget -L https://utexas.box.com/shared/static/hfcp4za3j9vp8lh5u8iviadixuxu8080.gz -O raw-data.tar.gz
tar -xzf raw-data.tar.gz -C _work/data/
mv _work/data/etestgen-raw-data-12k _work/data/ne2e
wget -L https://utexas.box.com/shared/static/4m7mntp0ix18dkl1ikkspcmpuvybfs1f.gz -O ne2e-test.tar.gz
tar -xzf ne2e-test.tar.gz -C _work/data/
wget -L https://utexas.box.com/shared/static/y4e52k5x8vk8vcr59lg33gebcg2m1caw.gz -O rq2.tar.gz
tar -xzf rq2.tar.gz -C _work/data/
# netest-diversity
wget -L https://utexas.box.com/shared/static/j417e93j1rdvdqz2yobttygfhucfbkjm.gz -O netest-diversity.tar.gz
tar -xzf netest-diversity.tar.gz -C _work/data/
```

You should see `_work/data/ne2e`, `_work/data/rq1-eval`, `_work/data/rq2`, and `_work/data/netest-diversity`.
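To confirm that the archives unpacked where the later steps expect them, a quick check (a minimal sketch):

```python
from pathlib import Path

# Directories the download step above should have created
for name in ["ne2e", "rq1-eval", "rq2", "netest-diversity"]:
    path = Path("_work/data") / name
    print(path, "exists" if path.exists() else "MISSING")
```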
- Prepare the datasets and put them in the `_work/setup` directory
- exLong & exLong sample (Tables IV & V)

```bash
# exlong
inv -e data.setup-model-data --setup-name conditionnestack2e-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-no-name-ft
# exlong sample
inv -e data.setup-model-data --setup-name conditionnestack2e-all-with-name-ft
inv -e data.setup-model-data --setup-name conditionnestack2e-all-no-name-ft
```

You should see the `_work/setup/conditionnestack2e-with-name-ft/`, `_work/setup/conditionnestack2e-no-name-ft/`, `_work/setup/conditionnestack2e-all-with-name-ft/`, and `_work/setup/conditionnestack2e-all-no-name-ft/` directories.
- Construct prompts for exLong developer-view
- exLong
```bash
inv -e data.process-codellama-data --setup-name conditionnestack2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionnestack2e-no-name-ft
```

- Construct prompts for exLong machine-view

```bash
mkdir _work/setup/conditionnestack2e-all-no-name-ft/eval/ -p
cp -r _work/data/rq2/ _work/setup/conditionnestack2e-all-no-name-ft/eval/
python -m etestgen.codellama.realDataProcessor --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml process_test_data
```

You will see `_work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl`.
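To sanity-check the constructed prompts, you can read a few records from that JSONL file (a minimal sketch; the field names are whatever the processor emits):

```python
import json

# Path produced by the machine-view prompt-construction step above
path = "_work/setup/conditionnestack2e-all-no-name-ft/eval/rq2/test-conditionnestack2e-all-no-name-ft.jsonl"

with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(f"record {i} fields:", list(record.keys()))
        if i == 2:  # inspect only the first three records
            break
```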
- Training exLong w. EBT name
Note: `conditionnestack2e` is the setup name for exLong.

```bash
cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-with-name-7b.yaml
```

You will see checkpoints in the directory `_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/`.
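Once training finishes, the local LoRA checkpoint can be loaded the same way as the released adapter (a minimal sketch, assuming the output directory above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "codellama/CodeLlama-7b-Instruct-hf"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Attach the freshly trained LoRA adapter from the training output directory
model = PeftModel.from_pretrained(
    base_model, "_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/"
)
```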
- Training exLong w.o. EBT name
```bash
cd python/
accelerate launch -m axolotl.cli.train configs/axolotl/axolotl-conditionnestack2e-no-name-7b.yaml
# script to run on TACC
sbatch axolotl-lora-codellama-7b-conditionnestack2e-no-name.sh
```

You will see checkpoints in the directory `_work/exp/conditionnestack2e-no-name-ft/lora-codellama-7b/`.
- Running exLong inference for developer-view

```bash
cd python/
# Run evaluation on the selected 434 examples in the test set
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-conditionnestack2e-with-name-ft.yaml run_gen --split real-test
```

You will see model outputs in `_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/real-test-set-model-outputs.jsonl`.
- Running exLong inference for machine-view

```bash
cd python/
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml run_gen
# Evaluation 1: all covered projects
python -m etestgen.llm.eval --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_runtime_metrics
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-runtime-metrics.json`
# Evaluation 2: intersection projects
python -m etestgen.llm.eval --eval_set rq2 --config_file configs/eval-codellama-7b-machine-view-conditionnestack2e-all-no-name.yaml eval_subset_llm_results --subset_id_file ../results/tool-results/intersect-ids.json
# You will see eval results in `results/model-results/conditionnestack2e-all-no-name-ft-lora-codellama-7b-eval-rq2-intersect-runtime-metrics.json`
```

You will see model generations in `_work/exp/conditionnestack2e-all-no-name-ft/lora-codellama-7b/rq2-model-outputs.jsonl`.
Given the test cases generated by exLong, this step evaluates them with metrics such as BLEU, CodeBLEU, and test coverage.
- Input/Output info
  - The dataset used for evaluation is expected at `_work/{test_data}`
  - Processed LLM predictions are expected at `_work/exp/{setup}/{model_name}/test-results`
  - Similarity metrics results will be written to `_work/exp/{setup}/{model_name}/test-out/similarity_metrics_summary.json` and `results/model-results/{setup}-{exp}-{eval_set}-sim-metrics.json`
  - Runtime metrics will be written to `results/model-results/{setup}-{exp}-{eval_set}-runtime-metrics.json`, and individual results will be at `_work/exp/{setup}/{model_name}/test-results/metrics.jsonl`
- To run evaluation on similarity metrics for an individual experiment:

  ```bash
  python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_llm_sim
  ```

- To run evaluation on runtime metrics for an individual experiment:

  ```bash
  python -m etestgen.llm.eval --eval_set test --config_file [/path/to/config/file] eval_runtime_metrics
  ```
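After an evaluation run, a quick way to inspect the summary numbers is to load the JSON files listed above (a minimal sketch; the path below substitutes example `{setup}` and `{model_name}` values, and the metric keys depend on the evaluation that was run):

```python
import json
from pathlib import Path

# Example summary path following the patterns above; adjust to your setup
sim_path = Path(
    "_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b"
    "/test-out/similarity_metrics_summary.json"
)

if sim_path.exists():
    summary = json.loads(sim_path.read_text())
    for metric, value in summary.items():
        print(f"{metric}: {value}")
```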
- Prepare Dataset
```bash
mkdir -p _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
mkdir -p _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/
cp -r _work/data/netest-diversity/* _work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/
cd python/
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml process_real_test_data
python -m etestgen.codellama.DataProcessor --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml process_real_test_data
```

You will see the processed data in `_work/setup/diversity-conditionnestack2e-all-with-name-ft/real-eval/test/` and `_work/setup/diversity-conditionnestack2e-sample-with-name-ft/real-eval/test/`.
- Running Inference
```bash
# [2nd row] use the same exLong ckpt but prompt with the same nEBT multiple times
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-sample-with-name-ft.yaml run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/
# [3rd row] use the same exLong ckpt but prompt with different nEBTs
python -m etestgen.codellama.CodeLLaMA --config_file configs/codellama-7b-diversity-conditionnestack2e-all-with-name-ft.yaml run_gen --split real-test --target_ckpt ../_work/exp/conditionnestack2e-with-name-ft/lora-codellama-7b/
```

You will see model outputs in the directories `_work/exp/diversity-conditionnestack2e-sample-with-name-ft/lora-codellama-7b/` and `_work/exp/diversity-conditionnestack2e-all-with-name-ft/lora-codellama-7b/`.
- Prepare Dataset
```bash
inv -e data.setup-model-data --setup-name conditionne2e-with-name-ft
inv -e data.process-codellama-data --setup-name conditionne2e-with-name-ft
```

- Running Inference

```bash
python -m etestgen.codellama.CodeLLaMA --config_file python/configs/eval/codellama-7b-conditionne2e-with-name-ft.yaml run_gen
```

- Prepare Dataset

```bash
inv -e data.setup-model-data --setup-name ne2e-with-name-ft
inv -e data.process-codellama-data --setup-name ne2e-with-name-ft
```

- Running Inference

```bash
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-ne2e-with-name-ft.yaml run_gen
```

- Prepare Dataset

```bash
inv -e data.setup-model-data --setup-name mut2e-with-name-ft
inv -e data.process-codellama-data --setup-name mut2e-with-name-ft
```

- Running Inference

```bash
python -m etestgen.codellama.CodeLLaMA --config_file configs/eval/codellama-7b-mut2e-with-name-ft.yaml run_gen
```

- exLong-with-name (7B and 13B): exLong models in Table IV, Table VI, and Table VIII.
- exLong-no-name (7B): exLong models in Table V.
- exLong-with-name w.o. stack trace (7B): exLong no stack trace model in Table VI.
- exLong-with-name w.o. stack trace & guard expr (7B): exLong no stack trace & no guard expr model in Table VI.
- exLong-with-name w.o. stack trace & guard expr & EBT (7B): exLong no stack trace & no guard expr & no EBT model in Table VI.
- exLong-with-name w.o. stack trace & guard expr & EBT (13B): exLong 13B no stack trace & no guard expr & no EBT model in Table VIII.
- repos.tar.gz: The list of repositories from which we collected the dataset.
- raw-data.tar.gz: The raw data collected from the open-source repositories (extracts to `etestgen-raw-data-12k/`).
- ne2e-test.tar.gz: The collected dataset for evaluation in developer-view (extracts to `rq1-eval/`).
- machine-view.tar.gz: The collected dataset for evaluation in machine-view (extracts to `rq2/`).
- netest-diversity.tar.gz: The collected dataset used to study how different nEBTs affect the model's performance (Table VII; extracts to `netest-diversity/`).
- processed dataset: The processed dataset (prompts) used to train the exLong models.
In the User View, exLong generates an EBT for a user-specified target throw statement. Use the following command:
```bash
python -m etestgen.cli user_view [OPTIONS]
```

- `--repo_path`: Local path or remote link to the git repository
- `--mut_file_path`: Path to the file containing the MUT (method under test)
- `--mut_line`: Line number of the beginning of the MUT's definition
- `--throw_file_path`: Path to the file containing the target throw statement
- `--throw_line`: Line number of the target throw statement
- `--test_context_path`: Path to the test file
- `--sha`: Commit SHA (default: latest commit on the main branch)
- `--test_name`: Name of the test method to be generated (default: none)
- `--quant`: Whether to use the quantized LLM (default: true)
- `--pick_best`: Whether to sample multiple candidate EBTs and select the best test based on runtime evaluation (default: false)
- `--output_file`: Output file path for the generated EBT; if not given, the EBTs are added to the test class in `test_context_path`
- `--regenerate_data`: Whether to collect the "stack trace", "guard condition", etc. (default: true)
```bash
python -m etestgen.cli user_view \
    --repo_path=./Wisp \
    --mut_file_path=Scheduler.java \
    --mut_line=180 \
    --quant=true \
    --throw_file_path=Scheduler.java \
    --throw_line=340 \
    --test_context_path=SchedulerTest.java \
    --sha="ce1d9f3cb1944115ad98b4428ea24b24ab3faf56" \
    --test_name=testSchedulerError \
    --pick_best=True \
    --output_file=./ExlongTest.java
```

This command will generate an exceptional-behavior test for the throw statement at line 340 in Scheduler.java, targeting the method that begins at line 180, and write the result to ExlongTest.java.
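To run the user view over several throw statements in one go, you can drive the CLI from a small script (a minimal sketch; the line numbers and test names below are hypothetical):

```python
import subprocess

# Hypothetical (mut_line, throw_line, test_name) targets in the same file
targets = [
    (180, 340, "testSchedulerError"),
    (200, 365, "testSchedulerShutdownError"),
]

for mut_line, throw_line, test_name in targets:
    # Invoke the user view once per target throw statement
    subprocess.run(
        [
            "python", "-m", "etestgen.cli", "user_view",
            "--repo_path=./Wisp",
            "--mut_file_path=Scheduler.java",
            f"--mut_line={mut_line}",
            "--throw_file_path=Scheduler.java",
            f"--throw_line={throw_line}",
            "--test_context_path=SchedulerTest.java",
            f"--test_name={test_name}",
            f"--output_file=./Exlong_{test_name}.java",
        ],
        check=True,
    )
```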
The Machine View generates EBTs for the entire codebase to cover all throw statements automatically. Use the following command:
```bash
python -m etestgen.cli machine_view [OPTIONS]
```

- `--repo_path` or `--repo_link`: Local path or remote link to the git repository
- `--test_context_path`: Path to the test file
- `--sha`: Commit SHA (default: latest commit on the main branch)
- `--pick_best`: Whether to sample multiple candidate EBTs for each throw statement and select the best test based on runtime evaluation (default: false)
- `--quant`: Whether to use the quantized LLM (default: true)
- `--timeout`: Time budget in seconds for the tool to finish processing (default: infinity)
- `--output_file`: Output file path for the generated EBTs; if not given, the EBTs are added to the test class in `test_context_path`
- `--regenerate_data`: Whether to collect the "stack trace", "guard condition", etc. (default: true)
```bash
python -m etestgen.cli machine_view \
    --repo_link="https://github.com/Coreoz/Wisp.git" \
    --sha="ce1d9f3cb1944115ad98b4428ea24b24ab3faf56" \
    --timeout=1000
```

This command will analyze the entire Wisp repository at the specified commit and generate EBTs for all throw statements within the given time budget of 1000 seconds.