This repository maintains the code and data for the paper "Verbosity ≠ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models"
In this paper, we discover an understudied type of undesirable behavior of LLMs, which we term Verbosity Compensation (VC) — similar to the hesitation behavior of humans under uncertainty — where they respond with excessive words such as repeating questions, introducing ambiguity, or providing excessive enumeration.
In this figure, we ask the model to generate as concisely as possible. In the first response, the LLM generates a concise answer that is correct, with low uncertainty. In the second and third responses, instead of generating the answer concisely, such as “16.5”, the LLM repeats the question and produces ambiguity, leading to a VC response with low performance and high uncertainty.
Install the following packages:
pip install nltk retrying fuzzywuzzy rouge_score  # difflib is part of the Python standard library and needs no install
pip install openai anthropic google-generativeai
pip install torch transformers
First, put the raw datasets in the `dataset` folder. The raw datasets used to create our 5 datasets are:
- Qasper: SCROLLS
- LongBench: LongBench
- NarrativeQA: SCROLLS
- NQ30: Lost-in-the-middle
- MMLU: MMLU
Then, run `preprocessed/{dataset_name}.py` to preprocess the data; the preprocessed dataset will be written to the `preprocessed/dataset` folder.
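For example, a single preprocessing run might look like this (the script name `scrolls_qasper.py` is an assumption; use whichever `{dataset_name}.py` file in `preprocessed/` matches your dataset):

```bash
# Preprocess one raw dataset (script name is a placeholder; see preprocessed/ for the real ones)
python preprocessed/scrolls_qasper.py

# The preprocessed files end up here
ls preprocessed/dataset
```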
The code for running LLMs is in the root folder, named `{model_name}.py`. Let's take `gpt.py` as an example.
- First, change the API key in `gpt.py` and set up the commands in `scripts/gpt3.5.sh`.
- Then, run `bash scripts/gpt3.5.sh`.
- After the run is done, the results will be in the `result/{dataset_name}/{model_name}` folder (see the example below).
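For example, after the script finishes you can check the produced prediction files (the dataset and model folder names below follow the example result path given in the evaluation step):

```bash
# Launch the GPT-3.5 experiments defined in the shell script
bash scripts/gpt3.5.sh

# When it is done, inspect the predictions for one dataset/model pair
ls result/scrolls_qasper/gpt-3.5-turbo-0125/
```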
We also upload the predictions on the 5 datasets used in our paper to Google Drive.
To evaluate the results, find the result file, e.g. `result/scrolls_qasper/gpt-3.5-turbo-0125/12000.json`, then open `analysis/calculate_VC.py` and replace the directory with this file path to get detailed evaluation results.
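A minimal evaluation sketch, assuming `calculate_VC.py` is configured by editing the path inside the file rather than via command-line flags:

```bash
# 1. Edit the directory inside analysis/calculate_VC.py so it points to your result file, e.g.
#    result/scrolls_qasper/gpt-3.5-turbo-0125/12000.json
# 2. Run the analysis from the repository root
python analysis/calculate_VC.py
```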
├── analysis # VC analysis
│ └── calculate_VC.py # The file to compute VC statistics
│
├── dataset # the folder to store raw datasets
│
├── metrics/metric_lib # Metrics for evaluation
│ ├── f1.py # F1 and recall computation for QA tasks
│ └── longbench.py # F1 from LongBench repo
│
├── preprocessed
│ ├── dataset # Preprocessed dataset
│ └── {dataset_name}.py # Code for preprocessing raw datasets
│
├── results # Folder containing the running results
│
├── scripts # Shell files for running LLMs
│ └── {model_name}.sh # Shell for a model
│
├── {model_name}.py # Code for running one model
│ └── ... # Please check the README in this directory.
│
└── README.md # Where you are reading now ^_^
Overall recall comparison between verbose and concise responses. Bold/Underline indicates the largest positive/negative performance gap between verbose and concise responses. The verbose responses obtain a significantly different performance than the concise ones, demonstrating the strong relationship between verbosity and performance.
Overall recall comparison between verbose and concise responses. Bold/Underline indicates the largest positive/negative performance gap between verbose and concise responses. Similar to the previous table, the verbose responses obtain a significantly different performance than the concise ones.
If you would like to add your own dataset to the evaluation, you can follow these steps (an end-to-end sketch follows the list):
- Put the dataset in the `dataset` folder.
- Write the code for preprocessing the raw dataset and put it in the `preprocessed` folder.
- Run the models in the root folder.
- Analyze the results using `calculate_VC.py` and observe Verbosity Compensation.
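Putting the steps together, an end-to-end run for a hypothetical new dataset called `mydataset` could look like this (the preprocessing script and shell script are placeholders that you would write yourself):

```bash
# 1. Copy the raw data into the dataset/ folder
cp -r /path/to/mydataset dataset/

# 2. Preprocess it with your own script (placeholder name)
python preprocessed/mydataset.py

# 3. Run a model from the root folder, e.g. GPT-3.5 via its shell script
bash scripts/gpt3.5.sh

# 4. Edit the result path inside analysis/calculate_VC.py, then compute the VC statistics
python analysis/calculate_VC.py
```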
Our dataset, predictions, and code are provided under the CC BY-SA 4.0 license.
Please also take a look at the license information of the datasets we use to construct ours.
- SCROLLS: MIT license
- Qasper: CC BY 4.0 license
- NarrativeQA: Apache-2.0 license
- LongBench: MIT license
- 2wikimqa: Apache-2.0 license
- hotpotqa: CC BY-SA 4.0 license
- multifieldqa_en: MIT license
- musique: CC BY 4.0 license
- Lost-in-the-middle: MIT license
- Natural Questions (NQ): Apache-2.0 license
- MMLU: MIT license
@article{zhang2024verbosity,
title={Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models},
author={Zhang, Yusen and Das, Sarkar Snigdha Sarathi and Zhang, Rui},
journal={arXiv preprint arXiv:2411.07858},
year={2024}
}