This repository maintains the code and data for the paper "Verbosity ≠ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models"
In this paper, we discover an understudied type of undesirable behavior of LLMs, which we term Verbosity Compensation (VC) — similar to the hesitation behavior of humans under uncertainty — where they respond with excessive words such as repeating questions, introducing ambiguity, or providing excessive enumeration.
In this figure, we ask the model to generate as concisely as possible. In the first response, the LLM generates a concise answer that is correct, with low uncertainty. In the second and third responses, instead of generating the answer concisely, such as “16.5”, the LLM repeats the question and produces ambiguity, leading to a VC response with low performance and high uncertainty.
Install the following packages:
pip install nltk retrying fuzzywuzzy rouge_score  # difflib is part of the Python standard library and needs no install
pip install openai anthropic google-generativeai
pip install torch transformers
First, put the raw datasets in the `dataset` folder. The raw datasets used to create our 5 datasets are:
- Qasper: SCROLLS
- LongBench: LongBench
- NarrativeQA: SCROLLS
- NQ30: Lost-in-the-middle
- MMLU: MMLU
Then, run `preprocessed/{dataset_name}.py` to preprocess the data; the preprocessed dataset will be written to the `preprocessed/dataset` folder.
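For example, a single preprocessing run might look like this (the script name `scrolls_qasper.py` is an assumption; use whichever `{dataset_name}.py` file in `preprocessed/` matches your dataset):

```bash
# Preprocess one raw dataset (script name is a placeholder; see preprocessed/ for the real ones)
python preprocessed/scrolls_qasper.py

# The preprocessed files end up here
ls preprocessed/dataset
```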
The code for running LLMs is in the root folder, named `{model_name}.py`. Let's take `gpt.py` as an example.
- First, change the API key in `gpt.py` and set up the commands in `scripts/gpt3.5.sh`.
- Then, run `bash scripts/gpt3.5.sh`.
- After the run is done, the results will be in the `result/{dataset_name}/{model_name}` folder (see the example below).
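For example, after the script finishes you can check the produced prediction files (the dataset and model folder names below follow the example result path given in the evaluation step):

```bash
# Launch the GPT-3.5 experiments defined in the shell script
bash scripts/gpt3.5.sh

# When it is done, inspect the predictions for one dataset/model pair
ls result/scrolls_qasper/gpt-3.5-turbo-0125/
```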
We also upload the predictions on the 5 datasets used in our paper to Google Drive.
To evaluate the results, find the result file, e.g. `result/scrolls_qasper/gpt-3.5-turbo-0125/12000.json`, then open `analysis/calculate_VC.py` and replace the directory with this file path to get detailed evaluation results.
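A minimal evaluation sketch, assuming `calculate_VC.py` is configured by editing the path inside the file rather than via command-line flags:

```bash
# 1. Edit the directory inside analysis/calculate_VC.py so it points to your result file, e.g.
#    result/scrolls_qasper/gpt-3.5-turbo-0125/12000.json
# 2. Run the analysis from the repository root
python analysis/calculate_VC.py
```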
├── analysis # VC analysis
│ └── calculate_VC.py # The file to compute VC statistics
│
├── dataset # the folder to store raw datasets
│
├── metrics/metric_lib # Metrics for evaluation
│ ├── f1.py # F1 and recall computation for QA tasks
│ └── longbench.py # F1 from LongBench repo
│
├── preprocessed
│ ├── dataset # Preprocessed dataset
│ └── {dataset_name}.py # Code for preprocessing raw datasets
│
├── results # Folder containing the running results
│
├── scripts # Shell files for running LLMs
│ └── {model_name}.sh # Shell for a model
│
├── {model_name}.py # Code for running one model
│ └── ... # Please check the README in this directory.
│
└── README.md # Where you are reading now ^_^
Overall recall comparison between verbose and concise responses. Bold/Underline indicates the largest positive/negative performance gap between verbose and concise responses. The verbose responses obtain a significantly different performance than the concise ones, demonstrating the strong relationship between verbosity and performance.
Overall recall comparison between verbose and concise responses. Bold/Underline indicates the largest positive/negative performance gap between verbose and concise responses. Similar to the previous table, the verbose responses obtain a significantly different performance than the concise ones.
If you would like to add your own dataset to the evaluation, you can follow these steps (an end-to-end sketch follows the list):
- Put the dataset in the `dataset` folder.
- Write the code for preprocessing the raw dataset and put it in the `preprocessed` folder.
- Run the models in the root folder.
- Analyze the results using `calculate_VC.py` and observe Verbosity Compensation.
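Putting the steps together, an end-to-end run for a hypothetical new dataset called `mydataset` could look like this (the preprocessing script and shell script are placeholders that you would write yourself):

```bash
# 1. Copy the raw data into the dataset/ folder
cp -r /path/to/mydataset dataset/

# 2. Preprocess it with your own script (placeholder name)
python preprocessed/mydataset.py

# 3. Run a model from the root folder, e.g. GPT-3.5 via its shell script
bash scripts/gpt3.5.sh

# 4. Edit the result path inside analysis/calculate_VC.py, then compute the VC statistics
python analysis/calculate_VC.py
```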
Our dataset, predictions, and code are provided under the CC BY-SA 4.0 license.
Please also take a look at the license information of the datasets we use to construct ours.
- SCROLLS: MIT license
- Qasper: CC BY 4.0 license
- NarrativeQA: Apache-2.0 license
- LongBench: MIT license
- 2wikimqa: Apache-2.0 license
- hotpotqa: CC BY-SA 4.0 license
- multifieldqa_en: MIT license
- musique: CC BY 4.0 license
- Lost-in-the-middle: MIT license
- Natural Questions (NQ): Apache-2.0 license
- MMLU: MIT license
@article{zhang2024verbosity,
title={Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models},
author={Zhang, Yusen and Das, Sarkar Snigdha Sarathi and Zhang, Rui},
journal={arXiv preprint arXiv:2411.07858},
year={2024}
}