Evaluating LLM performance and sensitivity when there is a "task-switch"
This is the codebase for the paper: "LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History" by Akash Gupta, Ivaxi Sheth, Vyas Raina, Mark Gales, and Mario Fritz.
Motivation
Typically, when an LLM responds to a user prompt, the model conditions on the prior conversation history, which provides a basic form of short-term memory. This sensitivity to the history is generally beneficial, but it can be counterproductive when there is a "task-switch". In this repo, we evaluate the performance of models when switching tasks.
Figure 1: An illustrative example of a task-switch.
Top box: The chat history is based on sentiment prediction. An algebra word problem introduces a task-switch, which results in an incorrect prediction.
Bottom box: The LLM is well behaved when there is no conversation history.
We use the example above to define some terms used throughout the repository:
- A turn consists of a user prompt $u$ and a system response $r$.
  - In the example above, the Top box shows 3 turns, while the Bottom box has 1 turn.
- A conversation history (CH) consists of multiple turns: $\boldsymbol{h} = \{(u_k, r_k)\}_{k=1}^{L}$, where $L$ is the length of the conversation history.
  - In the example above, the Top box has a conversation history length $L = 2$, whereas the Bottom box has no conversation history ($L = 0$). This is equivalent to a "zero-shot" setting.
- `incontext_data`: this is the dataset used to provide teacher-forced examples to form a conversation history.
  - In the example above, this dataset is the one for "Sentiment Prediction" (e.g. rotten tomatoes).
- A target task is the task performed upon a task-switch. The dataset used for this task is the `eval_data`.
  - In the example above, the target task dataset is Algebra (e.g. from MMLU High School Math).
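As a concrete illustration of these terms, the Figure 1 setup could be written as a list of chat messages. This is purely illustrative (it is not necessarily the repo's internal representation), and the prompts below are made up:

```python
# Conversation history h = {(u_k, r_k)}_{k=1}^{L} with L = 2 teacher-forced
# turns of the in-context (history) task: sentiment prediction.
history = [
    {"role": "user", "content": "What is the sentiment of: 'A gorgeous, witty film.'?"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "What is the sentiment of: 'A dull, lifeless mess.'?"},
    {"role": "assistant", "content": "negative"},
]

# Task-switch: the final user prompt comes from the target task (algebra).
target_prompt = {"role": "user", "content": "Solve for x: 2x + 3 = 11"}

# Top box of Figure 1: the model is conditioned on the history plus the
# target prompt. Bottom box (zero-shot, L = 0): the target prompt alone.
with_history = history + [target_prompt]
zero_shot = [target_prompt]
```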
After running experiments (or using our results), you can reproduce the plots shown in this markdown file (or the paper) using the notebook provided in `results/plot_metrics.ipynb`. For multiple seeds, you may use the notebook `results/plot_seeds.ipynb`.
We calculate the percentage change in performance relative to zero-shot using the function `df_metric_pct_change()`:
Figure 2: Target Task: MMLU Abstract Algebra (multiple choice questions). Percentage change in accuracy relative to zero-shot (higher means better performance).
As expected, when the conversation history task is MMLU Abstract Algebra, most models perform well.
However, when the conversation history task is different, some models perform worse than zero-shot, suggesting that those models are sensitive to the task-switch.
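The percentage-change computation itself is straightforward. Below is a minimal sketch of the idea with made-up numbers; the actual `df_metric_pct_change()` in `results/plot_metrics.ipynb` may organise the dataframe differently:

```python
import pandas as pd

# Accuracy on the target task for several conversation-history tasks.
# The "zero-shot" row is the L = 0 baseline; all numbers are made up.
acc = pd.DataFrame(
    {"llama-7b": [0.30, 0.32, 0.24], "mistral-7b": [0.33, 0.36, 0.29]},
    index=["zero-shot", "mmluaa", "rotten_tomatoes"],
)

# Percentage change in accuracy relative to the zero-shot row
# (positive = better than zero-shot, negative = worse).
pct_change = 100 * (acc - acc.loc["zero-shot"]) / acc.loc["zero-shot"]
print(pct_change.round(1))
```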
We calculate the sensitivity in performance relative to zero-shot using the function `expectation_metrics()` (in `results/plot_metrics.ipynb`):
Figure 11a: Sensitivity of models for the Target Task: Rotten Tomatoes (sentiment classification)
As expected, when the conversation history task remains rotten tomatoes, all models perform well.
However, when the conversation history task is different, some models perform worse than zero-shot, suggesting that they are sensitive to this task-switch.
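For intuition only, one simple way to quantify this kind of sensitivity is the expected shift in the model's log-likelihoods once a conversation history is prepended, relative to the zero-shot run. The sketch below is an assumption made for illustration; the actual metric is defined by `expectation_metrics()` in `results/plot_metrics.ipynb`:

```python
import numpy as np

# Per-example log-likelihoods on the target task (made-up numbers):
# one run with --num_examples 0 (zero-shot) and one with a conversation
# history (e.g. --num_examples 3), over the same --eval_size examples.
logp_zero_shot = np.array([-1.2, -0.8, -2.1, -0.5])
logp_with_history = np.array([-1.9, -0.7, -3.4, -0.6])

# Illustrative sensitivity: expected absolute shift caused by the history.
# NOTE: this is not necessarily the paper's definition; see expectation_metrics().
sensitivity = float(np.mean(np.abs(logp_with_history - logp_zero_shot)))
print(f"sensitivity: {sensitivity:.3f}")
```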
Install the relevant conda dependencies from `environment.yaml` and the Python packages using `pyproject.toml`:
```
conda env create -f environment.yaml
pip install .
```
Note: you may need to log in to Hugging Face to use the models. Use `huggingface-cli login` to log in.
There are a few entry points depending on the type of experiment you would like to run.
To measure the model performance on task-switches, use the following scripts:
- `main.py`: conversations with teacher-forced responses for the history task - this represents the outcome of a theoretically "perfect" model
- `conversational.py`: conversations where the model generates its own answers to the history task (i.e. without teacher-forcing)
- `random_conversation.py`: conversations where the history task is randomly generated by the model itself
To measure the model sensitivity to task-switches, use `likelihoods.py`.
- All args that can be specified can be found in `src/tools/args.py`
- Args are also documented below and in the entry point files
- Models can be found in `src/inference/models.py` (see the models section for more details)
- Datasets can be found in `src/data/dataloader.py` (see the datasets section for more details)
- Results are plotted using the notebooks in `results/`, such as `results/plot_metrics.ipynb`
For ease of reproducibility, we provide our results in `experiments/`. These are tracked using git-lfs.
We recommend moving these to a separate folder before running your own experiments.
When running your experiments, results are saved in `./experiments/<model>/eval_data_<dataset>/incontext_data_<dataset>/num_examples_<int>/iterative/`. See `src/tools/saving.py` for further details.
(For `likelihoods.py` and `random_conversation.py`, we don't split the results by `num_examples_<int>` - please open an issue if you would like clarification.)
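For convenience, here is a small sketch of how the results tree from `main.py` could be traversed. The directory pattern comes from above; the exact file names written inside each leaf folder are defined in `src/tools/saving.py`, so adjust accordingly:

```python
from pathlib import Path

# Walk ./experiments/<model>/eval_data_<dataset>/incontext_data_<dataset>/
#          num_examples_<int>/iterative/
root = Path("./experiments")
pattern = "*/eval_data_*/incontext_data_*/num_examples_*/iterative"
for run_dir in root.glob(pattern):
    model, eval_data, incontext_data, num_examples, _ = run_dir.relative_to(root).parts
    print(model, eval_data, incontext_data, num_examples, "->", list(run_dir.iterdir()))
```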
To evaluate the performance of a model on a task-switch, use the command:

```
python main.py \
    --num_examples <int> \
    --model_name <model> \
    --eval_data_name <dataset> \
    --incontext_data_name <dataset> \
    --iterative
```

- For zero-shot performance, use `--num_examples 0`
- To specify a dataset, see the datasets section
- To specify a model, see the models section
Optionally, the following args can be specified:

- `--eval_size <int|blank>`: sets the number of examples to use in the test set for evaluating performance (typically 1000)
- `--force_rerun`: forces re-running the experiment; otherwise the results will be loaded from the `./experiments` folder
- `--no_predict`: skips loading / running the model for evaluating performance. This is useful for debugging prompts
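For example, to evaluate llama-7b on the Figure 1-style task-switch (a 3-turn rotten_tomatoes conversation history followed by MMLU Abstract Algebra as the target task), a command along these lines should work (the flag values here are illustrative):

```
python main.py \
    --num_examples 3 \
    --model_name llama-7b \
    --eval_data_name mmluaa \
    --incontext_data_name rotten_tomatoes \
    --iterative \
    --eval_size 1000
```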
To evaluate the sensitivity of a model to a task-switch, use the command:

```
python likelihoods.py \
    --num_examples <int> \
    --model_name <model> \
    --eval_data_name <dataset> \
    --incontext_data_name <dataset> \
    --iterative \
    --likelihoods
```

NOTE: this will only work for models for which we have access to the logits (i.e. llama-7b and mistral-7b).
We recommend running with `--num_examples 0` first for the zero-shot likelihoods, and then running it for more examples (e.g. 3 or 6), because the sensitivity metrics are calculated relative to the `--num_examples 0` baseline.
Optionally, the following args can be specified:

- `--eval_size <int|blank>`: sets the number of examples to use when calculating the likelihoods (typically 10-100). NOTE: when running experiments for a specific combination of model-eval_data-incontext_data, the `--eval_size` must be kept the same, otherwise you will not be able to compare results across different values of `num_examples`.
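For example, to measure the sensitivity of llama-7b on the rotten_tomatoes target task with an abstract-algebra history, first collect the zero-shot baseline and then re-run with a non-zero history length, keeping `--eval_size` fixed (values here are illustrative):

```
python likelihoods.py \
    --num_examples 0 \
    --model_name llama-7b \
    --eval_data_name rotten_tomatoes \
    --incontext_data_name mmluaa \
    --iterative \
    --likelihoods \
    --eval_size 100

python likelihoods.py \
    --num_examples 3 \
    --model_name llama-7b \
    --eval_data_name rotten_tomatoes \
    --incontext_data_name mmluaa \
    --iterative \
    --likelihoods \
    --eval_size 100
```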
We support instruction-tuned models from Hugging Face and OpenAI. The `<model>` names and their details are shown in the table below:
| `<model>` | Details | Type |
|---|---|---|
| `"mistral-7b"` | `"mistralai/Mistral-7B-Instruct-v0.1"` | Hugging Face |
| `"llama-7b"` | `"meta-llama/Llama-2-7b-chat-hf"` | Hugging Face |
| `"gpt3.5"` | `"gpt-3.5-turbo"` | OpenAI |
| `"gpt4"` | `"gpt-4"` | OpenAI |
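The Hugging Face entries in the Details column are standard Hub model IDs. The repo wraps them in `src/inference/models.py`; the snippet below is only an illustration of what those IDs correspond to, not the repo's wrapper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Mapping from the <model> names above to their Hugging Face Hub IDs.
MODEL_IDS = {
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.1",
    "llama-7b": "meta-llama/Llama-2-7b-chat-hf",
}

model_id = MODEL_IDS["mistral-7b"]
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```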
To run GPT-3.5 / GPT-4, an OpenAI API key is required. Specify this in a `.env` file, for example:
```
# .env
OPENAI_API_KEY="mykey"
```

When running the scripts, datasets can be specified in the args:
- `--incontext_data_name <dataset>`: the dataset used for teacher-forced examples. In task-switching, this is the "conversation history".
- `--eval_data_name <dataset>`: the dataset to evaluate on. In task-switching, this is the "target task".
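For example, the Figure 1-style task-switch (a sentiment-prediction history with an algebra target task) corresponds to:

```
--incontext_data_name rotten_tomatoes --eval_data_name mmluaa
```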
The datasets that can be used are shown in the table below, alongside their source:
| `<dataset>` | Source URL |
|---|---|
| `"gigaword"` | Hugging Face |
| `"rotten_tomatoes"` | Hugging Face |
| `"tweetqa"` | Hugging Face |
| `"mmluaa"` | Hugging Face |
| `"mmlu-age"` | Hugging Face |
| `"dailymail"` | Hugging Face |
| `"gsm8k"` | Hugging Face |
| `"mmlu-moral"` | Hugging Face |
| `"mmlu-math"` | Hugging Face |
| `"mmlu-law"` | Hugging Face |
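All of these datasets are hosted on the Hugging Face Hub. The repo's own loading and preprocessing live in `src/data/dataloader.py`; the snippet below is only a minimal illustration of the underlying Hub dataset for the `"rotten_tomatoes"` option:

```python
from datasets import load_dataset

# Load the raw Hub dataset behind the "rotten_tomatoes" option; the repo's
# DataLoader applies its own prompt formatting on top of this.
rotten_tomatoes = load_dataset("rotten_tomatoes", split="test")
print(rotten_tomatoes[0])  # e.g. {'text': '...', 'label': 1}
```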
Dailymail examples have a large number of tokens, which is a problem when evaluating with llama-7b as it has a max token length of 4096. In `./src/data/dataloader.py::DataLoader::remove_large_dataset_examples()`, we limit the size of each user-system conversation in the conversation history to be less than 1792 tokens, so that the conversation history fits within the model's context window.
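A simplified sketch of this kind of length filtering is shown below, assuming token counts are taken from the llama-7b tokenizer; the real logic lives in `./src/data/dataloader.py::DataLoader::remove_large_dataset_examples()` and may count tokens differently:

```python
from transformers import AutoTokenizer

MAX_TURN_TOKENS = 1792  # budget per user-system pair, within llama-7b's 4096-token context

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def keep_example(user_text: str, system_text: str) -> bool:
    """Keep an in-context example only if the whole turn fits the token budget."""
    n_tokens = len(tokenizer.encode(user_text + " " + system_text))
    return n_tokens < MAX_TURN_TOKENS
```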
We noticed that our results were not consistent over time, because the models were updated fairly frequently. Be warned that our results may not match the current state of the models.
If you use Task-Switch, or scripts provided in this repository (e.g., evaluation scripts) in your work, please cite the following paper:
```
@article{taskswitch2024,
    title={LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History},
    author={Gupta, Akash and Sheth, Ivaxi and Raina, Vyas and Gales, Mark and Fritz, Mario},
    journal={arXiv preprint arXiv:2402.18216},
    year={2024}
}
```

