This repo benchmarks LLMs' ability to extract small bits of information from a long context. We adapted the benchmark from Greg Kamradt's original Needle in a Haystack benchmark to our preferences.
Original benchmark | Original tweet
Aside from the benchmarking code, we also provide tooling to create a dataset for fine-tuning on the task at hand.
In the original "Needle in a Haystack" benchmark, we extract a small bit of information, called "the needle", from a large context. The large context, called "the haystack", is a concatenation of essays by Paul Graham. The following text (the needle) is inserted into these essays at varying positions: "The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day." Note that the information in the essays and the needle are not closely related, which likely makes it easier for a model to single out the needle in the haystack. The model is then given the haystack with the needle and asked "What is the best thing to do in San Francisco?", with the instruction not to "give information outside the document or repeat your findings". Note that eating a sandwich and sitting in Dolores Park on a sunny day is, based on general knowledge outside the context, understood to be a good answer to the posed question. We therefore expect models trained on large chunks of publicly available data to be preconditioned to output this information.
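As a rough illustration, inserting the needle at a given relative depth into the concatenated essays might look like the sketch below. This is not the original benchmark's code; the function name, parameters, and truncation logic are our own assumptions.

```python
# Minimal sketch of the original setup; splitting and truncation are
# illustrative assumptions, not the actual benchmark implementation.
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def build_prompt(essays: list[str], depth: float, context_chars: int) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    haystack = " ".join(essays)[:context_chars]
    cut = int(len(haystack) * depth)
    return f"{haystack[:cut]} {NEEDLE} {haystack[cut:]}\n\n{QUESTION}"
```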
To facilitate better understanding of our code, we link important lines here. This benchmark makes the following changes:
- Based on the wiki_bio dataset.
- We randomize all names so that models cannot rely on previously learned information. (link)
- While the needle is a random biography, the haystack is a concatenation of equally random biographies. (link)
- The intuition behind this design choice is that the information in the haystack should be similar to the needle, so the benchmark gives us a better sense of how well-suited the model is to indexing large chunks of similar data.
- The model must extract multiple small bits of information: (link)
- Date of birth
- Date of death
- Nationality
- Whether or not the person in question is/was a sportsperson
- Whether or not the person in question is/was a politician
- We structure the model's outputs as JSON or dictionaries (link); see the example after this list.
- This breaks down the complexity of evaluation and makes it more reliable.
- It also reduces the cost of the benchmark.
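As a rough illustration, the structured answer the model is asked to produce for the needle biography could look like the following. The field names and values here are assumptions on our part, not the exact schema used by the scripts.

```python
# Illustrative only: field names and values are assumptions, not the exact
# schema produced by the benchmark scripts.
expected_answer = {
    "date_of_birth": "12 March 1956",
    "date_of_death": None,          # None if the person is still alive
    "nationality": "French",
    "sportsperson": False,
    "politician": True,
}
```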
Before running anything, note that the provided code is neither production grade nor a general tool. You will need to understand and modify it if you want to do anything other than reproduce our results.
We use the following workflow:
- Create synthetic datasets with the tools under `dataset/`:
  - Run `dataset/clean_biographies_ds.py` to download the biographies dataset and clean it.
  - Run `dataset/create_fine_tuning_ds.py` to create fine-tuning datasets. Take a good look at the script beforehand and fit it to your system if needed. (This can query Anyscale Endpoints to create labels for the dataset.)
- Maybe fine-tune with your tool of choice. The datasets are in an OpenAI/Anyscale-compatible format.
- Fit `plot_aggregated.py` and `plot_haystack.py` to whatever models you are benchmarking.
- Benchmark and plot with `bio_haystack_benchmark.py`.
  - This requires you to set your `AE_API_KEY` and `OPENAI_API_KEY` as environment variables (see the sketch after this list). Comment out the relevant lines if needed.
- After benchmarking some models, use `plot_aggregated.py` to plot an overview.
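A minimal pre-flight check for those environment variables could look like the snippet below. This is purely illustrative; the benchmark script reads the keys however it is written to, and this sketch only verifies that they are set.

```python
import os

# Illustrative check: bio_haystack_benchmark.py expects these keys to be
# available as environment variables before it is run.
for key in ("AE_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"Set {key} before running bio_haystack_benchmark.py")
```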