[EMNLP2024] LongRAG: A Dual-perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
LongRAG is a general, dual-perspective, and robust LLM-based RAG paradigm for long-context question answering (LCQA). It enhances RAG's understanding of complex long-context knowledge, i.e., both global information and factual details.
Install the requirements with pip:
pip install -r requirements.txt
We recommend using FlashAttention 2 to speed things up and save GPU memory. The relevant dependencies can be installed according to the FlashAttention code base.
Our raw training data comes from HotpotQA, 2WikiMultihopQA, MuSiQue, and Qasper. The evaluation data and the corresponding raw retrieval corpora are sourced from LongBench.
We have standardized the data format for the aforementioned datasets. You can download our standardized raw datasets by running the following command:
bash download/raw_data.sh
The data will be downloaded into the data/ directory.
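If you want to sanity-check the download, a minimal sketch like the following lists the standardized JSON files and peeks at one record. The glob pattern and record schema are assumptions, not the repository's documented format; adjust them to whatever raw_data.sh actually fetches.

```python
import json
from pathlib import Path

# Walk data/ and report what raw_data.sh downloaded.
# NOTE: file layout and record schema are assumptions; adjust the
# glob and parsing to the actual downloaded files.
for path in sorted(Path("data").rglob("*.json")):
    with path.open() as f:
        records = json.load(f)
    print(f"{path}: {len(records)} records")
    if isinstance(records, list) and records:
        print("  example keys:", sorted(records[0].keys()))
```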
Build the LRGinstruction dataset for SFT:
cd src
python gen_instruction.py --per_task_num 200 --min_res_tokens 20 --long_ratio 0.2
The processed data will be saved in data/train/processed.
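For intuition, here is a minimal sketch of how the three flags above could interact for a single source task. This is not gen_instruction.py's actual logic, just a reading of the flag names; the "output" and "is_long" fields are assumptions.

```python
import random

def sample_task_instructions(examples, per_task_num=200,
                             min_res_tokens=20, long_ratio=0.2):
    """Illustrative sampling for one task: drop short responses, then
    draw a quota with a long_ratio fraction of long-context examples."""
    # Drop examples whose responses are shorter than min_res_tokens.
    kept = [ex for ex in examples if len(ex["output"].split()) >= min_res_tokens]
    long_ex = [ex for ex in kept if ex.get("is_long")]
    short_ex = [ex for ex in kept if not ex.get("is_long")]
    # Fill a long_ratio fraction of the per-task quota with long-context examples.
    n_long = min(int(per_task_num * long_ratio), len(long_ex))
    picked = random.sample(long_ex, n_long)
    picked += random.sample(short_ex, min(per_task_num - n_long, len(short_ex)))
    return picked
```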
Build an index for retrieval and save the mapping relationship between chunks and the original text:
cd src
python gen_index.py --dataset hotpotqa --chunk_size 200 --min_sentence 2 --overlap 2
The processed data will be saved in data/corpus/processed.
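The flags above suggest sentence-level chunking with overlap; the sketch below illustrates one way to realize it while keeping the chunk-to-source mapping. The sentence splitter and mapping format are placeholders, and the real gen_index.py may differ.

```python
import re

def chunk_document(doc_id, text, chunk_size=200, min_sentence=2, overlap=2):
    """Illustrative sentence-based chunking that records a mapping from
    each chunk back to its source document and sentence span."""
    # Naive sentence split; the repository may use a proper tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, mapping, start = [], {}, 0
    while start < len(sentences):
        end, n_words = start, 0
        # Grow the chunk to ~chunk_size words and at least min_sentence sentences.
        while end < len(sentences) and (n_words < chunk_size or end - start < min_sentence):
            n_words += len(sentences[end].split())
            end += 1
        mapping[len(chunks)] = {"doc_id": doc_id, "sent_span": (start, end)}
        chunks.append(" ".join(sentences[start:end]))
        if end == len(sentences):
            break
        # Slide forward, sharing `overlap` sentences between neighboring chunks.
        start = max(start + 1, end - overlap)
    return chunks, mapping
```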
First, download LLaMA-Factory into this project. Then put our constructed instruction data into LLaMA-Factory/data and add the following entry to dataset_info.json:
"LRGinstruction": {
"file_name": "LRGinstruction.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output"
}
}
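Equivalently, you can register the entry programmatically. This small helper is just a convenience sketch and assumes the LLaMA-Factory checkout sits at the project root:

```python
import json

info_path = "LLaMA-Factory/data/dataset_info.json"
with open(info_path) as f:
    info = json.load(f)

# Add (or overwrite) the LRGinstruction entry shown above.
info["LRGinstruction"] = {
    "file_name": "LRGinstruction.json",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

with open(info_path, "w") as f:
    json.dump(info, f, indent=2, ensure_ascii=False)
```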
Then run the following script to start fine-tuning:
cd scripts
bash sft.sh $model_name_or_path $template $cutoff_len
Here, model_name_or_path should correspond to the chosen template, and cutoff_len is the truncation length.
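For example, fine-tuning ChatGLM3-6B-32k might look like the line below; the template name follows LLaMA-Factory's naming conventions, and the cutoff length is illustrative:
bash sft.sh THUDM/chatglm3-6b-32k chatglm3 32768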
Here are some example scripts for performing inference and evaluation on HotpotQA. To get started, first navigate to the src directory.
We provide examples of inference using the ChatGLM3-6B-32k model.
LongRAG-ChatGLM3-6B-32k (without SFT):
CUDA_VISIBLE_DEVICES=0 python main.py --dataset hotpotqa --model chatGLM3-6b-32k --rb --rl --ext --fil --ext_fil
LongRAG-ChatGLM3-6B-32k (with SFT):
CUDA_VISIBLE_DEVICES=0 python main.py --dataset hotpotqa --model LongRAG-chatglm3-32k --rb --rl --ext --fil --ext_fil
Using only the Extractor, with GPT-3.5-turbo as the generator and LongRAG-chatglm3-32k as the Extractor:
CUDA_VISIBLE_DEVICES=0,1 python main.py --dataset hotpotqa --model gpt-3.5-turbo --lrag_model LongRAG-chatglm3-32k --ext
Using only the Filter, with GPT-3.5-turbo as the generator and LongRAG-chatglm3-32k as the Filter:
CUDA_VISIBLE_DEVICES=0,1 python main.py --dataset hotpotqa --model gpt-3.5-turbo --lrag_model LongRAG-chatglm3-32k --fil
Using both the Extractor & Filter, with GPT-3.5-turbo as the generator and LongRAG-chatglm3-32k as both the Extractor & Filter:
CUDA_VISIBLE_DEVICES=0,1 python main.py --dataset hotpotqa --model gpt-3.5-turbo --lrag_model LongRAG-chatglm3-32k --ext_fil
Note: the parameters --rb, --rl, --ext, --fil, and --ext_fil run RAG-Base, RAG-Long, the Extractor, the Filter, and the Extractor & Filter, respectively. These parameters can be combined arbitrarily.
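For intuition, the sketch below shows how the Extractor, Filter, and generator could compose under --ext_fil. The callables and prompt format are placeholders, not the repository's actual interfaces:

```python
def long_rag_answer(question, chunks, extractor, filterer, generator):
    """Illustrative composition of the dual-perspective pipeline (--ext_fil)."""
    # Extractor: read the retrieved chunks as one long context and pull
    # out global information relevant to the question.
    global_info = extractor(question, " ".join(chunks))
    # Filter: keep only the chunks whose factual details bear on the question.
    kept = [c for c in chunks if filterer(question, c)]
    # Generator: answer using both perspectives.
    prompt = (f"Global information: {global_info}\n"
              f"Factual details: {' '.join(kept)}\n"
              f"Question: {question}\nAnswer:")
    return generator(prompt)
```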
Evaluation results will be saved in the log directory.
Below are partial experimental results, showcasing the F1 scores on three multi-hop datasets from LongBench, using the LongRAG paradigm.
Note: following the LongBench settings, for text that exceeds the model's processing length, we truncate it from the middle and retain the beginning and end of the text (a minimal sketch of this truncation follows the results table).
| Model | HotpotQA | 2WikiMultiHopQA | MuSiQue | Avg |
| --- | --- | --- | --- | --- |
| LongRAG-Qwen-1.5-7B-32k w/ SFT | 52.91 | 46.65 | 31.85 | 43.80 |
| LongRAG-Llama3-8B-8k w/ SFT | 52.39 | 49.67 | 31.70 | 44.59 |
| LongRAG-Vicuna-v1.5-7B-16k w/ SFT | 55.55 | 50.13 | 28.29 | 44.66 |
| LongRAG-ChatGLM3-6B-32k w/ SFT | 55.93 | 54.85 | 33.00 | 47.93 |
| LongRAG-GPT-3.5-Turbo w/o SFT | 56.17 | 51.37 | 32.83 | 46.79 |
| LongRAG-GPT-3.5-Turbo-16k w/o SFT | 59.11 | 51.25 | 30.37 | 46.91 |
| LongRAG-GLM-4 w/o SFT | 62.11 | 57.16 | 38.40 | 52.56 |
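The middle truncation mentioned in the note above can be sketched as follows, operating on an already-tokenized input:

```python
def truncate_middle(tokens, max_len):
    """Drop tokens from the middle so the beginning and end of the
    input survive, as in the LongBench truncation setting."""
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[len(tokens) - (max_len - half):]
```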
If you find our work useful, please consider citing LongRAG:
@article{zhao2024longrag,
  title={LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering},
  author={Qingfei Zhao and Ruobing Wang and Yukuo Cen and Daren Zha and Shicheng Tan and Yuxiao Dong and Jie Tang},
  journal={arXiv preprint arXiv:2410.18050},
  year={2024}
}