This repo includes the implementation of our paper RAG-LER: Ranking Adapted Generation with Language-Model Enabled Regulation.
We introduce RAG-LER, a novel framework that enhances an LM’s context understanding and improves the quality and accuracy of provided passages through an LM-supervised re-ranker. RAG-LER fine-tunes a pre-trained LM to follow instructions and discriminately use provided information. It then leverages this fine-tuned LM to generate ranking scores, which serve as supervised labels for training the re-ranker.
- We released the weights of our trained Mistral-7B model on HuggingFace to encourage and advance further research.
- We have updated our re-ranker training method, which now incorporates a reference model during training.
- We migrated our experiment tracking from mlflow to wandb. Both are excellent tools for experiment tracking.
- We addressed some dependency issues, mainly in retrieval.
Installation can be done by running the following commands:
# Clone the repo
git clone https://github.com/notoookay/rag-ler.git
cd rag-ler
source setup.sh
We use wandb for our experiment recording.
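If you have not set up wandb yet, log in once so the training scripts can record runs (this assumes you already have a wandb account and API key):
import wandb

# Reads the API key interactively or from the WANDB_API_KEY environment variable
wandb.login()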
Our training data can be downloaded from HuggingFace; the datasets are located in the data directory.
The training data includes datasets for training both the LLM and the re-ranker. See our paper for details of the data processing.
The LM training data can be found in llm_train.jsonl. We include a set of instruction-tuning datasets and open-domain QA datasets to improve instruction-following and reading-comprehension capabilities.
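The file is standard JSON Lines; below is a minimal sketch for inspecting it with the datasets library (the data/llm_train.jsonl path is an assumption, adjust it to where you placed the file):
from datasets import load_dataset

# Load the LM training data and inspect its fields
ds = load_dataset("json", data_files="data/llm_train.jsonl", split="train")
print(len(ds), ds.column_names)
print(ds[0])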
To fine-tune an LLM under the same configuration described in the paper, you can directly run the fine-tuning script.
bash ./scripts/finetune_llm.sh
Note: please check and modify the data path in the script (the same applies to the scripts below).
Feel free to modify the settings in this script for custom experiments.
Our 7B and 13B models are available on HuggingFace in case you want to use them directly.
For a quick start, you can use the trained LM:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("notoookay/ragler-llama2-7b")
model = AutoModelForCausalLM.from_pretrained("notoookay/ragler-llama2-7b", torch_dtype=torch.bfloat16, device_map="auto")
# Example usage
input_text = "### Instruction:\nAnswer the following question.\n\n### Input:\nQuestion:\nWhat is the capital of France?\n\n### Response:\n"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # move inputs to the model's device
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
The re-ranker training data can be found in reranker_train.jsonl. It includes Natural Questions and HotpotQA for single-hop and multi-hop question answering, along with passages retrieved from the Dec 2018 Wikipedia dump using Contriever MS-MARCO. In case you need to run retrieval yourself, you can download the corpus from Contriever by running:
# Download Dec 2018 corpus
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
# Download corresponding embeddings
wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
As re-ranker training is supervised by the LM, we need to obtain training labels (probabilities) from the LM. You can get the labels by running:
bash ./scripts/prepare_reranker_train_data.sh
In addition to the Llama2 family models, we also trained a Mistral-7B model, which uses less memory (about 40GB), so you may reduce inference cost by testing with the Mistral model.
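For intuition, here is a minimal, hypothetical sketch of the idea behind label preparation: score each retrieved passage by the fine-tuned LM's likelihood of the gold answer when conditioned on that passage, then normalize the scores into soft labels for the re-ranker. The prompt layout and scoring below are illustrative assumptions; see the paper and prepare_reranker_train_data.sh for the actual procedure.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("notoookay/ragler-llama2-7b")
model = AutoModelForCausalLM.from_pretrained("notoookay/ragler-llama2-7b", torch_dtype=torch.bfloat16, device_map="auto")

def answer_logprob(passage, question, answer):
    # Hypothetical prompt layout; the real template is defined in the training scripts
    prompt = (f"### Instruction:\nAnswer the following question.\n\n"
              f"### Input:\nPassage:\n{passage}\n\nQuestion:\n{question}\n\n### Response:\n")
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Sum the log-probabilities of the answer tokens given the passage-conditioned prompt
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1:-1].float(), dim=-1)
    return log_probs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

passages = ["Paris is the capital and largest city of France.",
            "Lyon is a major city in France."]
scores = [answer_logprob(p, "What is the capital of France?", "Paris") for p in passages]
labels = torch.softmax(torch.tensor(scores), dim=0)  # soft labels for re-ranker training
print(labels)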
We run inference with our fine-tuned LLMs using the same configuration as in the script above. After you obtain the labels, the re-ranker can be trained by running:
bash ./scripts/finetune_reranker.sh
By default, we use Contriever-MS MARCO as our retriever. We use the Dec 2018 wiki dump mentioned above for our evaluation, and the Dec 2020 dump for PopQA. For corpus downloading, please refer to the Atlas corpus download guide.
You can retrieve with a sparse retriever (e.g., BM25). We use pyserini for our BM25 retrieval.
Before retrieval, you need to build a BM25 index for the corpus; please check this for instructions on building the index. After building the index, you can retrieve by running:
bash ./scripts/passage_retrieval_bm25.sh
We split the sparse and dense retrieval for clarity.
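For reference, here is a minimal sketch of BM25 retrieval with pyserini (assuming a recent pyserini version with the Lucene backend; the index path is a placeholder for the index you built):
from pyserini.search.lucene import LuceneSearcher

# Placeholder path: point this at the BM25 index you built for the corpus
searcher = LuceneSearcher("indexes/enwiki-dec2018-bm25")
hits = searcher.search("What is the capital of France?", k=10)
for hit in hits:
    print(hit.docid, round(hit.score, 2))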
For dense retrieval (e.g., Contriever), you need to generate embeddings for both your input data and the corpus. To build the dense embeddings, you can run:
python retrieval/generate_passage_embeddings.py \
--model_name_or_path facebook/contriever-msmarco \
--output_dir embeddings/enwiki-dec2021 \
--passages corpora/wiki/enwiki-dec2021/text-list-100-sec.jsonl \
--shard_id 0 --num_shards 1
# Or using script directly
bash ./scripts/generate_passage_embeddings.sh
After generating the embeddings, you can retrieve by running:
bash ./scripts/passage_retrieval.sh
We use FAISS for similarity search over the dense vectors. We recommend faiss-gpu for fast search, which costs about 110GB of GPU memory for the Dec 2020 wikidump.
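For reference, a minimal sketch of the dense search that the script performs: encode the query with Contriever (mean pooling over token embeddings) and run inner-product search with FAISS. The random passage_embeddings array below is a stand-in for the embeddings you generated above; loading the actual shards is omitted.
import faiss
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
encoder = AutoModel.from_pretrained("facebook/contriever-msmarco")

def embed(texts):
    # Contriever uses mean pooling over token embeddings
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy().astype("float32")

# Stand-in for embeddings loaded from the generated shards (768-dim for Contriever)
passage_embeddings = np.random.rand(1000, 768).astype("float32")

index = faiss.IndexFlatIP(passage_embeddings.shape[1])  # inner-product (MIPS) index
index.add(passage_embeddings)

query_emb = embed(["What is the capital of France?"])
scores, ids = index.search(query_emb, 5)  # top-5 passage indices and scores
print(ids[0], scores[0])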
We evaluate on a set of knowledge-intensive tasks including open-domain QA and fact checking.
In addition to knowledge-intensive tasks, we also include several commonsense-reasoning evaluations, which typically do not need retrieval.
You can evaluate by running:
bash ./scripts/run_llm.sh
Note: remember to modify the arguments as needed; for more details on the arguments, please refer to our paper.
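As a reference point, the standard exact-match metric for open-domain QA looks like the sketch below (a generic implementation, not necessarily the exact scoring used in run_llm.sh):
import re
import string

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style normalization)
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

print(exact_match("Paris.", ["Paris", "Paris, France"]))  # 1.0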
If you find our work helpful, please consider citing our paper:
@article{ZHAI2025131514,
title = {RAG-LER: Ranking adapted generation with language-model enabled regulation},
journal = {Neurocomputing},
volume = {656},
pages = {131514},
year = {2025},
issn = {0925-2312},
doi = {https://doi.org/10.1016/j.neucom.2025.131514},
url = {https://www.sciencedirect.com/science/article/pii/S0925231225021861},
author = {Fengwen Zhai and Wenyang Tang and Jing Jin},
keywords = {Language modeling, Retrieval augmented generation, Information retrieval, Re-ranking}
}
We welcome contributions from the community! Whether it's fixing bugs, adding new features, improving documentation, or providing feedback, your help is invaluable.
