This repository is the official implementation of KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs.
We propose KG-Infused RAG, a framework that integrates pre-existing, large-scale KGs to enhance retrieval and generation. At its core, KG-Infused RAG employs 🧠spreading activation, a concept from cognitive psychology in which activation propagates from a central concept to related ones in a semantic network.
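To give intuition for the spreading-activation idea, here is a minimal, generic sketch (not the repo's implementation): activation starts at seed entities and propagates to neighbors with per-hop attenuation, keeping only nodes above a threshold. The toy graph, decay factor, and threshold are all illustrative assumptions.

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_hops=2):
    """Propagate activation from seed entities outward, attenuating by `decay` per hop."""
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, act in frontier.items():
            for neighbor in graph.get(node, []):
                passed = act * decay
                if passed >= threshold:  # prune weakly activated nodes
                    next_frontier[neighbor] = max(next_frontier[neighbor], passed)
        for node, act in next_frontier.items():
            activation[node] = max(activation.get(node, 0.0), act)
        frontier = next_frontier
    return activation

# Toy KG as adjacency lists (illustrative entities, not drawn from Wikidata5M-KG).
kg = {
    "Marie Curie": ["radioactivity", "Pierre Curie"],
    "radioactivity": ["uranium"],
    "Pierre Curie": ["piezoelectricity"],
}
print(spread_activation(kg, ["Marie Curie"]))
```

Two hops from "Marie Curie" activate its neighbors at 0.5 and their neighbors at 0.25, mirroring how activation weakens with semantic distance.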
The framework consists of three modules that together enable interpretable, fact-grounded multi-source retrieval: 1) KG-Guided Spreading Activation, 2) KG-Based Query Expansion, and 3) KG-Augmented Answer Generation.

**Step 1: Clone this repo**

```bash
git clone git@github.com:thunlp/KG-Infused-RAG.git
cd KG-Infused-RAG
```

**Step 2: Create environment and install dependencies**
```bash
conda create -n kg-infused-rag python=3.10
conda activate kg-infused-rag
pip install -r requirements.txt
pip install -e .
```

All evaluation and training data can be downloaded here. Place the data under the `./data/datasets` directory. The data is organized into two parts:
- Evaluation Data: includes the original test set and the corresponding initial retrieval results from both the corpus and the knowledge graph.
- Training Data: contains the original sampling outputs on the training set, as well as the constructed DPO training data derived from them.
Download the corpus, unzip the file, and place the extracted data under the `./data/corpus` directory:
```bash
mkdir -p data/corpus
cd data/corpus
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gunzip psgs_w100.tsv.gz
```

The knowledge graph used in our experiments is available via both 🤗 Hugging Face and ModelScope. Download the KG (Wikidata5M-KG), unzip the file, and place the extracted data under the `./data/KG` directory:
```bash
mkdir -p data/KG
tar -xzvf wikidata5m_kg.tar.gz -C data/KG
```

Generate embeddings for the corpus passages and for the entity descriptions from Wikidata5M-KG:

```bash
bash ./scripts/generate_embeddings_corpus.sh
bash ./scripts/generate_embeddings_kg.sh
```

- Retriever: Contriever-MS MARCO
- Generator: models from the Qwen2.5 series, the LLaMA3.1 series, and the GPT family, including GPT-4o-mini and GPT-4o.
You can pre-retrieve top-k passages from the corpus and relevant entities from the KG for each input question to save time during the main pipeline:
```bash
bash ./scripts/retrieval.sh
```

💡 Precomputed retrieval results are available here (see Evaluation Data in Datasets).
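Conceptually, pre-retrieval scores every passage embedding against the query embedding and keeps the top-k. Below is a generic NumPy sketch of that inner-product ranking (Contriever-style scoring); the actual scripts build the embeddings with the retriever above, and the array shapes here are toy values.

```python
import numpy as np

def topk_inner_product(query_emb, passage_embs, k=3):
    """Rank passages by inner product with the query embedding; return top-k indices and scores."""
    scores = passage_embs @ query_emb          # (num_passages,)
    topk = np.argsort(-scores)[:k]             # indices of the k highest scores
    return topk, scores[topk]

# Toy corpus: 100 random "passage embeddings" of dimension 64 (illustrative only).
rng = np.random.default_rng(0)
passages = rng.normal(size=(100, 64)).astype(np.float32)
query = passages[42]                           # a query matching passage 42 exactly
idx, scores = topk_inner_product(query, passages, k=3)
print(idx[0])  # passage 42 ranks first
```

In practice the same ranking is typically done with an approximate or flat index (e.g., FAISS) rather than a dense matrix product, but the scoring rule is identical.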
```bash
bash ./scripts/kg_infused_rag.sh
```

a) Collect sampling data: run the main pipeline multiple times while varying generation parameters (e.g., `temperature`, `top_p`) for the target generation stage to obtain diverse responses. Save the raw outputs under a folder such as `./data/sampling_results_for_dpo_data_construction/{stage}/`.
b) Construct DPO pairs: use the script below to turn the sampling data into DPO-formatted training examples:
```bash
cd train
python generate_dpo_data.py
```

We provide example raw sampling data and the constructed DPO pairs here.
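The pairing step can be sketched generically: group sampled responses by prompt, then pair the best-scoring response (chosen) against the worst (rejected). This is a hypothetical illustration of the DPO data format, not the logic of `generate_dpo_data.py`; the field names `prompt`, `response`, and `score` are assumptions.

```python
import json

def build_dpo_pairs(samples, prompt_key="prompt", score_key="score"):
    """Group sampled responses by prompt and emit (chosen, rejected) preference pairs."""
    by_prompt = {}
    for s in samples:
        by_prompt.setdefault(s[prompt_key], []).append(s)
    pairs = []
    for prompt, group in by_prompt.items():
        if len(group) < 2:
            continue  # need at least two responses to form a preference pair
        group.sort(key=lambda s: s[score_key], reverse=True)
        if group[0][score_key] > group[-1][score_key]:  # skip ties
            pairs.append({
                "prompt": prompt,
                "chosen": group[0]["response"],
                "rejected": group[-1]["response"],
            })
    return pairs

# Toy sampling outputs (illustrative scores, e.g., answer accuracy on the training set).
samples = [
    {"prompt": "Q1", "response": "good answer", "score": 1.0},
    {"prompt": "Q1", "response": "bad answer", "score": 0.0},
]
print(json.dumps(build_dpo_pairs(samples), indent=2))
```

Keeping only prompts where the sampled responses actually differ in quality avoids degenerate pairs that teach the model nothing.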
```bash
bash train.sh
```

If you find our code, data, models, or the paper useful, please cite the paper:
```bibtex
@article{wu2025kg,
  title={KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs},
  author={Wu, Dingjun and Yan, Yukun and Liu, Zhenghao and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2506.09542},
  year={2025}
}
```
