Official implementation of "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs", accepted at ACL 2024 Findings.
To evaluate not only the final answers but also the intermediate steps of LLMs' chain-of-thought (CoT) reasoning in multi-hop question answering, the paper proposes two evaluation modules:
- Discriminative: assesses whether LLMs possess the knowledge required for reasoning.
- Generative: assesses the accuracy of the generated CoT by utilizing knowledge graphs (KGs).
In addition, we run ablation studies on fine-grained CoT generation, computing edit-distance scores and reasoning errors.
```bash
conda create --name llm-reasoning-cert python=3.8
conda activate llm-reasoning-cert
pip install -r requirements.txt
```
The paper uses two datasets, CWQ and GrailQA, as the initial datasets for the experiments. We then extract subgraphs and ground-truth reasoning paths via SPARQL.
The final datasets used in the paper are uploaded to HuggingFace: (Note: update later)
Aim: create subgraphs for querying the ground-truth reasoning paths & building the VectorDB.
Code at `./preprocess_data`.
- Create the subgraph from the raw subgraph following the detailed instructions in the preprocessing readme.
- Get the ground-truth reasoning paths via the subgraph, answer entities, and topic entities:
```bash
python ./preprocess_data/ground_truth_paths.py
```
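Under the hood, this step amounts to recovering an entity chain over the extracted subgraph. A minimal sketch, assuming the subgraph is a list of `(head, relation, tail)` triples and using a shortest-path search with `networkx`; the function name and path-selection strategy are illustrative, not the repo's exact implementation:

```python
import networkx as nx

def ground_truth_path(triples, topic_entity, answer_entity):
    """Recover one ground-truth reasoning path as a chain of triples."""
    g = nx.DiGraph()
    for head, relation, tail in triples:
        g.add_edge(head, tail, relation=relation)
    # Shortest entity chain from the topic entity to the answer entity.
    nodes = nx.shortest_path(g, source=topic_entity, target=answer_entity)
    return [(u, g[u][v]["relation"], v) for u, v in zip(nodes, nodes[1:])]
```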
- Rearrange the questions according to the number of edges in their ground-truth reasoning paths:
```bash
python ./preprocess_data/splitted_ground_truth_paths.py
```
We only use questions whose ground-truth reasoning paths have >= 2 hops.
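A small sketch of this hop-based split, assuming each record stores its ground-truth path as a list of triples (the file and field names here are illustrative):

```python
import json
from collections import defaultdict

buckets = defaultdict(list)
with open("ground_truth_paths.jsonl") as f:  # hypothetical output of the step above
    for line in f:
        record = json.loads(line)
        hops = len(record["ground_truth_path"])  # one triple == one hop/edge
        if hops >= 2:  # keep multi-hop questions only
            buckets[hops].append(record)
```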
FAISS & sentence-transformers/all-mpnet-base-v2 are used to create the VectorDB before retrieving:
```bash
DATASET='cwq' # or 'grail_qa'
sbatch scripts/gen-cert/extract_triplet.sh $DATASET
```
You can set up additional arguments:
- `embed_model_name`: defaults to `sentence-transformers/all-mpnet-base-v2`
- `top_k`: defaults to `10`
- `device`: defaults to `cpu`
Note: remember to set them again in `./generative-cert.py#L228`.
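For reference, a minimal sketch of what the VectorDB step does, assuming the subgraph triples are flattened into strings such as `"head | relation | tail"`; the function names are illustrative, not the repo's API:

```python
import faiss
from sentence_transformers import SentenceTransformer

def build_triple_index(triples, embed_model_name="sentence-transformers/all-mpnet-base-v2", device="cpu"):
    """Embed every triple string and store the vectors in a FAISS index."""
    model = SentenceTransformer(embed_model_name, device=device)
    # Normalized embeddings make inner product equal to cosine similarity.
    embeddings = model.encode(triples, convert_to_numpy=True, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return model, index

def retrieve_triples(model, index, triples, query, top_k=10):
    """Return the top_k triples most similar to the query."""
    q = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    scores, ids = index.search(q, top_k)
    return [(triples[i], float(s)) for i, s in zip(ids[0], scores[0])]
```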
Download the data here.
- Negative generation modes (a conceptual sketch follows the commands below):
  - `replace`: replace the entities in the reasoning paths.
  - `reorder`: reorder the reasoning paths.
  - `mislead`: generate reasoning paths leading to incorrect answers.
- Code:
```bash
# 1. Generate subgraphs for the misleading paths
python preprocess_data/subgraph_discriminative_cert.py

# 2. Generate negative paths
## - CWQ dataset
python gen_negative.py --data_path data/cwq_test_res.csv --kg_path data/cwq_test.jsonl --mode {'mislead', 'reorder', 'replace'}
## - GrailQA dataset
python gen_negative.py --data_path data/multi_hop_grailqa.csv --kg_path data/grail_w_kg.jsonl --mode {'mislead', 'reorder', 'replace'}
```
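Conceptually, the two rule-based modes are simple perturbations of the gold path. A hedged sketch (`mislead` requires searching the KG for paths toward wrong answers and is omitted; the path format is an assumption):

```python
import random

def reorder_path(path):
    """reorder: shuffle the triples so the path no longer chains correctly."""
    shuffled = path[:]
    while len(path) > 1 and shuffled == path:
        random.shuffle(shuffled)
    return shuffled

def replace_entities(path, entity_pool):
    """replace: swap each tail entity for a random distractor from the KG."""
    return [(head, relation, random.choice([e for e in entity_pool if e != tail]))
            for head, relation, tail in path]

path = [("Barack Obama", "place_of_birth", "Honolulu"),
        ("Honolulu", "contained_by", "Hawaii")]
print(reorder_path(path))
print(replace_entities(path, ["Chicago", "Texas", "Kauai"]))
```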
Set your OpenAI API key & HuggingFace key (if needed) in `.env` (see `.env.example` for an example).
- Evaluation on the ground-truth reasoning paths:
```bash
sh scripts/disc-cert/submit_discriminative_cert.sh
```
- Evaluation on the generated negative reasoning paths:
```bash
sh scripts/disc-cert/submit_discriminative_cert_neg.sh
```
- Get the results:
```bash
python scripts/disc-cert/summary_results.py
```
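For intuition, the discriminative check boils down to asking the LLM whether a given reasoning path validly answers the question. A minimal sketch using the OpenAI chat API; the prompt wording and model choice are assumptions, not the repo's exact template:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_path(question, path):
    """Ask the LLM to classify a reasoning path as valid or invalid."""
    path_str = " -> ".join(f"({h}, {r}, {t})" for h, r, t in path)
    prompt = (f"Question: {question}\n"
              f"Reasoning path: {path_str}\n"
              "Is this reasoning path a valid way to answer the question? Answer Yes or No.")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; any chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```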
- ChatGPT:
```bash
sh scripts/gen-cert/llm_prompting.sh
```
- HF models (Llama2 7B/13B/70B chat-hf, Mistral-7B-Instruct-v0.1, Qwen-14B-Chat, Vicuna-33b-v1.3):
```bash
sh generative_cert/scripts/fitcluster/script.sh
```
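A hedged sketch of what the HF-model prompting step looks like; the prompt template below is an assumption, not the repo's exact one:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "What movies have the same director as The Dark Knight?"
prompt = (f"Answer the question step by step, listing each reasoning triple.\n"
          f"Question: {question}\nAnswer:")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```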
- Main result:
```bash
sh scripts/gen-cert/job_eval_llm.sh
```
- Fine-grained generative evaluation (edit-distance score; see the sketch after this list):
```bash
sh scripts/gen-cert/job_eval_llm_finegrained.sh
python finegrained_analysis.py
```
- Run the analysis for reasoning errors:
```bash
python finegrained_analysis.py
```
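The edit-distance score compares the predicted reasoning path against the ground-truth path step by step. A minimal sketch treating each triple as one token under standard Levenshtein distance (the repo's exact costs and normalization may differ):

```python
def edit_distance(pred, gold):
    """Levenshtein distance between two sequences of reasoning triples."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a predicted step
                           dp[i][j - 1] + 1,         # insert a missing step
                           dp[i - 1][j - 1] + cost)  # substitute a wrong step
    return dp[m][n]

gold = [("A", "r1", "B"), ("B", "r2", "C")]
pred = [("A", "r1", "B"), ("B", "r3", "D")]
print(edit_distance(pred, gold))  # 1
```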
If you find this paper or this repo useful for your work, please consider citing the paper:
```bibtex
@misc{nguyen2024direct,
      title={Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs},
      author={Minh-Vuong Nguyen and Linhao Luo and Fatemeh Shiri and Dinh Phung and Yuan-Fang Li and Thuy-Trang Vu and Gholamreza Haffari},
      year={2024},
      eprint={2402.11199},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```