This repo contains the official implementation for the paper Understanding and Patching Compositional Reasoning in LLMs (ACL'2024, Findings)
The environment YAML files are located in the `./environments` directory. Note that we provide two environments: one for the investigating experiments (inference, logit lens, causal intervention, and locating), and another for the patching experiments (CREME). A consolidated setup sketch follows this list.

- For investigating experiments: run `conda env create -n investigate -f environments/investigating/environment.yaml`, then activate the environment with `conda activate investigate` before running the investigating experiments.
- For patching experiments: run `conda env create -n patch -f environments/patching/environment.yaml`, then activate the environment with `conda activate patch` before running the patching experiments.
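A minimal end-to-end setup sketch (assuming conda is installed and the repository root is the working directory):

```bash
# Environment for the investigating experiments (inference, logit lens, intervention, locating)
conda env create -n investigate -f environments/investigating/environment.yaml
conda activate investigate

# Environment for the patching (CREME) experiments
conda env create -n patch -f environments/patching/environment.yaml
conda activate patch
```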
For the data, switch the working path into the `./data` directory and run `cd mquake`. A quick sanity-check sketch follows this list.

- The original MQuAKE-CF data (2-hop split): `MQuAKE-CF-3k.2hop.json`. This data file is directly used for the inference experiments (compositionality gap).
- For the inference experiments, to align the LLMs' output format with the answer space (i.e., to make the model directly output the answer rather than other common words such as "OK", "Cool", or a repetition of the question), we use few-shot prompts to instruct the models. The templates can be found in `./data/mquake/prompts`.
- For the inspecting, causality, and locating experiments: we use `comp_cloze_prefix.json` (or `comp_cloze_suffix.json`), where knowledge items are paraphrased in Cloze-test form (i.e., (subject, relation, object): "The creator of C. Auguste Dupin is __", awaiting completion), in line with previous works such as ROME, Memory Injections, and Dissecting Factual Recall.
- For the editing (patching) experiments: we construct `MQuAKE-CF-3k.2hop.edit.json`, where we sample a paraphrasing set, a generalization set, and an irrelevance set (please refer to the paper for a detailed introduction) for each testing case on the basis of `MQuAKE-CF-3k.2hop.json`. Note that in `MQuAKE-CF-3k.2hop.edit.json`, the irrelevant testing cases might be noisy (they may share the answer with the case to be patched). Hence, we re-sample irrelevant cases in `./creme/make_dataset/make_dataset_irrelevant.py`.
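A quick sanity check of the data directory (a sketch; it only assumes the files described above are present and does not rely on any particular JSON schema):

```bash
cd data/mquake
ls
# Expected files include: MQuAKE-CF-3k.2hop.json, MQuAKE-CF-3k.2hop.edit.json,
# comp_cloze_prefix.json, comp_cloze_suffix.json, and the prompts/ folder.

# Count the test cases in the 2-hop split
python -c "import json; d = json.load(open('MQuAKE-CF-3k.2hop.json')); print(len(d))"
```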
To run the inference experiments (compositionality gap, compositional reasoning errors), please change the working path into the inference directory (`cd inference/MQuAKE`). An end-to-end command sketch follows this list.

- To run inference for single-hop questions, run `python inference_single.py <model_name>`, where `<model_name>` can be `llama2-7b`, `llama2-13b`, or `openalpaca-3b`. After the inference program finishes, a result file (`<model_name>.json`) is automatically stored in the `inference/MQuAKE/single-hop` directory.
- To run inference for compositional two-hop questions, run `python inference_comp.py <model_name>`. After the inference program finishes, a result file (`<model_name>.json`) is automatically stored in the `inference/MQuAKE/compositional` directory.
- After fetching the inference results for both single-hop questions and compositional two-hop questions, run `python filter.py <model_name> <fix_type>` to classify the results into two categories: (1) `pass_all`, meaning the LLM correctly answers both single-hop questions and the corresponding compositional one; (2) `pass_singles_fail_comp`, meaning the LLM correctly answers both single-hop questions but fails to solve the compositional one (this reflects the compositionality gap and compositional reasoning errors). These two parts of the results are stored separately in two files in the `inference/MQuAKE/filter` directory.
- Notes: `<fix_type>` can be `prefix` or `suffix`, indicating the two different orders in which the two single-hop questions are composed; this is for future usage. Besides, each testing case contains three paraphrased compositional questions (sharing the same meaning) to test the model. Following the original MQuAKE paper, we regard the model as passing a test case as long as it correctly answers at least one of the three paraphrased questions.
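An end-to-end inference sketch (assuming the `investigate` environment is active and using `llama2-7b` with `prefix` as an example; swap in any supported model name or fix type):

```bash
cd inference/MQuAKE
python inference_single.py llama2-7b    # -> single-hop/llama2-7b.json
python inference_comp.py llama2-7b      # -> compositional/llama2-7b.json
python filter.py llama2-7b prefix       # -> filter/ (pass_all and pass_singles_fail_comp files)
```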
The following three parts (logit lens inspection, intervention experiments, and locating experiments) live in the `inspecting_and_intervention` directory, which is implemented largely on the basis of ROME's official implementation (this is an acknowledgement!).

- To run the logit lens examples, switch the working path with `cd inspecting_and_intervention` and run `python logit_lens.py`. Note that the testing example is hard-coded in the program (so you need to manually modify the program to test different cases); see the short sketch after this list. A successful run generates a logit lens curve figure in the `inspecting_and_intervention/logit_lens/results` directory.
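A minimal run sketch (the case to inspect is hard-coded inside `logit_lens.py`, so edit that file to change it):

```bash
cd inspecting_and_intervention
python logit_lens.py    # -> logit_lens/results/ (logit lens curve figure)
```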
To run the causal intervention experiments, first change the working path with `cd inspecting_and_intervention/causal_intervention`. A command sketch follows this list.

- First fetch the causal intervention data: run `python fetch.py <fix_type> <model_name>`, where `<fix_type>` can be `prefix` or `suffix`, and `<model_name>` can be `llama2-7b` or `openalpaca-3b`. This program fetches the intervention data and organizes it into a file `<model_name>.<fix_type>.json` in the current directory.
- To run the causal intervention experiment: run `python causality.py <model_name> <fix_type>`. This program generates a result file `<model_name>.<fix_type>.json` in the `results` directory.
- To aggregate the results (averaged over instances) and visualize them: first switch the working path into the `results` directory (`cd results`) and run `python aggregate_visualize.py <model_name> <fix_type>`. A successful run generates a heatmap figure in the same directory.
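A consolidated sketch for `llama2-7b` with `prefix` (note that `fetch.py` takes its arguments in the opposite order from the other scripts, as listed above):

```bash
cd inspecting_and_intervention/causal_intervention
python fetch.py prefix llama2-7b                   # -> ./llama2-7b.prefix.json
python causality.py llama2-7b prefix               # -> results/llama2-7b.prefix.json
cd results
python aggregate_visualize.py llama2-7b prefix     # -> heatmap figure in results/
```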
To run the locating experiments, first change the working path with `cd inspecting_and_intervention/locating`. A command sketch follows this list.

- To run the locating experiments, run `python locating.py <model_name> <fix_type>`. A successful run generates a result file `<model_name>.<fix_type>.json` in the `inspecting_and_intervention/locating/results` directory.
- To aggregate the results (averaged over instances) and visualize them: first switch the working path into the `results` directory (`cd results`) and run `python aggregate_visualize.py <model_name> <fix_type>`. A successful run generates a heatmap figure in the same directory.
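A consolidated sketch for `llama2-7b` with `prefix`:

```bash
cd inspecting_and_intervention/locating
python locating.py llama2-7b prefix                # -> results/llama2-7b.prefix.json
cd results
python aggregate_visualize.py llama2-7b prefix     # -> heatmap figure in results/
```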
To run the patching experiments, first change the working path with `cd creme`. This part of the code was built on the basis of FastEdit. An end-to-end command sketch follows this list.
- To prepare the editing data, first `cd make_dataset`.
  - For the correction, paraphrasing, and generalization testing, run `python make_dataset.py <model_name>` (`llama2-7b` or `openalpaca-3b`).
  - For the irrelevant testing, run `python make_dataset_irre.py <model_name>`.
  - After generating the editing data in the current directory, go back to the previous folder with `cd ..`.
- To get statistical results, first `cd fastedit_comp`. Run `bash test_batch.sh` to get the correction, paraphrasing, and generalization testing results. Run `bash test_batch_irre.sh` to get the irrelevant testing results.
  - The results can be found in `results/v0` (for the non-irrelevant testing) or `results/irrelevant` (for the irrelevant testing). Running `python results/aggregate.py <testing_type>` (where `<testing_type>` is `v0` or `irrelevant`) generates the averaged results.
- To test a single case:
  - Prepare the testing case in `creme/data`, following the format of `example.json` (the nationality / creator / C. Auguste Dupin case).
  - Switch the working path with `cd fastedit_comp` and then run the `test.sh` script. The output can be viewed in the `testing.txt` file.
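An end-to-end patching sketch (assuming the `patch` environment is active and using `llama2-7b` as an example; the exact relative paths are a best-effort reading of the steps above):

```bash
cd creme/make_dataset
python make_dataset.py llama2-7b        # correction / paraphrasing / generalization editing data
python make_dataset_irre.py llama2-7b   # irrelevant-testing editing data
cd ../fastedit_comp
bash test_batch.sh                      # -> results/v0
bash test_batch_irre.sh                 # -> results/irrelevant
python results/aggregate.py v0          # averaged results (use "irrelevant" for the irrelevant split)
```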
If you find the paper or the repo helpful, it would be lovely if you considered citing the paper (with the following BibTeX):
@inproceedings{li-etal-2024-understanding,
title = "Understanding and Patching Compositional Reasoning in {LLM}s",
author = "Li, Zhaoyi and
Jiang, Gangwei and
Xie, Hong and
Song, Linqi and
Lian, Defu and
Wei, Ying",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.576",
pages = "9668--9688",
abstract = "LLMs have marked a revolutionary shift, yet they falter when faced with compositional reasoning tasks. Our research embarks on a quest to uncover the root causes of compositional reasoning failures of LLMs, uncovering that most of them stem from the improperly generated or leveraged implicit reasoning results. Inspired by our empirical findings, we resort to Logit Lens and an intervention experiment to dissect the inner hidden states of LLMs. This deep dive reveals that implicit reasoning results indeed surface within middle layers and play a causative role in shaping the final explicit reasoning results. Our exploration further locates multi-head self-attention (MHSA) modules within these layers, which emerge as the linchpins in accurate generation and leveraging of implicit reasoning results. Grounded on the above findings, we develop CREME, a lightweight method to patch errors in compositional reasoning via editing the located MHSA modules. Our empirical evidence stands testament to CREME{'}s effectiveness, paving the way for autonomously and continuously enhancing compositional reasoning capabilities in language models.",
}