This study addresses a gap in existing work, which primarily focuses on the fact-verification part of the fact-checking process rather than on evidence retrieval, leading to scalability issues in practical applications. We use various methods for identifying and indexing supporting facts to enhance the retrieval phase of the fact-checking pipeline. More information about our work can be found here.
├── *data
│ ├── db_files
│ ├── embed_files
│ ├── enwiki_files
│ ├── hover_files
│ ├── hover
│ ├── wice
│ ├── jpq_doc
├── *model
├── *out
│ ├── hover
│ │ ├── exp1.0
│ │ ├── bm25
│ │ ├── faiss
│ ├── wice
│ │ ├── exp1.0
│ │ ├── bm25
│ │ ├── faiss
├── scripts
├── src
│ ├── hover
│ ├── retrieval
│ │ ├── bm25
│ │ ├── faiss
│ │ ├── JPQ
│ │ ├── unq
│ ├── tools
│ ├── wikiextractor
├── .gitignore
├── grounding_env.yml
└── hover_env.yml
The (sub)folders data, model, and out (marked with an asterisk) are instantiated when setting up the project. The folder structure is laid out as follows:
- data: Should contain all corpus data (Wikipedia files), claim data and, generally, intermediate data for running the pipeline.
- model: Should contain the document and query encoders as well as optional custom models for reranking.
- out: Will contain the model checkpoints for each HoVer pipeline stage as well as the predictions for each checkpoint.
- scripts: Bash scripts for running the three retrieval pipeline settings (bm25, faiss, jpq) as well as shell scripts for setting up the data folder and downloading the Wikipedia dump.
- src: Contains the main HoVer pipeline, the retrieval folder containing the retrieval methods, tools for miscellaneous helper code, and the forked wikiextractor for processing the Wikipedia dump.
To install all dependencies, we recommend using Anaconda and creating an environment from the grounding_env.yml
file. Please ensure that the environment is named "grounding", as the scripts explicitly attempt to activate it. Alternatively, rename all instances in the scripts folder.
Since HoVer has a somewhat outdated codebase, and to avoid breaking existing working code, a separate environment file, hover_env.yml
, has been created with older dependencies. As with the "grounding" environment, ensure that this environment is named "hover".
foo@bar:~$ conda env create --file=grounding_env.yml
foo@bar:~$ conda env create --file=hover_env.yml
The first step is creating the necessary folders that will hold the data for running the pipeline:
foo@bar:~$ ./scripts/download_data.sh
For our experiments we used the following corpus and claim data:
- the processed 2017 English Wikipedia dump provided by HotPotQA, with HoVer as claim data.
- the processed 2023 English Wikipedia dump from Wikimedia, with WiCE as claim data.
{ ... },
{
  "id": 47094971,
  "url": "https://en.wikipedia.org/wiki?curid=47094971",
  "title": "Dark Souls",
  "text": [["Dark Souls"], ["Dark Souls is a series of action role-playing games</a> developed by <a href=\"/wiki/FromSoftware\">FromSoftware</a> and published by <a href=\"/wiki/Bandai_Namco_Entertainment\">Bandai Namco Entertainment</a>. ", ...], ... ]
},
{ ... },
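For reference, a minimal sketch of reading the processed corpus files in Python, assuming one JSON object per line inside each bz2 file (the folder path matches the WikiExtractor output used below):

import bz2
import json
from pathlib import Path

def iter_articles(folder: str):
    """Yield article dicts from the processed wiki bz2 files (assumes one JSON object per line)."""
    for path in sorted(Path(folder).rglob("*.bz2")):
        with bz2.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield json.loads(line)

# Example: print the title and paragraph count of the first article.
for article in iter_articles("data/enwiki_files/enwiki-latest"):
    print(article["title"], len(article["text"]))
    break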
To use a different Wikipedia dump, download it from the aforementioned Wikimedia dump site (for example, the latest one). To simplify the process, we have automated it in the following script (which requires the dependencies explained further down):
foo@bar:~$ ./scripts/preprocess_wiki_dump.sh
The exact steps this script performs are as follows. To use the data in the pipeline, some preprocessing is first needed using HotPotQA's forked WikiExtractor to obtain the above format. First install the WikiExtractor as a pip module and afterwards preprocess the dump with it:
foo@bar:~$ cd src/wikiextractor
foo@bar:~$ pip install .
foo@bar:~$ cd ../..
foo@bar:~$ python -m src.wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 -o data/enwiki_files/enwiki-latest --no-templates -c --json
Lastly, as the sentences are still concatenated, some further processing is needed. This requires an NLP model for sentence splitting. We provide implementations for both StanfordCoreNLP and spaCy's en_core_web_lg model.
For StanfordCoreNLP, run the following inside its folder in a separate terminal:
foo@bar:~$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000
For spaCy, install the package with the correct model:
foo@bar:~$ python -m spacy download en_core_web_lg
Lastly, to split the English Wikipedia articles into sentences (--use_spacy is an optional argument to pass along):
foo@bar:~$ python -m src.tools.enwiki_processing --start_loc=enwiki-latest --end_loc=enwiki-latest-original --process_function=preprocess
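As a rough illustration of the spaCy path (the actual splitting logic lives in src.tools.enwiki_processing), a minimal sketch:

import spacy

# Load the large English model downloaded above.
nlp = spacy.load("en_core_web_lg")

paragraph = ("Dark Souls is a series of action role-playing games developed by FromSoftware. "
             "It is published by Bandai Namco Entertainment.")
# Split the concatenated paragraph text back into individual sentences.
sentences = [sent.text for sent in nlp(paragraph).sents]
print(sentences)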
For the reranking setup, the supporting facts in the data corpus (Wikipedia files) first need to be pre-computed. This is done with the enwiki_processing module, whose arguments are listed below:
usage: enwiki_processing.py [-h] --start_loc=START_LOC --end_loc=END_LOC --process_function=PROCESS_FUNCTION [--first_paragraph_only] [--do_mmr] [--use_spacy]
required arguments:
--start_loc          Input folder containing the wiki bz2 files to process.
--end_loc            Output folder to save the processed wiki bz2 files to.
                     For fusion this will be the folder to combine from instead.
--process_function   Type of processing to perform on the corpus:
                     - preprocess: Sentence splitting of the wiki-extracted dump text.
                     - claim_detect: Perform claim detection using a Huggingface model.
                     - wice: Perform claim detection using a custom model. Note: not
                     used in our work, but serves as an example of how to integrate a
                     custom model into our code.
                     - cite: Extract citations from the corpus. Requires an online
                     connection due to asynchronous scraping, as well as either
                     StanfordCoreNLP running in the background or the downloaded spaCy
                     model in place.
                     - fusion: Combine two corpus datasets by searching for entries in
                     the first location that are empty and filling them in with data
                     from the second location.
                     Note: for all reranking setups, this creates an additional field
                     'fact_text' for each article instead of overwriting the 'text' field.
optional arguments:
-h, --help           show this help message and exit
--do_mmr             Store the top-k sentences from the supporting fact extraction. Used
                     only in the custom model code.
--use_spacy          Use the spaCy model instead of StanfordCoreNLP for sentence splitting.
                     Used for the process functions preprocess and cite.
Commands for recreating our experiments:
foo@bar:~$ python -m src.tools.enwiki_processing --start_loc=enwiki-latest-original --end_loc=enwiki-latest-claim --process_function=claim_detect
foo@bar:~$ python -m src.tools.enwiki_processing --start_loc=enwiki-latest-original --end_loc=enwiki-latest-cite --process_function=cite --use_spacy
foo@bar:~$ python -m src.tools.enwiki_processing --start_loc=enwiki-latest-cite --end_loc=enwiki-latest-claim --process_function=fusion
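For clarity, the fusion step conceptually does something like the following. This is a simplified sketch on in-memory dictionaries mapping article titles to extracted supporting-fact sentences; the fuse helper is hypothetical and not part of the code base, which operates on the bz2 files directly:

def fuse(first: dict, second: dict) -> dict:
    """Fill empty entries in the first corpus with the corresponding entries from the second.

    Both arguments are assumed to map article titles to lists of extracted sentences.
    """
    fused = dict(first)
    for title, sentences in first.items():
        if not sentences:  # entry in the first location is empty
            fused[title] = second.get(title, [])  # fill in from the second location
    return fused

# Example: citation extraction results take priority, claim-detection results fill the gaps.
combined = fuse({"Dark Souls": []}, {"Dark Souls": ["Dark Souls is a series of action role-playing games."]})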
Before running the pipeline, the data corpus also needs to be converted into database files.
usage: data_processing.py [-h] --setting=SETTING [--split_sent] [--first_para_only] [--store_original] [--pre_compute_embed]
required arguments:
--setting            The (reranked) corpus files to process. Converts the bz2 file
                     structure into a single database file containing the title as id
                     and the text as value. For the text, the sentences are concatenated
                     with [SENT] as a separator token in between. Additionally, temporarily
                     creates a single json file containing all entries to measure the
                     corpus size.
optional arguments:
-h, --help           show this help message and exit
--split_sent         Store individual sentences in the database instead of the
                     concatenated text per Wikipedia article.
--first_para_only    Only process the first paragraph and save the database file with a
                     `-first` suffix, else with a `-full` suffix. Not used in our work,
                     as Wikipedia does not have citations in the lead sections of its
                     articles.
--store_original     Store the 'text' field values of the Wikipedia bz2 files in the
                     database instead of the 'fact_text' values. Only used for the
                     non-reranked data corpus in our experiments.
--pre_compute_embed  Pre-compute the vector embeddings for the corpus, which can speed
                     up the FAISS index construction (see below).
Commands for recreating our experiments:
foo@bar:~$ python -m src.tools.data_processing --setting=enwiki-latest-original --store_original --pre_compute_embed
foo@bar:~$ python -m src.tools.data_processing --setting=enwiki-latest-claim --pre_compute_embed
foo@bar:~$ python -m src.tools.data_processing --setting=enwiki-latest-cite --pre_compute_embed
foo@bar:~$ python -m src.tools.data_processing --setting=enwiki-latest-fusion --pre_compute_embed
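To sanity-check the result, the database can be inspected directly. The sketch below assumes an SQLite file under data/db_files; the file name, table name, and column names are placeholders and may differ from the actual output of data_processing.py:

import sqlite3

# Hypothetical file and schema names; adjust to the output of data_processing.py.
connection = sqlite3.connect("data/db_files/enwiki-latest-original-full.db")
cursor = connection.cursor()
cursor.execute("SELECT id, text FROM documents LIMIT 1")
title, text = cursor.fetchone()
# Sentences were concatenated with [SENT] as separator when building the database.
print(title, text.split("[SENT]")[:2])
connection.close()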
To run the entire HoVer pipeline with each of our three retrieval settings (bm25, faiss, jpq), we have created the following scripts.
foo@bar:~$ ./scripts/run_bm25_pipeline.sh CLAIM_NAME SETTING BM25_TYPE
Requires an Elasticsearch instance running in the background.
# start process
foo@bar:~$ ./elasticsearch-8.11.3/bin/elasticsearch -d -p pid
# kill process
foo@bar:~$ pkill -F ./elasticsearch-8.11.3/pid
Additionally, when using BM25_TYPE 'original', sentence splitting needs to be in place, similar to the supporting fact extraction part.
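For reference, BM25 retrieval against the running Elasticsearch instance boils down to a match query. A minimal sketch with the official Python client; the index and field names are placeholders, the pipeline scripts define the actual ones:

from elasticsearch import Elasticsearch

# Connect to the locally running instance started above.
es = Elasticsearch("http://localhost:9200")

claim = "Dark Souls was developed by FromSoftware."
# "enwiki-latest-original" and "text" are placeholder index/field names.
response = es.search(index="enwiki-latest-original", query={"match": {"text": claim}}, size=5)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))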
foo@bar:~$ ./scripts/run_faiss_pipeline.sh CLAIM_NAME SETTING HOVER_STAGE RETRIEVAL_MODE
Pre-computing the vector embeddings can speed up the index construction part. Requires the Sentence Transformers model all-MiniLM-L6-v2.
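As an illustration of this dense retrieval setting, a minimal sketch that embeds documents with all-MiniLM-L6-v2 and searches a flat FAISS index, assuming cosine similarity over normalized embeddings (the actual pipeline works on the pre-computed corpus embeddings instead of this toy document list):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["Dark Souls is a series of action role-playing games.",
             "FromSoftware is a Japanese video game developer."]

# Normalized embeddings make inner-product search equivalent to cosine similarity.
doc_emb = model.encode(documents, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

query_emb = model.encode(["Who developed Dark Souls?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_emb, 2)
print(scores, ids)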
foo@bar:~$ ./scripts/run_compress_pipeline.sh CLAIM_NAME SETTING HOVER_STAGE RETRIEVAL_MODE SUBVECTORS
Training encoders from scratch is possible by ensuring there is no encoders folder (setting name) in the data/jpq_doc
folder. For more information, see src/retrieval.
Explanation of the arguments:
arguments:
CLAIM_NAME       Name of the claim dataset to run for [hover | wice]
SETTING          Name of the data corpus to run for, i.e. the name of the database file
BM25_TYPE        Run the original pipeline setting or reranking, which skips the sentence
                 selection stage (original)
HOVER_STAGE      Perform the claim verification stage immediately or include the sentence
                 selection stage (sent_select)
RETRIEVAL_MODE   Perform cpu or gpu retrieval (default: cpu)
SUBVECTORS       Number of subvectors to use (default: 96); see the sketch below
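As a rough illustration of what the SUBVECTORS argument controls, plain product quantization in FAISS splits each embedding into m subvectors that are each quantized against their own codebook. Note that JPQ additionally trains the query encoder jointly with the quantizer, which this sketch does not do; the dimension and data here are purely illustrative:

import faiss
import numpy as np

d = 768        # example embedding dimension (must be divisible by m)
m = 96         # number of subvectors (the SUBVECTORS argument)
nbits = 8      # bits per subvector codebook

# Stand-in random embeddings purely for illustration.
embeddings = np.random.rand(10000, d).astype("float32")

index = faiss.IndexPQ(d, m, nbits)
index.train(embeddings)   # learn the m codebooks
index.add(embeddings)     # store compressed codes (m bytes per vector at 8 bits)
print(index.ntotal)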
For more information on retrieval methods, read the README inside src/retrieval.
For more information on each HoVer stage, read the README inside src/hover.
Below are the works whose existing code bases we utilised and modified in part.
"HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification" in Findings of EMNLP, 2020. (paper | code).
"WiCE: Real-World Entailment for Claims in Wikipedia" in Findings of EMNLP, 2023. (paper | code).
"Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance" CIKM, 2021. (paper | code)
"Unsupervised Neural Quantization for Compressed-Domain Similarity Search" (paper | code)
*"WikiExtractor" (code)
To ensure consistent code style and readability, this project uses auto-formatting tools such as black and isort. Additionally, for readability and reproducibility purposes, we use type hints for the majority of the functions we created, along with docstrings following the Google format.
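For illustration, a hypothetical helper in the style used throughout the project (type hints plus a Google-format docstring); it is not an actual function from the code base:

def top_k_sentences(scores: list[float], sentences: list[str], k: int = 5) -> list[str]:
    """Return the k highest-scoring sentences.

    Args:
        scores: Relevance score per sentence.
        sentences: Candidate sentences, aligned with scores.
        k: Number of sentences to keep.

    Returns:
        The sentences sorted by descending score, truncated to k entries.
    """
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in ranked[:k]]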