Zero-shot-Fact-Verification-by-Claim-Generation

This is a fork by AIC

This repository contains code and models for the paper: Zero-shot Fact Verification by Claim Generation (ACL-IJCNLP 2021).

We explore the possibility of automatically generating large-scale (evidence, claim) pairs to train a fact verification model.
We propose a simple yet general framework, Question Answering for Claim Generation (QACG), to generate three types of claims from any given evidence: 1) claims that are supported by the evidence, 2) claims that are refuted by the evidence, and 3) claims for which the evidence does Not have Enough Information (NEI) to verify.
We show that the generated training data can greatly benefit the fact verification system in both zero-shot and few-shot learning settings.

General Framework of QACG

Example of Generated Claims

Requirements

Python 3.7.3
torch 1.7.1
tqdm 4.49.0
transformers 4.3.3
stanza 1.1.1
nltk 3.5
scikit-learn 0.23.2
sense2vec

Data Preparation

The data used in our paper is constructed from the original FEVER dataset. We use the gold evidence sentences in FEVER for the SUPPORTED and REFUTED claims. We collect evidence sentences for the NEI class using the retrieval method proposed in the Papelo system from FEVER'2018. The detailed data processing procedure is introduced here.

Our processed dataset is publicly available on Google Cloud Storage: https://storage.cloud.google.com/few-shot-fact-verification/

You can download the files to the data folder using gsutil:

gsutil cp gs://few-shot-fact-verification/data/* ./data/

There are two files in the folder:

fever_train.processed.json
fever_dev.processed.json

One data sample is as follows:

{
    "id": 22846,
    "context": "Penguin Books was founded in 1935 by Sir Allen Lane as a line of the publishers The Bodley Head , only becoming a separate company the following year .",
    "ori_evidence": [
      [
        "Penguin_Books",
        1,
        "It was founded in 1935 by Sir Allen Lane as a line of the publishers The Bodley Head , only becoming a separate company the following year ."
      ]
    ],
    "claim": "Penguin Books is a publishing house founded in 1930.",
    "label": "REFUTES"
}
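As a rough illustration of how records with this shape can be inspected, the sketch below counts the labels in a list of samples. It assumes each processed file parses to a single JSON list of objects with the fields shown above; that file layout is an assumption, and `load_samples` and `label_counts` are hypothetical helper names, not part of this repository.

```python
import json
from collections import Counter


def load_samples(path):
    """Load samples from a processed file (ASSUMED to hold one JSON list)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def label_counts(samples):
    """Count how many samples carry each label."""
    return Counter(sample["label"] for sample in samples)


# Inline copy of the sample above, so the sketch runs without the data files.
sample = {
    "id": 22846,
    "context": "Penguin Books was founded in 1935 by Sir Allen Lane as a line "
               "of the publishers The Bodley Head , only becoming a separate "
               "company the following year .",
    "ori_evidence": [
        [
            "Penguin_Books",
            1,
            "It was founded in 1935 by Sir Allen Lane as a line of the "
            "publishers The Bodley Head , only becoming a separate company "
            "the following year .",
        ]
    ],
    "claim": "Penguin Books is a publishing house founded in 1930.",
    "label": "REFUTES",
}

print(label_counts([sample]))
```

With the real data, `label_counts(load_samples("./data/fever_dev.processed.json"))` would give the class balance of the dev set, provided the layout assumption holds.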

Claim Generation

Given a piece of evidence in FEVER, we generate three different types of claims: SUPPORTED, REFUTED, and NEI. The code is in the Claim_Generation folder.

a) NER Extraction

First, we extract all named entities in the evidence.

mkdir -p ../output/intermediate/

python Extract_NERs.py \
    --train_path ../data/fever_train.processed.json \
    --dev_path ../data/fever_dev.processed.json \
    --save_path ../output/intermediate/

b) Question Generation

Then, we generate (question, answer) pairs from the evidence, using each named entity as the answer.

For the question generator, we use the pretrained QG model from patil-suraj, a Google T5 model finetuned on the SQuAD 1.1 dataset. Given an input text D and an answer A, the question generator outputs a question Q.
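To make the (D, A) input concrete, here is a minimal sketch of how an answer span might be marked inside the evidence before it is passed to an answer-aware QG model. It assumes a highlight-style input, where the answer is wrapped in `<hl>` tokens and the text is prefixed with `generate question:`; whether Generate_QAs.py formats its input exactly this way is not stated here, and `build_qg_input` is a hypothetical helper.

```python
def build_qg_input(text, answer):
    """Wrap the first occurrence of `answer` in `text` with <hl> markers.

    ASSUMPTION: highlight-style input for an answer-aware QG model; the
    exact format used by this repository may differ.
    """
    start = text.find(answer)
    if start < 0:
        raise ValueError("answer span not found in text")
    end = start + len(answer)
    highlighted = f"{text[:start]}<hl> {answer} <hl>{text[end:]}"
    return f"generate question: {highlighted}"


evidence = "Penguin Books was founded in 1935 by Sir Allen Lane ."
print(build_qg_input(evidence, "1935"))
```

Only the first occurrence of the answer is highlighted; an entity that appears several times in the evidence would need its character offsets tracked instead.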

Run the following commands to generate (Q, A) pairs for the entire FEVER dataset.

python Generate_QAs.py \
    --train_path ../data/fever_train.processed.json \
    --dev_path ../data/fever_dev.processed.json \
    --data_split train \
    --entity_dict ../output/intermediate/entity_dict_train.json \
    --save_path ../output/intermediate/precompute_QAs_train.json

python Generate_QAs.py \
    --train_path ../data/fever_train.processed.json \
    --dev_path ../data/fever_dev.processed.json \
    --data_split dev \
    --entity_dict ../output/intermediate/entity_dict_dev.json \
    --save_path ../output/intermediate/precompute_QAs_dev.json
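The two invocations above differ only in the split name, so they can be assembled from one template. The sketch below builds the argument lists in Python without running anything; the flags and paths are copied from the commands above, while `generate_qas_command` and its default directories are hypothetical conveniences.

```python
def generate_qas_command(split, data_dir="../data",
                         out_dir="../output/intermediate"):
    """Return the Generate_QAs.py argument list for 'train' or 'dev'."""
    if split not in ("train", "dev"):
        raise ValueError(f"unknown split: {split}")
    return [
        "python", "Generate_QAs.py",
        "--train_path", f"{data_dir}/fever_train.processed.json",
        "--dev_path", f"{data_dir}/fever_dev.processed.json",
        "--data_split", split,
        "--entity_dict", f"{out_dir}/entity_dict_{split}.json",
        "--save_path", f"{out_dir}/precompute_QAs_{split}.json",
    ]


for split in ("train", "dev"):
    # Print the command; pass the list to subprocess.run(..., check=True)
    # from the Claim_Generation folder to actually execute it.
    print(" ".join(generate_qas_command(split)))
```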

c) Claim Generation

We use the pretrained Sense2Vec (Trask et al., 2015) to find answer replacements for generating REFUTED claims. The pretrained model can be downloaded here. Download the model and unzip it to the ./dependencies/ folder.

Then, download the pretrained QA2D model from Google Cloud here. You can download it to the ./dependencies/QA2D_model/ folder using gsutil:

gsutil cp gs://few-shot-fact-verification/QA2D_model/* ./dependencies/QA2D_model/

Finally, run Claim_Generation.py to generate claims from FEVER.

SUPPORTED Claims

Here is an example of generating SUPPORTED claims from the FEVER train set.

python Claim_Generation.py \
    --split train \
    --train_path ../data/fever_train.processed.json \
    --dev_path ../data/fever_dev.processed.json \
    --entity_dict ../output/intermediate/entity_dict_train.json \
    --QA_path ../output/intermediate/precompute_QAs_train.json \
    --QA2D_model_path ../dependencies/QA2D_model \
    --sense_to_vec_path ../dependencies/s2v_old \
    --save_path ../output/SUPPORTED_claims.json \
    --claim_type SUPPORTED

REFUTED Claims

Here is an example of generating REFUTED claims from the FEVER train set.

python Claim_Generation.py \
    --split train \
    --train_path ../data/fever_train.processed.json \
    --dev_path ../data/fever_dev.processed.json \
    --entity_dict ../output/intermediate/entity_dict_train.json \
    --QA_path ../output/intermediate/precompute_QAs_train.json \
    --QA2D_model_path ../dependencies/QA2D_model \
    --sense_to_vec_path ../dependencies/s2v_old \
    --save_path ../output/REFUTED_claims.json \
    --claim_type REFUTED

NEI Claims

Because generating NEI claims requires additional context, we need access to the Wikipedia dump associated with FEVER. The wiki dump can be downloaded with the following commands:

wget https://s3-eu-west-1.amazonaws.com/fever.public/wiki-pages.zip
unzip wiki-pages.zip -d ./data

Here is an example of generating NEI claims from the FEVER train set.

python Claim_Generation.py \
    --split train \
    --train_path ../data/fever_train.processed.json \
    --dev_path ../data/fever_dev.processed.json \
    --entity_dict ../output/intermediate/entity_dict_train.json \
    --QA_path ../output/intermediate/precompute_QAs_train.json \
    --QA2D_model_path ../dependencies/QA2D_model \
    --sense_to_vec_path ../dependencies/s2v_old \
    --save_path ../output/NEI_claims.json \
    --claim_type NEI \
    --wiki_path ../data/wiki-pages/
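The three Claim_Generation.py runs share most of their arguments; only `--claim_type`, `--save_path`, and (for NEI) `--wiki_path` change. The sketch below builds just those varying arguments, as one possible way to script all three runs. `claim_generation_args` is a hypothetical helper, and treating `--wiki_path` as needed only for NEI is inferred from the example commands rather than from the script itself.

```python
CLAIM_TYPES = ("SUPPORTED", "REFUTED", "NEI")


def claim_generation_args(claim_type, wiki_path=None):
    """Return the arguments that differ between the three claim types."""
    if claim_type not in CLAIM_TYPES:
        raise ValueError(f"unknown claim type: {claim_type}")
    args = [
        "--save_path", f"../output/{claim_type}_claims.json",
        "--claim_type", claim_type,
    ]
    if claim_type == "NEI":
        # ASSUMPTION: only NEI generation reads the Wikipedia dump.
        if wiki_path is None:
            raise ValueError("NEI claims need --wiki_path")
        args += ["--wiki_path", wiki_path]
    return args


for claim_type in CLAIM_TYPES:
    print(" ".join(claim_generation_args(claim_type, "../data/wiki-pages/")))
```

The returned list is meant to be appended to the shared arguments (`--split`, `--train_path`, `--dev_path`, `--entity_dict`, `--QA_path`, `--QA2D_model_path`, `--sense_to_vec_path`) shown in the commands above.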