Data, models, and code to reproduce the experiments from our paper *Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language*, currently under review for the NCAA journal.

    @article{drchal2023pipeline,
      title={Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language},
      author={Drchal, Jan and Ullrich, Herbert and Mlyn{\'a}{\v{r}}, Tom{\'a}{\v{s}} and Moravec, V{\'a}clav},
      journal={arXiv preprint arXiv:2312.10171},
      year={2023}
    }
- QACG Data Generation -- our fork of the original QACG procedure.
- ColBERTv2 -- our fork of ColBERTv2. Retrieval for FactSearch is exposed via a REST API.
- anserini-indexing -- a wrapper for Anserini BM25. Retrieval for FactSearch is likewise exposed via a REST API (see the sketch after this list).
- FactSearch -- the source code is hosted in this repository.
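
Both retrieval backends are queried by FactSearch over HTTP. The minimal sketch below shows what such a call could look like; the host, port, endpoint path, and parameter names are placeholders for illustration only, not the actual API of either fork.

```python
# Minimal sketch of querying a retrieval backend over REST.
# The endpoint path and parameter names below are hypothetical;
# consult the ColBERTv2 / anserini-indexing forks for the real API.
import requests

RETRIEVER_URL = "http://localhost:5000/search"  # hypothetical endpoint


def retrieve(claim: str, k: int = 10) -> list[dict]:
    """Return the top-k retrieved passages for a claim."""
    response = requests.get(RETRIEVER_URL, params={"query": claim, "k": k})
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    for hit in retrieve("Prague is the capital of the Czech Republic."):
        print(hit)
```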
The training data for the following models was created by machine translation using DeepL; see the paper for more details. A usage sketch of the two models follows the list below.
- Question Generation model trained on a concatenation of Czech, English, Polish, and Slovak SQuAD datasets:
- Claim Generation model trained on a concatenation of Czech, English, Polish, and Slovak QA2D datasets:
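
The two models are meant to be chained in the QACG pipeline: the Question Generation model turns an answer span and its context into a question, and the Claim Generation model rewrites the question-answer pair into a declarative claim. Below is a minimal sketch using Hugging Face transformers; the checkpoint paths and prompt formats are assumptions for illustration, not the released models' exact interface.

```python
# Sketch of chaining the QG and claim (QA2D) models with Hugging Face
# transformers. Checkpoint paths and prompt formats are placeholders;
# see the released models for the actual input conventions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

QG_CHECKPOINT = "path/to/multilingual-qg"        # placeholder
CLAIM_CHECKPOINT = "path/to/multilingual-qa2d"   # placeholder


def generate(checkpoint: str, prompt: str) -> str:
    """Run a seq2seq checkpoint on a single prompt and decode the output."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


context = "Prague is the capital and largest city of the Czech Republic."
answer = "Prague"

# 1) answer + context -> question (assumed prompt format)
question = generate(QG_CHECKPOINT, f"answer: {answer} context: {context}")
# 2) question + answer -> declarative claim (assumed prompt format)
claim = generate(CLAIM_CHECKPOINT, f"question: {question} answer: {answer}")
print(claim)
```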
All QACG-generated datasets are built from the corresponding Wikipedia snapshots using the QACG models above. QACG-mix combines samples from all four languages while keeping the size of a single per-language dataset; QACG-sum is the full concatenation of all four per-language datasets and is therefore roughly four times larger.
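
For illustration, the sketch below shows one way the two combined splits could be assembled from four per-language datasets of equal size; it only mirrors the description above and is not the script used to build the released data.

```python
# Illustrative sketch of assembling QACG-mix and QACG-sum from four
# per-language datasets of (roughly) equal size N.
import random


def build_mix(per_language: dict[str, list], seed: int = 0) -> list:
    """QACG-mix: sample N/4 claims per language so the combined split
    keeps the size of a single per-language dataset (N)."""
    rng = random.Random(seed)
    n = len(next(iter(per_language.values())))
    per_lang_share = n // len(per_language)
    mix = []
    for samples in per_language.values():
        mix.extend(rng.sample(samples, per_lang_share))
    rng.shuffle(mix)
    return mix


def build_sum(per_language: dict[str, list]) -> list:
    """QACG-sum: full concatenation of all languages (~4x larger than mix)."""
    return [sample for samples in per_language.values() for sample in samples]
```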