MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts

MultiOCR-QA is a multilingual QA dataset designed to assess the impact of OCR errors on QA systems across English, French, and German. Because it is derived from centuries-old documents, it allows QA systems to be evaluated against the OCR-induced noise they would meet in real-world applications.

🗃️Dataset

Dataset Statistics

	English	French	German
#QA pairs	10,875	10,004	39,200
#Paragraphs	6,525	1,670	9,075
Average paragraph length (words)	219.09	297.53	212.86
Average question length (words)	10.98	8.73	8.08
Average answer length (words)	2.05	3.12	5.63
Average questions per paragraph	1.67	5.99	4.32
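
The last row of the table follows from the first two: it is the number of QA pairs divided by the number of paragraphs. A minimal sketch of that check, using only the counts from the table (the names `stats` and `questions_per_paragraph` are illustrative and not part of the repository):

```python
# Counts copied from the statistics table above.
stats = {
    "English": {"qa_pairs": 10_875, "paragraphs": 6_525},
    "French": {"qa_pairs": 10_004, "paragraphs": 1_670},
    "German": {"qa_pairs": 39_200, "paragraphs": 9_075},
}


def questions_per_paragraph(qa_pairs, paragraphs):
    """Average number of questions per paragraph, rounded to two decimals."""
    return round(qa_pairs / paragraphs, 2)


# Hypothetical helper dict: language -> average questions per paragraph.
ratios = {lang: questions_per_paragraph(**counts) for lang, counts in stats.items()}

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio}")
```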

Data Structure:

{
    "document_id": "",
    "rawOCR_text": "",
    "correctedOCR_text": "",
    "QA_pairs": [
        {
            "q_id": "",
            "question": "",
            "answer": ""
        }
    ]
}
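
As a rough illustration of how a record with the structure above might be consumed, the sketch below flattens one record into one example per QA pair, keeping both the raw and the corrected text so that a QA system can be run on each. The sample values and the function name `iter_examples` are invented for illustration. How records are arranged inside the downloadable files (for example, as a JSON list) is not stated here, so the sketch parses a single inline record instead of reading a file.

```python
import json

# Hypothetical record that follows the schema above.
# Every value here is made up for illustration and is not taken from the dataset.
sample = """
{
    "document_id": "doc-0001",
    "rawOCR_text": "Tbe quick brovvn fox.",
    "correctedOCR_text": "The quick brown fox.",
    "QA_pairs": [
        {
            "q_id": "doc-0001-q1",
            "question": "What colour is the fox?",
            "answer": "brown"
        }
    ]
}
"""


def iter_examples(record):
    """Yield one flat example per QA pair, carrying both text variants."""
    for pair in record["QA_pairs"]:
        yield {
            "document_id": record["document_id"],
            "q_id": pair["q_id"],
            "question": pair["question"],
            "answer": pair["answer"],
            "raw_context": record["rawOCR_text"],
            "corrected_context": record["correctedOCR_text"],
        }


record = json.loads(sample)
examples = list(iter_examples(record))

for example in examples:
    print(example["q_id"], "|", example["question"], "->", example["answer"])
```

Keeping `raw_context` and `corrected_context` side by side in each example makes it straightforward to ask the same question against both versions of a paragraph and compare the answers.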

English QA: Download
French QA: Download
German QA: Download

🪪License

This project is licensed under the MIT License - see the LICENSE file for details.

✨Citation

If you find this work useful, please cite 📜our paper:

Plain

Piryani, B., Mozafari, J., Abdallah, A., Doucet, A., & Jatowt, A. (2025). MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts. arXiv preprint arXiv:2502.16781.

Bibtex

@article{piryani2025multiocr,
  title={MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts},
  author={Piryani, Bhawna and Mozafari, Jamshid and Abdallah, Abdelrahman and Doucet, Antoine and Jatowt, Adam},
  journal={arXiv preprint arXiv:2502.16781},
  year={2025}
}

🙏Acknowledgments

Thanks to our contributors and the University of Innsbruck for supporting this project.
