MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts

DataScienceUIBK/MultiOCR-QA
MultiOCR-QA is a multilingual QA dataset designed to assess the impact of OCR errors on QA systems across English, French, and German. The dataset is derived from centuries-old documents, so it evaluates OCR-induced challenges as they occur in real-world applications.

🗃️Dataset

Dataset Statistics

| | English | French | German |
|---|---|---|---|
| #QA pairs | 10,875 | 10,004 | 39,200 |
| #Paragraphs | 6,525 | 1,670 | 9,075 |
| Average paragraph length (words) | 219.09 | 297.53 | 212.86 |
| Average question length (words) | 10.98 | 8.73 | 8.08 |
| Average answer length (words) | 2.05 | 3.12 | 5.63 |
| Average questions per paragraph | 1.67 | 5.99 | 4.32 |
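
As a quick sanity check, the "Average questions per paragraph" row can be reproduced from the first two rows by dividing the number of QA pairs by the number of paragraphs. The snippet below is only an illustrative sketch; the dictionary layout and the function name `questions_per_paragraph` are assumptions made here and are not part of the repository.

```python
# Counts copied from the Dataset Statistics table above.
# The dict layout is an assumption made for this example.
STATS = {
    "English": {"qa_pairs": 10_875, "paragraphs": 6_525},
    "French": {"qa_pairs": 10_004, "paragraphs": 1_670},
    "German": {"qa_pairs": 39_200, "paragraphs": 9_075},
}


def questions_per_paragraph(qa_pairs: int, paragraphs: int) -> float:
    """Return the average number of questions per paragraph, rounded to 2 decimals."""
    return round(qa_pairs / paragraphs, 2)


for language, counts in STATS.items():
    ratio = questions_per_paragraph(counts["qa_pairs"], counts["paragraphs"])
    print(f"{language}: {ratio}")
```

The computed ratios agree with the values reported in the table (1.67, 5.99, and 4.32).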

Data Structure:

{
    "document_id": "",
    "rawOCR_text": "",
    "correctedOCR_text": "",
    "QA_pairs": [
        {
            "q_id": "",
            "question": "",
            "answer": ""
        }
    ]
}
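
The structure above can be read with the Python standard library. The following is a minimal sketch, assuming each data file is a JSON array of records in this shape (the exact file layout and file names are not stated here, so check the released files first). The helper names `load_records`, `iter_qa_pairs`, and `answer_in_text` are hypothetical, and the sample record is invented purely for illustration. The substring check is a simple heuristic and not the evaluation method used in the paper.

```python
import json
from typing import Iterator


def load_records(path: str) -> list[dict]:
    """Load records from a JSON file (assumed to hold a list of records)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_qa_pairs(records: list[dict]) -> Iterator[tuple[str, str, str, str]]:
    """Yield (document_id, q_id, question, answer) for every QA pair."""
    for record in records:
        for qa in record.get("QA_pairs", []):
            yield record["document_id"], qa["q_id"], qa["question"], qa["answer"]


def answer_in_text(answer: str, text: str) -> bool:
    """Case-insensitive check of whether the answer string occurs in the text."""
    return answer.lower() in text.lower()


# Invented sample record (not taken from the dataset), following the schema above.
sample = [
    {
        "document_id": "doc-0",
        "rawOCR_text": "The treaty was signed in Par1s.",
        "correctedOCR_text": "The treaty was signed in Paris.",
        "QA_pairs": [
            {
                "q_id": "doc-0-q0",
                "question": "Where was the treaty signed?",
                "answer": "Paris",
            }
        ],
    }
]

by_id = {record["document_id"]: record for record in sample}
for doc_id, q_id, question, answer in iter_qa_pairs(sample):
    record = by_id[doc_id]
    in_raw = answer_in_text(answer, record["rawOCR_text"])
    in_corrected = answer_in_text(answer, record["correctedOCR_text"])
    print(q_id, question, "| in raw:", in_raw, "| in corrected:", in_corrected)
```

Comparing `rawOCR_text` with `correctedOCR_text` for the same question is one way to see how an OCR error can hide an answer that is present in the corrected paragraph.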

🪪License

This project is licensed under the MIT License - see the LICENSE file for details.

✨Citation

If you find this work useful, please cite 📜our paper:

Plain

Piryani, B., Mozafari, J., Abdallah, A., Doucet, A., & Jatowt, A. (2025). MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts. arXiv preprint arXiv:2502.16781

Bibtex

@article{piryani2025multiocr,
  title={MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts},
  author={Piryani, Bhawna and Mozafari, Jamshid and Abdallah, Abdelrahman and Doucet, Antoine and Jatowt, Adam},
  journal={arXiv preprint arXiv:2502.16781},
  year={2025}
}

🙏Acknowledgments

Thanks to our contributors and the University of Innsbruck for supporting this project.
