This repo contains data for HindiRC. Please cite our paper when using our dataset.
HindiRC is the first Reading Comprehension dataset for Hindi.
The dataset has been split into grades 2-5 according to educational reading proficiency.
The data has been sourced from the educational websites https://sandeepbarouli.com/ and 2classnotes. Hence, the text is natural and formal.
- A Reading Comprehension item consists of a passage, <passage>, and a set of questions, <q>.
- For HindiRC, the answers, <a>, belong to the passage.
- The index of these answers in the passage is given in <l>.
- If an answer spans multiple sentences, we consider only the most important sentence. However, the dataset also provides the indices of all of these sentences in <lg>.
- The dataset also provides a copy of the passage with all anaphors resolved.
The following tags have been used:
Data | XML tag |
---|---|
Passage - raw data | <passage> |
Passage with anaphors replaced by their referents (done manually) | <anaphoraresolved> |
Stemmed version of the passage | <stemmed> |
Question - raw data | <q> |
Answer - raw data | <a> |
Answer - index in passage | <l> |
Multi sentence answer - index in passage | <lg> |
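
As a rough illustration of how these tags might be read, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. Only the tag names in the table above come from the dataset. The wrapper elements `<doc>` and `<qa>`, the nesting, the placeholder text, and the helper name `read_record` are assumptions made for this example, so adjust them to match the actual files. The values of `<l>` and `<lg>` are kept as raw strings because their exact format is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <doc> and <qa> wrappers, the nesting, and the
# placeholder values are assumptions for illustration, not dataset content.
SAMPLE = """
<doc>
  <passage>Sentence one. Sentence two.</passage>
  <anaphoraresolved>Sentence one. Sentence two.</anaphoraresolved>
  <stemmed>Sentence one. Sentence two.</stemmed>
  <qa>
    <q>Question text?</q>
    <a>Sentence two.</a>
    <l>1</l>
  </qa>
</doc>
"""


def read_record(xml_text):
    """Collect the passage variants and question entries from one record."""
    root = ET.fromstring(xml_text)
    # Passage-level tags; a missing tag yields an empty string.
    record = {
        tag: (root.findtext(tag) or "").strip()
        for tag in ("passage", "anaphoraresolved", "stemmed")
    }
    # Question-level tags; <lg> is optional, so it may come back empty.
    record["questions"] = [
        {tag: (qa.findtext(tag) or "").strip() for tag in ("q", "a", "l", "lg")}
        for qa in root.iter("qa")
    ]
    return record


record = read_record(SAMPLE)
```

Each entry in `record["questions"]` is a dictionary keyed by tag name, so the answer index is available as `record["questions"][0]["l"]`.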
Please cite our paper when using this dataset:
Kaveri Anuranjana*, Vijjini Anvesh Rao*, Radhika Mamidi, "HindiRC: A Dataset for Reading Comprehension in Hindi", 20th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing, 2019.
@inproceedings{anuranjana2019hindirc,
title = {HindiRC: A Dataset for Reading Comprehension in Hindi},
author = {Anuranjana, Kaveri and Rao, Vijjini Anvesh and Mamidi, Radhika},
booktitle = {20th International Conference on Computational Linguistics and Intelligent Text Processing},
year = {2019}
}