NewsQA-es

NewsQA-es is a Spanish version of the NewsQA Dataset, created by researchers at Grupo PLN, UdelaR.

Obtaining the dataset

Due to licensing restrictions, we can't provide a download link. Instead, follow these steps to re-create the dataset by translating NewsQA:

1. Download the NewsQA dataset by following the steps on the NewsQA website.
2. Obtain the answer texts with the tools from Maluuba NewsQA.
3. Translate every sentence and question by following the translation steps described below.
4. Use a translation aligner to find, for each answer from NewsQA, the corresponding span of text in the translated Spanish sentence. Follow the steps in the repo pln-fing-udelar/Mask-Align.
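
To illustrate what the alignment step produces, here is a minimal sketch of projecting an answer span through a word alignment. It assumes the aligner yields `(source_index, target_index)` token pairs; the actual output format of Mask-Align may differ, and `project_span` is a hypothetical helper, not part of either repo.

```python
from typing import Iterable, Optional, Tuple


def project_span(
    alignment: Iterable[Tuple[int, int]],
    src_start: int,
    src_end: int,
) -> Optional[Tuple[int, int]]:
    """Map the source token span [src_start, src_end) to a target token span.

    `alignment` holds (source_index, target_index) pairs (assumed format).
    Returns the smallest target span [start, end) that covers every target
    token aligned to a source token inside the answer, or None if no token
    of the answer is aligned.
    """
    targets = [t for s, t in alignment if src_start <= s < src_end]
    if not targets:
        return None
    return min(targets), max(targets) + 1


# Made-up example: "the red car" -> "el coche rojo"
tgt = ["el", "coche", "rojo"]
alignment = [(0, 0), (1, 2), (2, 1)]  # the-el, red-rojo, car-coche

span = project_span(alignment, 1, 3)  # English answer: "red car"
if span is not None:
    print(" ".join(tgt[span[0]:span[1]]))  # coche rojo
```

Taking the minimum and maximum aligned indices keeps the Spanish answer contiguous even when word order changes, at the cost of occasionally including unaligned words that fall inside the span.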

We translated the dataset using the Opus-MT model from Helsinki-NLP. To reproduce it (having already downloaded the NewsQA dataset):

Clone this repo:

```
git clone https://github.com/pln-fing-udelar/newsqa-es
cd newsqa-es/
```

Set up the environment using Conda:

```
conda env create
conda activate newsqa-es
```

Place the extracted CNN stories from the NewsQA dataset under cnn_stories/cnn/stories:
```
mkdir cnn_stories
tar -xvf cnn_stories.tgz -C cnn_stories/
```
Run the following command to translate the dataset. Translation is slow, and you may benefit from having a GPU: for reference, it takes a bit less than a day and a half on a computer with an Nvidia RTX 2080 Ti GPU. Consider changing the BATCH_SIZE constant to best fit your hardware: a value that's too high may cause out-of-memory (OOM) errors, while a value that's too low underutilizes your resources and makes translation slower than it could be.
```
mkdir -p cnn_stories/cnn/translated
./translate.py
```
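
To show why the batch size matters, here is a minimal sketch of batched translation with an Opus-MT model through the Hugging Face `transformers` library. It is not the code in `translate.py`: the checkpoint name `Helsinki-NLP/opus-mt-en-es`, the `BATCH_SIZE` value, and the helper names are assumptions made for illustration.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 32  # assumed value; lower it if you hit OOM errors


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_all(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = BATCH_SIZE,
) -> List[str]:
    """Translate all sentences, one batch per model call, preserving order."""
    translated: List[str] = []
    for batch in batched(sentences, batch_size):
        translated.extend(translate_batch(batch))
    return translated


def make_opus_mt_translator(
    model_name: str = "Helsinki-NLP/opus-mt-en-es",  # assumed checkpoint
) -> Callable[[List[str]], List[str]]:
    """Build a batch translator; imports are local so the helpers above
    can be used without torch or transformers installed."""
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    def translate_batch(batch: List[str]) -> List[str]:
        # Pad to the longest sentence in the batch: larger batches mean
        # fewer model calls but more GPU memory per call.
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

With the model downloaded, `translate_all(sentences, make_opus_mt_translator())` would translate a list of English sentences. Each call pads the whole batch to its longest sentence, which is why memory use grows with the batch size.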
You will find the translated stories under the folder cnn_stories/cnn/translated/.
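
Since the run can take more than a day, it may help to check which stories still lack a translation. The sketch below assumes that each translated file keeps the file name of its source story, which this README does not confirm; `missing_translations` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def missing_translations(stories_dir: Path, translated_dir: Path) -> List[str]:
    """Return the names of story files with no same-named translated file.

    Assumes translated files reuse the source file names (unverified).
    """
    translated = {p.name for p in translated_dir.iterdir() if p.is_file()}
    return sorted(
        p.name
        for p in stories_dir.iterdir()
        if p.is_file() and p.name not in translated
    )
```

For example, `missing_translations(Path("cnn_stories/cnn/stories"), Path("cnn_stories/cnn/translated"))` returns a sorted list of the story file names that still need translating.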

TODO: how to go from these files to the newsqa.csv file required in Mask-Align?

If you encounter issues following these steps, please open a GitHub issue or email us at pln@fing.edu.uy.
