This script builds an n-way multilingual corpus from the data in the awesome Tatoeba dataset. Such a corpus lets you do pivot-free zero-shot machine translation and work with unusual language combinations.
Usage is:
python3 intersect_tatoeba.py Spanish jpn English
The arguments are the languages that you want to intersect, given either as ISO 639-3 names (e.g. English) or as codes (e.g. eng).
The output in this example will be corpus.jpn, corpus.spa, and corpus.eng.
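As a rough illustration of how name-or-code arguments could be normalised to the codes used in the output file names, here is a minimal sketch. The `NAME_TO_CODE` table and `resolve_language` helper are hypothetical and are not taken from intersect_tatoeba.py; the case-insensitive matching is also an assumption.

```python
# Hypothetical lookup table: only a few entries of the upstream language list.
NAME_TO_CODE = {
    "english": "eng",
    "spanish": "spa",
    "japanese": "jpn",
    "german": "deu",
    "french": "fra",
}


def resolve_language(arg):
    """Return the ISO 639-3 code for a language given by name or by code."""
    key = arg.strip().lower()  # assumption: matching ignores case
    if key in NAME_TO_CODE.values():
        return key  # already a known code
    if key in NAME_TO_CODE:
        return NAME_TO_CODE[key]  # a known name
    raise ValueError(f"Unknown language: {arg!r}")


# Same arguments as the usage example above.
codes = [resolve_language(a) for a in ["Spanish", "jpn", "English"]]
output_files = [f"corpus.{code}" for code in codes]
```

With these arguments, `codes` holds spa, jpn and eng, which matches the three output file names listed above.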
Before running the script, download and extract two files into this directory. They are not bundled here because they are constantly being updated upstream:
wget -c http://downloads.tatoeba.org/exports/sentences.tar.bz2 && tar jxvf sentences.tar.bz2
wget -c http://downloads.tatoeba.org/exports/links.tar.bz2 && tar jxvf links.tar.bz2
Then run the script. Enjoy!
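For readers curious what an n-way intersection over the two downloaded files might involve, here is a simplified sketch. It is not the actual logic of intersect_tatoeba.py. It assumes that the archives extract to tab-separated files named sentences.csv (id, language code, text) and links.csv (sentence id, translation id), and that links are listed in both directions. The function names and the "anchor language" approach are illustrative choices: a row is kept only when a sentence in the first language has a direct translation in every other requested language.

```python
from collections import defaultdict


def load_sentences(path, langs):
    """Read sentences.csv and return {id: (lang, text)} for the wanted languages."""
    sentences = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t", 2)
            if len(parts) == 3 and parts[1] in langs:
                sentences[int(parts[0])] = (parts[1], parts[2])
    return sentences


def load_links(path):
    """Yield (sentence_id, translation_id) pairs from links.csv."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                yield int(parts[0]), int(parts[1])


def intersect(sentences, links, langs):
    """Return one tuple of texts per anchor sentence translated into all langs.

    langs[0] is the anchor language; texts follow the order of langs.
    """
    anchor, others = langs[0], langs[1:]
    found = defaultdict(dict)  # anchor sentence id -> {lang: text}
    for src, dst in links:
        if src not in sentences or dst not in sentences:
            continue
        src_lang, _ = sentences[src]
        dst_lang, dst_text = sentences[dst]
        if src_lang == anchor and dst_lang in others:
            # Keep the first translation seen for each language.
            found[src].setdefault(dst_lang, dst_text)
    rows = []
    for sid in sorted(found):
        if all(lang in found[sid] for lang in others):
            rows.append((sentences[sid][1],) + tuple(found[sid][l] for l in others))
    return rows
```

Because only direct links from the anchor language are followed, the result depends on which language is listed first. A fuller implementation would probably also follow indirect links between the other languages.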
Here are some languages in the upstream dataset:
Language | ISO 639-3 Code | Sentences |
---|---|---|
English | eng | 641421 |
Esperanto | epo | 511221 |
Turkish | tur | 503109 |
Russian | rus | 479397 |
Italian | ita | 474880 |
German | deu | 366934 |
French | fra | 315677 |
Spanish | spa | 265058 |
Portuguese | por | 231807 |
Hungarian | hun | 191328 |
Japanese | jpn | 184296 |
Hebrew | heb | 153655 |
Berber | ber | 104842 |
(Hundreds more languages) |