PIÖTOST: A Manually Revised Lombard-Italian Parallel Corpus

This is a manually revised version of the Lombard-Italian parallel corpus from the WikiMatrix project (https://opus.nlpl.eu/WikiMatrix-v1.php), available on OPUS (https://opus.nlpl.eu/). The sentence pairs were checked by five Eastern Lombard (Brescian) speakers.

UPDATE 20221202: we added the Lombard and Italian sections of FLORES-200 (https://github.com/facebookresearch/flores/tree/main/flores200) as dev and devtest sets.

UPDATE 20230110: we kept only the last 100 lines of each FLORES-200 set as the validation and test sets, and added the remaining lines to the training set.
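The 20230110 re-split can be sketched as follows. This is only an illustration of the idea, not the script used to build the corpus: the function name `split_tail` and the in-memory list representation are assumptions, and the sentences in the usage example are placeholders.

```python
def split_tail(lmo_lines, it_lines, n_held_out=100):
    """Split a parallel set into (extra_train, held_out).

    The last `n_held_out` pairs are held out for evaluation and the
    remaining pairs are returned so they can be appended to the
    training set. Hypothetical helper, not part of the repository.
    """
    if len(lmo_lines) != len(it_lines):
        raise ValueError("both sides of a parallel corpus must have the same length")
    if not 0 <= n_held_out <= len(lmo_lines):
        raise ValueError("n_held_out must be between 0 and the corpus size")

    cut = len(lmo_lines) - n_held_out
    # Keep the two languages aligned by slicing both sides at the same index.
    extra_train = list(zip(lmo_lines[:cut], it_lines[:cut]))
    held_out = list(zip(lmo_lines[cut:], it_lines[cut:]))
    return extra_train, held_out


# Usage with placeholder sentences standing in for a 997-line set.
lmo = [f"lmo sentence {i}" for i in range(997)]
it = [f"it sentence {i}" for i in range(997)]
extra_train, held_out = split_tail(lmo, it)
print(len(extra_train), len(held_out))
```

Slicing both sides at the same index is what keeps the Lombard and Italian lines paired after the split.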

Corpus statistics

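| Split | N. of pairs | N. of words (LMO) | N. of words (IT) | Avg. sentence length (LMO) | Avg. sentence length (IT) |
|---|---|---|---|---|---|
| train | 5306 | 122,550 | 113,385 | 23.10 | 21.37 |
| dev | 997 | 25,531 | 22,984 | 25.61 | 23.05 |
| devtest | 1012 | 26,954 | 24,311 | 26.63 | 24.02 |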

*stats for original corpus
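The average sentence length in the table is the word count divided by the number of pairs; for the train split, 122,550 / 5306 ≈ 23.10 on the Lombard side. A minimal sketch of how such statistics could be computed is given below. It assumes whitespace tokenization, which may not be how the figures above were obtained, and `corpus_stats` is a hypothetical helper name.

```python
def corpus_stats(pairs):
    """Compute pair count, word counts and average sentence lengths.

    `pairs` is a list of (lombard_sentence, italian_sentence) tuples.
    Words are counted by splitting on whitespace (an assumption).
    """
    n_pairs = len(pairs)
    lmo_words = sum(len(lmo.split()) for lmo, _ in pairs)
    it_words = sum(len(it.split()) for _, it in pairs)
    return {
        "pairs": n_pairs,
        "lmo_words": lmo_words,
        "it_words": it_words,
        # Guard against division by zero for an empty corpus.
        "lmo_avg_len": lmo_words / n_pairs if n_pairs else 0.0,
        "it_avg_len": it_words / n_pairs if n_pairs else 0.0,
    }


# Usage with placeholder tokens instead of real sentences.
stats = corpus_stats([("a b c", "x y"), ("d", "z w v u")])
print(stats)
```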

If you use this work, please cite it as:

Signoroni, E. (2022). Piötòst Ché Niènt, Mèi Piötòst-A Manually Revised Lombard-Italian Parallel Corpus. RASLAN 2022 Recent Advances in Slavonic Natural Language Processing, 105.

@article{signoroni2022piotost,
  title={Pi{\"o}t{\`o}st Ch{\'e} Ni{\`e}nt, M{\`e}i Pi{\"o}t{\`o}st-A Manually Revised Lombard-Italian Parallel Corpus},
  author={Signoroni, Edoardo},
  journal={RASLAN 2022 Recent Advances in Slavonic Natural Language Processing},
  pages={105},
  year={2022}
}