This is a manually revised version of the Lombard-Italian parallel corpus from the WikiMatrix project (https://opus.nlpl.eu/WikiMatrix-v1.php), available on OPUS (https://opus.nlpl.eu/). The sentence pairs were checked by five Eastern Lombard (Brescian) speakers.
UPDATE 20221202: we added the Lombard and Italian sections of FLORES-200 (https://github.com/facebookresearch/flores/tree/main/flores200) as dev and devtest sets.

UPDATE 20230110: we kept only the last 100 lines from FLORES-200 as validation and test sets, and added the rest to the training set.
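The re-split described in the 20230110 update can be sketched roughly as follows. This is only an illustration, not the script used to build the release: the function name `resplit`, the toy data, and the mapping of FLORES-200 dev to the validation set and devtest to the test set are all assumptions, as is the reading that the last 100 lines are taken from each of the two sections.

```python
def resplit(train, dev, devtest, keep=100):
    """Move all but the last `keep` pairs of dev/devtest into train.

    Hypothetical helper: assumes each split is a list of aligned
    (Lombard, Italian) pairs and that dev -> valid, devtest -> test.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    dev_cut = max(len(dev) - keep, 0)
    devtest_cut = max(len(devtest) - keep, 0)
    new_train = list(train) + list(dev[:dev_cut]) + list(devtest[:devtest_cut])
    return new_train, list(dev[dev_cut:]), list(devtest[devtest_cut:])


# Toy data standing in for real sentence pairs.
train = [("lmo_train_%d" % i, "it_train_%d" % i) for i in range(5)]
dev = [("lmo_dev_%d" % i, "it_dev_%d" % i) for i in range(120)]
devtest = [("lmo_devtest_%d" % i, "it_devtest_%d" % i) for i in range(130)]

new_train, valid, test = resplit(train, dev, devtest)
print(len(new_train), len(valid), len(test))  # 55 100 100
```

Slicing by an explicit cut index rather than `dev[:-keep]` keeps the helper correct when `keep` is 0 or larger than the split.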
N. of pairs | N. of words (LMO) | N. of words (IT) | Avg. sentence length (LMO) | Avg. sentence length (IT) | Split
---|---|---|---|---|---
5306 | 122,550 | 113,385 | 23.10 | 21.37 | train
997 | 25,531 | 22,984 | 25.61 | 23.05 | dev
1012 | 26,954 | 24,311 | 26.63 | 24.02 | devtest
*Statistics are for the original corpus.
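As a quick consistency check, the average sentence lengths in the table can be recomputed as words divided by pairs. The sketch below assumes that the word counts are whole numbers (that is, any dot or comma in them is a thousands separator) and that the average is taken per sentence pair; the names `STATS` and `avg_sentence_length` are illustrative only.

```python
# (split, number of pairs, Lombard words, Italian words), copied from the table.
STATS = [
    ("train", 5306, 122550, 113385),
    ("dev", 997, 25531, 22984),
    ("devtest", 1012, 26954, 24311),
]


def avg_sentence_length(words, pairs):
    """Average number of words per sentence, rounded to two decimals."""
    return round(words / pairs, 2)


for split, pairs, lmo_words, it_words in STATS:
    lmo_avg = avg_sentence_length(lmo_words, pairs)
    it_avg = avg_sentence_length(it_words, pairs)
    print(f"{split}: LMO {lmo_avg:.2f}, IT {it_avg:.2f}")
# train: LMO 23.10, IT 21.37
# dev: LMO 25.61, IT 23.05
# devtest: LMO 26.63, IT 24.02
```

The recomputed values match the table, which supports reading the word counts as integers.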
If you use this work, please cite it as:
Signoroni, E. (2022). Piötòst Ché Niènt, Mèi Piötòst-A Manually Revised Lombard-Italian Parallel Corpus. RASLAN 2022 Recent Advances in Slavonic Natural Language Processing, 105.
@article{signoroni2022piotost,
  title={Pi{\"o}t{\`o}st Ch{\'e} Ni{\`e}nt, M{\`e}i Pi{\"o}t{\`o}st-A Manually Revised Lombard-Italian Parallel Corpus},
  author={Signoroni, Edoardo},
  journal={RASLAN 2022 Recent Advances in Slavonic Natural Language Processing},
  pages={105},
  year={2022}
}