A manually revised Lombard-Italian parallel corpus

edoardosignoroni/piotost

PIÖTOST: A Manually Revised Lombard-Italian Parallel Corpus

This is a manually revised version of the Lombard-Italian parallel corpus from the WikiMatrix project (https://opus.nlpl.eu/WikiMatrix-v1.php), available on OPUS (https://opus.nlpl.eu/). The sentence pairs were checked by five Eastern Lombard (Brescian) speakers.

UPDATE 20221202: we added the FLORES-200 (https://github.com/facebookresearch/flores/tree/main/flores200) Lombard and Italian sections as dev and devtest sets.

UPDATE 20230110: we kept only the last 100 lines from FLORES-200 as validation and test sets, and added the rest to the training set.
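The 20230110 re-split can be sketched as follows. This is only an illustration of "keep the last 100 lines, move the rest to training": the helper name `split_tail` and the use of in-memory lists are assumptions, and the repository's actual file names and scripts are not shown here.

```python
def split_tail(lines, n_held_out=100):
    """Split a list of lines into (rest, held_out).

    held_out is the last `n_held_out` lines; rest is everything before it.
    Hypothetical helper, not part of the repository.
    """
    if n_held_out < 0:
        raise ValueError("n_held_out must be non-negative")
    lines = list(lines)
    # Clamp the cut point so short inputs end up entirely in held_out.
    cut = max(len(lines) - n_held_out, 0)
    return lines[:cut], lines[cut:]


# Toy example with placeholder lines standing in for a FLORES-200 section.
flores_section = [f"sentence {i}" for i in range(997)]
extra_train, held_out = split_tail(flores_section, n_held_out=100)
print(len(extra_train), len(held_out))
```

The same cut index has to be applied to the Lombard and the Italian side so that the sentence pairs stay aligned.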

Corpus statistics

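| Split   | N. of pairs | N. of words (LMO) | N. of words (IT) | Avg. sentence length (LMO) | Avg. sentence length (IT) |
|---------|-------------|-------------------|------------------|----------------------------|---------------------------|
| train   | 5306        | 122,550           | 113,385          | 23.10                      | 21.37                     |
| dev     | 997         | 25,531            | 22,984           | 25.61                      | 23.05                     |
| devtest | 1012        | 26,954            | 24,311           | 26.63                      | 24.02                     |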

*stats for original corpus
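The average sentence length in the table equals the number of words divided by the number of pairs. A minimal sketch of how such figures could be computed for one side of the corpus is shown below. Whitespace tokenization is an assumption, since the source does not say how words were counted, and `side_stats` is a hypothetical helper.

```python
def side_stats(sentences):
    """Return (n_sentences, n_words, avg_len) for one language side.

    Words are counted by splitting on whitespace, which is an assumption
    and may differ from the tokenization behind the published figures.
    """
    n_sentences = len(sentences)
    n_words = sum(len(s.split()) for s in sentences)
    avg_len = n_words / n_sentences if n_sentences else 0.0
    return n_sentences, n_words, round(avg_len, 2)


# Placeholder tokens, not real corpus data.
toy_side = ["w1 w2 w3", "w1 w2"]
print(side_stats(toy_side))  # (2, 5, 2.5)

# Consistency check against the train row of the table (LMO side).
print(round(122550 / 5306, 2))  # 23.1
```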

If you use this work, please cite it as:

Signoroni, E. (2022). Piötòst Ché Niènt, Mèi Piötòst-A Manually Revised Lombard-Italian Parallel Corpus. RASLAN 2022 Recent Advances in Slavonic Natural Language Processing, 105.

@article{signoroni2022piotost,
  title={Pi{\"o}t{\`o}st Ch{\'e} Ni{\`e}nt, M{\`e}i Pi{\"o}t{\`o}st-A Manually Revised Lombard-Italian Parallel Corpus},
  author={Signoroni, Edoardo},
  journal={RASLAN 2022 Recent Advances in Slavonic Natural Language Processing},
  pages={105},
  year={2022}
}
