ParTUT is a morpho-syntactically annotated collection of Italian, French, and English parallel sentences. It includes texts from different sources, representing different genres and domains, and is released in several formats. See also http://www.di.unito.it/~tutreeb/treebanks.html.
ParTUT comprises approximately 167,000 tokens, with an average of 2,100 sentences per language. The texts currently in the collection were gathered from a variety of sources and domains:
- the Creative Commons open license;
- the DGT-Translation Memory;
- the Europarl parallel corpus (section ep_00_01_17);
- publicly available pages from the Facebook website;
- the JRC-Acquis multilingual parallel corpus (section jrc52006DC243);
- several articles from Project Syndicate©;
- the Universal Declaration of Human Rights;
- Wikipedia articles retrieved from the English section and translated into Italian only, by graduate students in Translation Studies [ABSENT in the French section];
- the Web Inventory of Translated Talks.
Since release 2.0, ParTUT is also available in the Universal Dependencies format for English, French, and Italian.
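For readers who want to check sentence and token counts against a Universal Dependencies release, the following is a minimal sketch, not part of the ParTUT distribution. It assumes the standard CoNLL-U layout (tab-separated columns, `#` comment lines, blank lines between sentences) and counts syntactic words only, skipping multiword-token ranges and empty nodes. The function name `count_conllu` and the inline sample are hypothetical and for illustration only.

```python
import io


def count_conllu(stream):
    """Return (sentences, words) for a CoNLL-U text stream.

    Only lines with a plain integer ID are counted as words;
    multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
    are skipped.
    """
    sentences = 0
    words = 0
    in_sentence = False
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment (sent_id, text, ...).
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            continue
        words += 1
        in_sentence = True
    # Handle a final sentence that is not followed by a blank line.
    if in_sentence:
        sentences += 1
    return sentences, words


# Hypothetical two-sentence sample, not taken from ParTUT.
SAMPLE = (
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n"
    "\n"
    "# text = Del resto\n"
    "1-2\tDel\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tDi\tdi\tADP\t_\t_\t3\tcase\t_\t_\n"
    "2\til\til\tDET\t_\t_\t3\tdet\t_\t_\n"
    "3\tresto\tresto\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(count_conllu(io.StringIO(SAMPLE)))  # (2, 5)
```

To run this on a real file, open it with `open(path, encoding="utf-8")` and pass the file object to `count_conllu`. Note that counts obtained this way may differ from the figures quoted above, depending on whether surface tokens or syntactic words are counted.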
If you use the resource, please cite:
- Manuela Sanguinetti, Cristina Bosco. 2014. PartTUT: The Turin University Parallel Treebank. In Basili, Bosco, Delmonte, Moschitti, Simi (editors), Harmonization and development of resources and tools for Italian Natural Language Processing within the PARLI project, LNCS, Springer Verlag.