Segmentation data used in multilingual alignment tasks across English, French, Spanish, and other languages. Includes raw and segmented text files for training and evaluation.



✂️ Multilingual Segmentation Dataset

From manuscripts to models: a multilingual corpus for sentence segmentation in historical prose.

This dataset gathers carefully segmented excerpts from a wide range of textual genres (narrative, didactic, legal, theological, and scholarly prose) spanning seven medieval languages of the 13th–16th centuries: five Romance vernaculars plus Latin and English.
Segment boundaries reflect both historical syntax and editorial conventions, making the corpus suitable for training and evaluating sentence segmentation models, as well as for cross-linguistic and diachronic analysis in NLP and digital philology.

📚 Documentation

📖 Overview

This dataset was developed to train a multilingual sentence segmentation model, used as a pre-processing step in the automatic alignment of historical texts with Aquilign, a multilingual alignment tool developed by our team.
Once the BERT-based models are trained and selected, they are integrated into the alignment workflow to segment texts based on learned boundary recognition — a critical step preceding alignment itself.

The segmented excerpts serve as input for Aquilign, enabling multilingual alignment across structurally and editorially diverse texts.
A first study applying this pipeline — focused on Lancelot en prose — was presented in the 2024 article Textual Transmission without Borders, published in the Computational Humanities Research (CHR) conference proceedings.

As the project evolved, the segmentation corpus was gradually expanded alongside the tool. Initially limited to three Romance languages — Castilian (es), French (fr), and Italian (it) — it was later enriched with Portuguese (pt), Catalan (ca), Latin (la), and English (en), thereby increasing linguistic diversity and strengthening the robustness of cross-linguistic alignment.

The corpus provides training and evaluation material for sentence-level segmentation in historical prose from the 13th to 16th centuries.
Texts were selected for their genre diversity and their ability to reflect editorial, orthographic, and linguistic variation across time, geography, and scribal practices.

To support reproducibility and multilingual evaluation, the dataset is structured by language.
Segmented data are stored under data/segmented/, with language-specific files organized as follows:

  • data/segmented/pre_split/<lang>/ — complete segmented lines per language
  • data/segmented/split/monolingual/<lang>/ — train/dev/test JSON and TXT files
  • data/segmented/split/multilingual/ — multilingual train/dev/test splits
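A minimal sketch of how the per-language files under data/segmented/pre_split/ might be read, assuming plain-text files with one segmented line each (the .txt extension and file naming are assumptions; adjust to the actual repository contents):

```python
from pathlib import Path

def load_segmented_lines(root, lang):
    """Collect segmented lines from every .txt file for one language.

    `root` is the pre_split directory (e.g. "data/segmented/pre_split")
    and `lang` a language code such as "fr". Files are read in sorted
    order so results are reproducible across runs.
    """
    lines = []
    for path in sorted(Path(root, lang).glob("*.txt")):
        lines.extend(path.read_text(encoding="utf-8").splitlines())
    return lines
```

The same pattern applies to the monolingual split directories, swapping the root path accordingly.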

🧾 Summary

Category         Details
Languages        Latin (la), French (fr), English (en), Portuguese (pt), Catalan (ca), Italian (it), Castilian (es)
Period covered   13th–16th centuries
Text formats     Plain text (TXT) and XML, with some material converted from HTML or PDF
Segmentation     Manual sentence segmentation using language-specific criteria
License          CC BY-NC-SA 4.0 (annotations and segmentation metadata only)

🎯 Purpose

This dataset aims to support the training of machine learning models that can detect sentence and segment boundaries in non-standardized historical texts.

Reliable segmentation is essential for:

  • downstream NLP tasks such as parsing, translation, and alignment,
  • enhancing the accessibility and reusability of medieval sources,
  • enabling cross-linguistic comparison and advancing philological and historical-linguistic research.

📄 For full segmentation principles, see the detailed Segmentation Guidelines.

➡️ For model training instructions, architecture, and evaluation, see
Model Documentation.

🔄 Processing Pipeline

The segmentation pipeline involves the following steps, from raw historical texts to segmented training data.

Processing pipeline

See segmentation pipeline documentation for full details on each step.

🌐 Data Collection Variability Across Languages

📦 For notes on text acquisition, sourcing variation, and metadata standardization, see
➡️ docs/data_collection_notes.md

📊 Corpus Size

The figures below correspond to the most recent version of the corpus; older versions can be consulted in the release tags. The current version includes segmented excerpts in seven historical languages, prepared for sentence segmentation tasks.

Each excerpt is annotated with the pound sterling sign (£) to mark segment boundaries, typically corresponding to sentences or syntactic units.
The corpus does not include part-of-speech tagging or syntactic annotation — only sentence-level segmentation.

Language     Tokens     Segments (£)   Avg. Tokens per Segment
Latin        68,058     13,387         5.08
French       80,907     12,168         6.65
Castilian    72,750     11,605         6.27
Portuguese   54,201     10,477         5.17
Catalan      49,891      7,983         6.25
Italian      50,943      7,783         6.55
English      36,138      6,107         5.92
Total       412,888     69,510         5.94

🗒️ Legend:

  • Tokens: Total number of tokens (excluding the £ symbol and punctuation).
  • Segments (£): Number of segment boundaries marked by £.
  • Avg. Tokens per Segment: Average number of tokens per segment (Tokens ÷ Segments).

* Note: The total average is the overall tokens divided by the total number of segments, not the average of column averages.

ℹ️ This corpus focuses on sentence segmentation only. It does not include POS tagging, syntactic trees, or named entity annotations.
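The legend's arithmetic can be sketched as follows. This is a naive illustration, not the project's actual tooling: tokenization here is a simple whitespace split, whereas the corpus statistics may use a more refined tokenizer.

```python
import string

def corpus_stats(text):
    """Compute (tokens, segments, avg tokens per segment) for a
    £-annotated excerpt: tokens exclude the £ marker and pure
    punctuation; each £ counts as one segment boundary."""
    tokens = [
        t for t in text.split()
        if t != "£" and not all(ch in string.punctuation for ch in t)
    ]
    segments = text.count("£")
    avg = len(tokens) / segments if segments else 0.0
    return len(tokens), segments, round(avg, 2)
```

For example, `corpus_stats("Et quant il vint £ si parla £")` yields 6 tokens, 2 segments, and an average of 3.0 tokens per segment.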

📂 Data Location

The most up-to-date segmented data are stored in the repository under:

data/segmented/
  • This folder contains the current working version of the segmented texts.
  • For frozen snapshots corresponding to published versions (e.g. baseline, augmented), please refer to the release tags.

🙏 Credits

We gratefully acknowledge the following scholars for their contributions of source material or expertise:

  • Peter Stokes & Mark Faulkner – Guidance on available Middle English corpora
  • Sadurní Martí – Support in identifying Medieval Catalan corpora
  • Andrea Menozzi – Insights into available Medieval Italian corpora

🚧 Project Status

This corpus is part of an ongoing project. While it is already being used for segmentation and alignment tasks, further improvements, refinements, and corrections are expected.
We welcome feedback, error reports, and contributions to help improve the resource over time.

Please note:

  • Some segmentations may be revised in future updates.
  • Metadata and annotations are subject to enhancement.
  • Additional languages and texts will be added as the project evolves.

🔗 Related Projects

This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:

  • Aquilign
    A clause-level multilingual alignment engine based on contextual embeddings (LaBSE), designed specifically for premodern texts.

  • Corpus Temporis App
    A Streamlit-based application for managing and structuring metadata of medieval multilingual texts.
    It provides the metadata that accompanies this dataset and supports its use in the Aquilign multilingual aligner.

🔮 Future Directions

  • Extend language coverage
  • Evaluate segmentation models
  • Broaden genre and period diversity
  • Encourage interdisciplinary use

📫 Contact & Contributions


📚 How to Cite this Dataset

Please cite as:

APA
Ing, L., Gille Levenson, M., & Macedo, C. (2025). Multilingual Segmentation Dataset for Historical Prose (13th–16th c.) (Version 1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.16992629

BibTeX

@dataset{ing2025multilingual,
  author       = {Ing, L. and Gille Levenson, M. and Macedo, C.},
  title        = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
  year         = {2025},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.16992629},
  url          = {https://doi.org/10.5281/zenodo.16992629},
  license      = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}

🧾 Talks & Slides

🎤 Colloque Langues et Langage à la croisée des Disciplines (LLcD 2025)

Training Sentence Segmenters on Medieval Languages
👥 Lucence Ing, Matthias Gille Levenson, Carolina Macedo
📽️ View presentation slides (PDF)

💰 Funding

This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).

Ce travail a bénéficié d'une aide de l’État gérée par l’Agence Nationale de la Recherche au titre du programme d’Investissements d’avenir portant la référence ANR-21-ESRE-0005 (Biblissima+).

Biblissima+ Logo

📄 Licensing

All annotations, segmentations, and metadata are released under CC BY-NC-SA 4.0.

⚠️ Original textual content may be subject to source-specific licenses. Refer to the sources and corpus columns in the metadata CSV.

Jump to compiled data CSV ⤵️
