Segmentation data used in multilingual alignment tasks across English, French, Spanish, and other languages. Includes raw and segmented text files for training and evaluation.



✂️ Multilingual Segmentation Dataset

From manuscripts to models: a multilingual corpus for sentence segmentation in historical prose.

This dataset gathers carefully segmented excerpts from a wide range of textual genres (narrative, didactic, legal, theological, and scholarly prose) spanning seven medieval languages of the 13th–16th centuries: five Romance vernaculars plus Latin and English.
Segment boundaries reflect both historical syntax and editorial conventions, making the corpus suitable for training and evaluating sentence segmentation models, as well as for cross-linguistic and diachronic analysis in NLP and digital philology.

📚 Documentation

📖 Overview

This dataset was developed to train a multilingual sentence segmentation model, used as a pre-processing step in the automatic alignment of historical texts with Aquilign, a multilingual alignment tool developed by our team.
Once the BERT-based models are trained and selected, they are integrated into the alignment workflow to segment texts based on learned boundary recognition — a critical step preceding alignment itself.

The segmented excerpts serve as input for Aquilign, enabling multilingual alignment across structurally and editorially diverse texts.
A first study applying this pipeline — focused on Lancelot en prose — was presented in the 2024 article Textual Transmission without Borders, published in the Computational Humanities Research (CHR) conference proceedings.

As the project evolved, the segmentation corpus was gradually expanded alongside the tool. Initially limited to three Romance languages — Castilian (es), French (fr), and Italian (it) — it was later enriched with Portuguese (pt), Catalan (ca), Latin (la), and English (en), thereby increasing linguistic diversity and strengthening the robustness of cross-linguistic alignment.

The corpus provides training and evaluation material for sentence-level segmentation in historical prose from the 13th to 16th centuries.
Texts were selected for their genre diversity and their ability to reflect editorial, orthographic, and linguistic variation across time, geography, and scribal practices.

To support reproducibility and multilingual evaluation, the dataset is structured by language.
Segmented data are stored under data/segmented/, with language-specific files organized as follows:

  • data/segmented/pre_split/<lang>/ — complete segmented lines per language
  • data/segmented/split/monolingual/<lang>/ — train/dev/test JSON and TXT files
  • data/segmented/split/multilingual/ — multilingual train/dev/test splits
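A minimal sketch of how the per-language files under data/segmented/pre_split/ might be read, assuming plain-text files with one segmented line each (the .txt extension and file naming are assumptions; adjust to the actual repository contents):

```python
from pathlib import Path

def load_segmented_lines(root, lang):
    """Collect segmented lines from every .txt file for one language.

    `root` is the pre_split directory (e.g. "data/segmented/pre_split")
    and `lang` a language code such as "fr". Files are read in sorted
    order so results are reproducible across runs.
    """
    lines = []
    for path in sorted(Path(root, lang).glob("*.txt")):
        lines.extend(path.read_text(encoding="utf-8").splitlines())
    return lines
```

The same pattern applies to the monolingual split directories, swapping the root path accordingly.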

🧾 Summary

Category         Details
Languages        Latin (la), French (fr), English (en), Portuguese (pt), Catalan (ca), Italian (it), Castilian (es)
Period covered   13th–16th centuries
Text formats     Plain text (TXT) and XML, with some material converted from HTML or PDF
Segmentation     Manual sentence segmentation using language-specific criteria
License          CC BY-NC-SA 4.0 (annotations and segmentation metadata only)

🎯 Purpose

This dataset aims to support the training of machine learning models that can detect sentence and segment boundaries in non-standardized historical texts.

Reliable segmentation is essential for:

  • downstream NLP tasks such as parsing, translation, and alignment,
  • enhancing the accessibility and reusability of medieval sources,
  • enabling cross-linguistic comparison and advancing philological and historical-linguistic research.

📄 For full segmentation principles, see the detailed Segmentation Guidelines.

➡️ For model training instructions, architecture, and evaluation, see
Model Documentation.

🔄 Processing Pipeline

The segmentation pipeline involves the following steps, from raw historical texts to segmented training data.

Processing pipeline

See segmentation pipeline documentation for full details on each step.

🌐 Data Collection Variability Across Languages

📦 For notes on text acquisition, sourcing variation, and metadata standardization, see
➡️ docs/data_collection_notes.md

📊 Corpus Size

The figures below correspond to the most recent version of the corpus; older versions can be consulted in the release tags. The current version includes segmented excerpts in seven historical languages, prepared for sentence segmentation tasks.

Each excerpt is annotated with the pound sterling sign (£) to mark segment boundaries, typically corresponding to sentences or syntactic units.
The corpus does not include part-of-speech tagging or syntactic annotation — only sentence-level segmentation.

Language     Tokens     Segments (£)   Avg. Tokens per Segment
Latin        68,058     13,387         5.08
French       80,907     12,168         6.65
Castilian    72,750     11,605         6.27
Portuguese   54,201     10,477         5.17
Catalan      49,891      7,983         6.25
Italian      50,943      7,783         6.55
English      36,138      6,107         5.92
Total       412,888     69,510         5.94

🗒️ Legend:

  • Tokens: Total number of tokens (excluding the £ symbol and punctuation).
  • Segments (£): Number of segment boundaries marked by £.
  • Avg. Tokens per Segment: Average number of tokens per segment (Tokens ÷ Segments).

* Note: The total average is the overall tokens divided by the total number of segments, not the average of column averages.

ℹ️ This corpus focuses on sentence segmentation only. It does not include POS tagging, syntactic trees, or named entity annotations.
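The legend's arithmetic can be sketched as follows. This is a naive illustration, not the project's actual tooling: tokenization here is a simple whitespace split, whereas the corpus statistics may use a more refined tokenizer.

```python
import string

def corpus_stats(text):
    """Compute (tokens, segments, avg tokens per segment) for a
    £-annotated excerpt: tokens exclude the £ marker and pure
    punctuation; each £ counts as one segment boundary."""
    tokens = [
        t for t in text.split()
        if t != "£" and not all(ch in string.punctuation for ch in t)
    ]
    segments = text.count("£")
    avg = len(tokens) / segments if segments else 0.0
    return len(tokens), segments, round(avg, 2)
```

For example, `corpus_stats("Et quant il vint £ si parla £")` yields 6 tokens, 2 segments, and an average of 3.0 tokens per segment.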

📂 Data Location

The most up-to-date segmented data are stored in the repository under:

data/segmented/
  • This folder contains the current working version of the segmented texts.
  • For frozen snapshots corresponding to published versions (e.g. baseline, augmented), please refer to the release tags.

🙏 Credits

We gratefully acknowledge the following scholars for their contributions of source material or expertise:

  • Peter Stokes & Mark Faulkner – Guidance on available Middle English corpora
  • Sadurní Martí – Support in identifying Medieval Catalan corpora
  • Andrea Menozzi – Insights into available Medieval Italian corpora

🚧 Project Status

This corpus is part of an ongoing project. While it is already being used for segmentation and alignment tasks, further improvements, refinements, and corrections are expected.
We welcome feedback, error reports, and contributions to help improve the resource over time.

Please note:

  • Some segmentations may be revised in future updates.
  • Metadata and annotations are subject to enhancement.
  • Additional languages and texts will be added as the project evolves.

🔗 Related Projects

This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:

  • Aquilign
    A clause-level multilingual alignment engine based on contextual embeddings (LaBSE), designed specifically for premodern texts.

  • Corpus Temporis App
    A Streamlit-based application for managing and structuring metadata of medieval multilingual texts.
    It provides the metadata that accompanies this dataset and supports its use in the Aquilign multilingual aligner.

🔮 Future Directions

  • Extend language coverage
  • Evaluate segmentation models
  • Broaden genre and period diversity
  • Encourage interdisciplinary use

📫 Contact & Contributions


📚 How to Cite this Dataset

Please cite as:

APA
Ing, L., Gille Levenson, M., & Macedo, C. (2025). Multilingual Segmentation Dataset for Historical Prose (13th–16th c.) (Version 1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.16992629

BibTeX

@dataset{ing2025multilingual,
  author       = {Ing, L. and Gille Levenson, M. and Macedo, C.},
  title        = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
  year         = {2025},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.16992629},
  url          = {https://doi.org/10.5281/zenodo.16992629},
  license      = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}

🧾 Talks & Slides

🎤 Colloque Langues et Langage à la croisée des Disciplines (LLcD 2025)

Training Sentence Segmenters on Medieval Languages
👥 Lucence Ing, Matthias Gille Levenson, Carolina Macedo
📽️ View presentation slides (PDF)

💰 Funding

This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).

Ce travail a bénéficié d'une aide de l’État gérée par l’Agence Nationale de la Recherche au titre du programme d’Investissements d’avenir portant la référence ANR-21-ESRE-0005 (Biblissima+).

Biblissima+ Logo

📄 Licensing

All annotations, segmentations, and metadata are released under CC BY-NC-SA 4.0.

⚠️ Original textual content may be subject to source-specific licenses. Refer to the sources and corpus columns in the metadata CSV.

Jump to compiled data CSV ⤵️
