From manuscripts to models: a multilingual corpus for sentence segmentation in historical prose.
This dataset gathers carefully segmented excerpts from a wide range of prose genres (narrative, didactic, legal, theological, and scholarly) in seven languages: Latin, five Romance languages, and English (13th–16th c.).
Segment boundaries reflect both historical syntax and editorial conventions, making the corpus suitable for training and evaluating sentence segmentation models, as well as for cross-linguistic and diachronic analysis in NLP and digital philology.
- ✂️ Segmentation Criteria ➡️ docs/annotation_guidelines/segmentation_criteria_en.md
- 🧪 Model Architecture & Training ➡️ docs/segmentation_model.md
- 🔧 Processing Pipeline (Raw → Segmented) ➡️ docs/segmentation_processing_pipeline.md
- 🧾 Annotated Examples ➡️ docs/segmentation_exemples.md
- 🌍 Data Collection & Source Tracking ➡️ docs/data_collection_and_source_tracking.md
- 🔤 Delimiter Configuration (per language) ➡️ docs/annotation_guidelines/main-word-delimiters.json
This dataset was developed to train a multilingual sentence segmentation model, used as a pre-processing step in the automatic alignment of historical texts with Aquilign, a multilingual alignment tool developed by our team.
Once the BERT-based models are trained and selected, they are integrated into the alignment workflow to segment texts based on learned boundary recognition — a critical step preceding alignment itself.
The segmented excerpts serve as input for Aquilign, enabling multilingual alignment across structurally and editorially diverse texts.
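As a rough illustration of how predicted boundaries could be turned into the segments passed on to the aligner, the sketch below groups tokens according to per-token labels. The function name `apply_boundaries` and the labeling convention (1 marks the first token of a new segment) are assumptions made for this example, not Aquilign's actual interface.

```python
from typing import List


def apply_boundaries(tokens: List[str], labels: List[int]) -> List[str]:
    """Group tokens into segments using per-token boundary labels.

    Assumed convention (hypothetical): labels[i] == 1 means that
    tokens[i] opens a new segment; 0 means it continues the current one.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    segments: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        # Close the running segment when a new one starts.
        if label == 1 and current:
            segments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments
```

A label sequence such as `[0, 0, 1, 0]` over four tokens would yield two segments of two tokens each; a real model's output format may differ and would need its own adapter.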
A first study applying this pipeline — focused on Lancelot en prose — was presented in the 2024 article Textual Transmission without Borders, published in the Computational Humanities Research (CHR) conference proceedings.
As the project evolved, the segmentation corpus was gradually expanded alongside the tool. Initially limited to three Romance languages, Castilian (es), French (fr), and Italian (it), it was later enriched with Portuguese (pt), Catalan (ca), Latin (la), and English (en), thereby increasing linguistic diversity and strengthening the robustness of cross-linguistic alignment.
The corpus provides training and evaluation material for sentence-level segmentation in historical prose from the 13th to 16th centuries.
Texts were selected for their genre diversity and their ability to reflect editorial, orthographic, and linguistic variation across time, geography, and scribal practices.
To support reproducibility and multilingual evaluation, the dataset is structured by language.
Segmented data are stored under data/segmented/, with language-specific files organized as follows:
- data/segmented/pre_split/<lang>/: complete segmented lines per language
- data/segmented/split/monolingual/<lang>/: train/dev/test JSON and TXT files
- data/segmented/split/multilingual/: multilingual train/dev/test splits
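The layout above can be navigated programmatically. The following is a minimal sketch, assuming a local clone of the repository; the helper `segmented_dirs` is hypothetical, and because the file names inside each directory are not specified here, it only builds directory paths.

```python
from pathlib import Path
from typing import Dict

# Language codes listed in this README.
LANGS = ("la", "fr", "en", "pt", "ca", "it", "es")


def segmented_dirs(repo_root: str, lang: str) -> Dict[str, Path]:
    """Return the segmented-data directories for one language.

    Hypothetical helper: it mirrors the directory layout described
    above and does not check that the directories exist.
    """
    if lang not in LANGS:
        raise ValueError(f"unknown language code: {lang!r}")
    base = Path(repo_root) / "data" / "segmented"
    return {
        "pre_split": base / "pre_split" / lang,
        "monolingual": base / "split" / "monolingual" / lang,
        "multilingual": base / "split" / "multilingual",
    }
```

For example, `segmented_dirs(".", "fr")["monolingual"]` points at the French train/dev/test directory relative to the repository root.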
| Category | Details |
|---|---|
| Languages | Latin (la), French (fr), English (en), Portuguese (pt), Catalan (ca), Italian (it), Castilian (es) |
| Period Covered | 13th–16th centuries |
| Text Formats | Plain text (TXT), XML, with some material converted from HTML or PDF |
| Segmentation | Manual sentence segmentation using language-specific criteria |
| License | CC BY-NC-SA 4.0 – annotations and segmentation metadata only |
This dataset aims to support the training of machine learning models that can detect sentence and segment boundaries in non-standardized historical texts.
Reliable segmentation is essential for:
- downstream NLP tasks such as parsing, translation, and alignment,
- enhancing the accessibility and reusability of medieval sources,
- enabling cross-linguistic comparison and advancing philological and historical-linguistic research.
📄 For full segmentation principles, see the detailed Segmentation Guidelines.
➡️ For model training instructions, architecture, and evaluation, see Model Documentation.
The segmentation pipeline runs from raw historical texts to segmented training data.
See the segmentation pipeline documentation (docs/segmentation_processing_pipeline.md) for full details on each step.
📦 For notes on text acquisition, sourcing variation, and metadata standardization, see
➡️ docs/data_collection_notes.md
The figures below describe the size of the most recent version of the corpus.
Older versions can be consulted via the release tags.
The current version of the corpus includes segmented excerpts in seven historical languages, prepared for sentence segmentation tasks.
Each excerpt is annotated using the pound sign (£) to mark segment boundaries, typically corresponding to sentences or syntactic units.
The corpus does not include part-of-speech tagging or syntactic annotation — only sentence-level segmentation.
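To make the annotation format concrete, here is a minimal sketch of reading a £-annotated excerpt. It assumes that £ acts as a separator between segments and that tokens are whitespace-separated items containing at least one word character; both the helper names and these tokenization choices are assumptions for illustration, not the project's actual tooling. The sample string is invented and not taken from the corpus.

```python
import re
from typing import List

DELIMITER = "£"


def split_segments(annotated: str) -> List[str]:
    """Split a £-annotated excerpt into stripped, non-empty segments."""
    return [part.strip() for part in annotated.split(DELIMITER) if part.strip()]


def count_tokens(segment: str) -> int:
    """Count whitespace-separated tokens, skipping bare punctuation.

    Assumption: a token counts only if it contains a word character.
    """
    return sum(1 for tok in segment.split() if re.search(r"\w", tok))


# Invented placeholder text, used only to show the format.
sample = "first unit of text £ second unit , a bit longer £ third unit ."
for segment in split_segments(sample):
    print(count_tokens(segment), segment)
```

Counts produced this way are only indicative, since the tokenization used for the corpus statistics may differ.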
| Language | Tokens | Segments (£) | Avg. Tokens per Segment |
|---|---|---|---|
| Latin | 68,058 | 13,387 | 5.08 |
| French | 80,907 | 12,168 | 6.65 |
| Castilian | 72,750 | 11,605 | 6.27 |
| Portuguese | 54,201 | 10,477 | 5.17 |
| Catalan | 49,891 | 7,983 | 6.25 |
| Italian | 50,943 | 7,783 | 6.55 |
| English | 36,138 | 6,107 | 5.92 |
| Total | 412,888 | 69,510 | 5.94 |
- Tokens: Total number of tokens (excluding the £ symbol and punctuation).
- Segments (£): Number of segment boundaries marked by £.
- Avg. Tokens per Segment: Average number of tokens per segment (Tokens ÷ Segments).
* Note: The total average is the overall tokens divided by the total number of segments, not the average of column averages.
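The distinction in the note can be checked directly from the counts in the table. The sketch below recomputes the per-language averages, the overall ratio, and, for comparison, the unweighted mean of the per-language averages; the variable names are chosen for this example only.

```python
# (tokens, segments) per language, copied from the table above.
COUNTS = {
    "Latin": (68058, 13387),
    "French": (80907, 12168),
    "Castilian": (72750, 11605),
    "Portuguese": (54201, 10477),
    "Catalan": (49891, 7983),
    "Italian": (50943, 7783),
    "English": (36138, 6107),
}

# Average tokens per segment for each language (Tokens / Segments).
per_language = {lang: t / s for lang, (t, s) in COUNTS.items()}

total_tokens = sum(t for t, _ in COUNTS.values())
total_segments = sum(s for _, s in COUNTS.values())

# Overall ratio, as reported in the "Total" row.
overall = total_tokens / total_segments

# Unweighted mean of the per-language averages (not what the table reports).
mean_of_averages = sum(per_language.values()) / len(per_language)

print(f"overall: {overall:.2f}")                    # overall: 5.94
print(f"mean of averages: {mean_of_averages:.2f}")  # mean of averages: 5.98
```

The two values differ because languages with more segments weigh more heavily in the overall ratio.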
ℹ️ This corpus focuses on sentence segmentation only. It does not include POS tagging, syntactic trees, or named entity annotations.
The most up-to-date segmented data are stored in the repository under:
data/segmented/
- This folder contains the current working version of the segmented texts.
- For frozen snapshots corresponding to published versions (e.g. baseline, augmented), please refer to the release tags.
We gratefully acknowledge the following scholars for their contributions of source material or expertise:
- Peter Stokes & Mark Faulkner – Guidance on available Middle English corpora
- Sadurní Martí – Support in identifying Medieval Catalan corpora
- Andrea Menozzi – Insights into available Medieval Italian corpora
This corpus is part of an ongoing project. While it is already being used for segmentation and alignment tasks, further improvements, refinements, and corrections are expected.
We welcome feedback, error reports, and contributions to help improve the resource over time.
Please note:
- Some segmentations may be revised in future updates.
- Metadata and annotations are subject to enhancement.
- Additional languages and texts will be added as the project evolves.
This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:
- Aquilign: A clause-level multilingual alignment engine based on contextual embeddings (LaBSE), designed specifically for premodern texts.
- Corpus Temporis App: A Streamlit-based application for managing and structuring metadata of medieval multilingual texts. It provides the metadata that accompanies this dataset and supports its use in the Aquilign multilingual aligner.
- Extend language coverage
- Evaluate segmentation models
- Broaden genre and period diversity
- Encourage interdisciplinary use
- For academic collaboration, please reach out via GitHub Discussions.
Please cite as:
APA
Ing, L., Gille Levenson, M., & Macedo, C. (2025). Multilingual Segmentation Dataset for Historical Prose (13th–16th c.) (Version 1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.16992629
BibTeX
@dataset{ing2025multilingual,
author = {Ing, L. and Gille Levenson, M. and Macedo, C.},
title = {Multilingual Segmentation Dataset for Historical Prose (13th--16th c.)},
year = {2025},
publisher = {Zenodo},
version = {1.0},
doi = {10.5281/zenodo.16992629},
url = {https://doi.org/10.5281/zenodo.16992629},
license = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International}
}
Training Sentence Segmenters on Medieval Languages
👥 Lucence Ing, Matthias Gille Levenson, Carolina Macedo
📽️ View presentation slides (PDF)
This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).
Ce travail a bénéficié d'une aide de l’État gérée par l’Agence Nationale de la Recherche au titre du programme d’Investissements d’avenir portant la référence ANR-21-ESRE-0005 (Biblissima+).
All annotations, segmentations, and metadata are released under CC BY-NC-SA 4.0.
⚠️ Original textual content may be subject to source-specific licenses. Refer to the sources and corpus columns in the metadata CSV.