CREMMA - Wikipedia

Description

The CREMMA Wikipedia project aims to create a collection of ground-truth data for training handwritten text recognition (HTR) models on contemporary French handwriting.

Each image shows an excerpt from a randomly selected Wikipedia page, copied by hand by volunteers. We then aligned the handwritten portion with the original text, which also appears on the image.

Transcription guidelines

The transcription guidelines follow CREMMA's convention for modern documents. In short:

Superscript is preceded by a ^.
Strikethrough elements are transcribed with
- >< when unreadable,
- >word< when readable.
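
To make these conventions concrete, here is a minimal Python sketch that detects the markers in a transcribed line and optionally strips them. It is not part of the CREMMA tooling. The function names `describe` and `strip_markers` are hypothetical, and the sketch assumes that `^` applies to the run of word characters that follows it and that a struck-through span never contains `<` or `>` itself.

```python
import re

# Assumption: a struck-through span never contains '<' or '>' itself.
STRIKE = re.compile(r">([^<>]*)<")
# Assumption: '^' applies to the run of word characters that follows it.
SUPERSCRIPT = re.compile(r"\^(\w+)")


def describe(line: str) -> dict:
    """Return the guideline markers found in one transcribed line."""
    struck = STRIKE.findall(line)
    return {
        # ">word<": struck through but still readable
        "readable_strikes": [s for s in struck if s],
        # "><": struck through and unreadable
        "unreadable_strikes": sum(1 for s in struck if not s),
        "superscripts": SUPERSCRIPT.findall(line),
    }


def strip_markers(line: str) -> str:
    """Drop struck-through spans and the '^' marker, keeping superscript text inline."""
    without_strikes = STRIKE.sub("", line)
    without_carets = SUPERSCRIPT.sub(r"\1", without_strikes)
    # Collapse the double spaces left behind by removed spans.
    return re.sub(r"\s{2,}", " ", without_carets).strip()
```

For example, `describe("Le 1^er >jour< mois de >< l'année")` reports one readable strike (`jour`), one unreadable strike, and one superscript (`er`). Whether struck-through text should be kept or dropped depends on the downstream task, so `strip_markers` is only one possible choice.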

The text to copy may have included phonetic transcriptions. Non-French letters and diacritics were transcribed as well. See characters.csv for the list of characters used in this dataset. The character set can be normalized using ChocoMufin.
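
As an illustration of what a character inventory involves, the following sketch counts the characters in a set of transcribed lines using only the Python standard library. It does not use ChocoMufin and makes no assumption about the column layout of characters.csv. The function names are hypothetical, and NFC normalization is an assumption chosen here so that precomposed and decomposed diacritics are counted together.

```python
import unicodedata
from collections import Counter


def character_inventory(lines):
    """Count every non-whitespace character after NFC normalization."""
    counts = Counter()
    for line in lines:
        # Assumption: NFC is used so "e" + combining acute counts as "é".
        normalized = unicodedata.normalize("NFC", line)
        counts.update(ch for ch in normalized if not ch.isspace())
    return counts


def describe_character(ch):
    """Return a code point label and the Unicode name, if one exists."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")
```

Calling `character_inventory` on your own transcriptions and passing each key to `describe_character` gives a quick way to spot unexpected characters, such as phonetic symbols, before training a model.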

Related tools

wikicremma: file generator for the CREMMA-Wikipedia corpus
cremmawiki-anonymizer: anonymized image generator for the CREMMA-Wikipedia corpus

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

