The CREMMA WIKIPEDIA project aims to create a ground-truth collection for training handwritten text recognition (HTR) models on contemporary French handwriting.
Each image shows an excerpt from a randomly selected Wikipedia page, copied by hand by volunteers. We then aligned the handwritten portion with the original text, which is also present on the image.
The transcription guidelines follow CREMMA's convention for modern documents. In short:
- Superscript is preceded by `^`.
- Strikethrough elements are transcribed as `><` when unreadable and `>word<` when readable.
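
The two conventions above can be handled programmatically when post-processing transcriptions. The following is a minimal sketch, not part of the CREMMA tooling: the function names `parse_markers` and `strip_markers` are hypothetical, and it assumes that `^` applies to the run of word characters immediately following it, which the guidelines summarized here do not spell out.

```python
import re

# Assumption: '^' marks the run of word characters that follows it as superscript.
SUPERSCRIPT = re.compile(r"\^(\w+)")
# '><' is an unreadable strikethrough; '>word<' is a readable one.
STRIKETHROUGH = re.compile(r">([^<>]*)<")


def parse_markers(line: str) -> dict:
    """Collect superscript and strikethrough annotations from one transcribed line."""
    struck = STRIKETHROUGH.findall(line)
    return {
        "superscripts": SUPERSCRIPT.findall(line),
        "struck_readable": [s for s in struck if s],
        "struck_unreadable": sum(1 for s in struck if not s),
    }


def strip_markers(line: str) -> str:
    """Return the text without struck spans and without the caret marker."""
    without_struck = STRIKETHROUGH.sub("", line)
    without_caret = SUPERSCRIPT.sub(r"\1", without_struck)
    # Collapse doubled spaces left behind by removed spans.
    return re.sub(r" {2,}", " ", without_caret).strip()
```

For instance, `strip_markers("Le 1^er janvier, il >elle< partit >< tôt")` drops both struck spans and the caret, which may be useful for comparing a transcription against the original Wikipedia text. Whether struck words should be kept or removed depends on the downstream task.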
The text to copy may have included phonetic transcription. Non-French letters and diacritics were transcribed as well. See characters.csv for the list of characters used in this dataset. The character set can be normalized using ChocoMufin.
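
To check which characters a set of transcriptions actually contains, an inventory can be built directly from the transcribed lines. This is a rough sketch using only the Python standard library. It does not read characters.csv (whose column layout is not described here) and does not use ChocoMufin; the function names `character_inventory` and `describe` are hypothetical.

```python
import unicodedata
from collections import Counter


def character_inventory(lines):
    """Count every non-whitespace character across the transcribed lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if not ch.isspace())
    return counts


def describe(counts):
    """Return (character, code point, Unicode name, count) rows sorted by code point."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in sorted(counts.items())
    ]
```

Listing code points and Unicode names alongside the counts makes phonetic symbols and unusual diacritics easy to spot before deciding on a normalization table.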
- wikicremma: file generator for the CREMMA-Wikipedia corpus
- cremmawiki-anonymizer: anonymized image generator for the CREMMA-Wikipedia corpus
This work is licensed under a Creative Commons Attribution 4.0 International License.