TEI17

This repository contains layout analysis and OCR output for 17th-century books, encoded as TEI files, together with their ODD.

Production

These files were created with the following pipeline:

Segmentation and transcription with eScriptorium, using models from the datasetsOCRSegmenter17 GitHub repository
Manual correction of the ALTO4 files exported from eScriptorium
A Python script pipeline that transforms those ALTO4 files into a single TEI file (see the Extractor repository) and adds metadata extracted from the IIIF manifest and from SPARQL queries to data.bnf.fr
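
The ALTO4 reading step can be pictured with the minimal sketch below. It is not the code from the Extractor repository: the function name `extract_lines`, the sample document and its line IDs are hypothetical, and it assumes that each `TextLine` carries a `BASELINE` attribute of space-separated integer coordinates, which may not match every eScriptorium export.

```python
import xml.etree.ElementTree as ET

# ALTO 4 namespace, used to address elements in the export.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Invented sample, reduced to the elements used below (not taken from the corpus).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="page_1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1" BASELINE="100 220 900 224">
            <String CONTENT="first line of text"/>
          </TextLine>
          <TextLine ID="line_2" BASELINE="100 300 900 303">
            <String CONTENT="second line of text"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def extract_lines(alto_xml: str) -> list[dict]:
    """Return one dict per TextLine with its ID, text and baseline points."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # A line may hold several String elements; join their CONTENT values.
        text = " ".join(
            s.get("CONTENT", "") for s in text_line.iterfind("alto:String", ALTO_NS)
        )
        # Assumption: BASELINE is "x1 y1 x2 y2 ..." with integer pixel values.
        coords = [int(v) for v in text_line.get("BASELINE", "").split()]
        baseline = list(zip(coords[0::2], coords[1::2]))
        lines.append({"id": text_line.get("ID"), "text": text, "baseline": baseline})
    return lines


for line in extract_lines(SAMPLE_ALTO):
    print(line["id"], line["baseline"], line["text"])
```

Each returned dict holds what a later step would need to write a line into the TEI file: an identifier, the transcription, and the baseline in pixel coordinates.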

How the TEI file is built

The TEI file follows the TEI All documentation as closely as possible.

It contains:

teiHeader: all the metadata retrieved from the IIIF manifest and the SPARQL queries, information about the encoding (use of the SegmOnto vocabulary), and information about the book's printer(s)
facsimile: all the layout information about the zones, lines, and baselines, with pixel coordinates and links to the IIIF images
text: the full transcription, linked to the corresponding lines
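
The link between the transcription and the layout can be illustrated with the sketch below. The sample TEI is invented and only assumes one plausible encoding: nested `zone` elements typed with SegmOnto names, a `path` for the baseline, and `lb` elements pointing to a line zone through `@facs`. The files in this repository may use different elements, attributes or identifiers, and `lines_with_zones` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample showing the three parts: teiHeader, facsimile, text.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
      <publicationStmt><p>Sample</p></publicationStmt>
      <sourceDesc><p>Sample</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <facsimile>
    <surface xml:id="f1">
      <graphic url="https://example.org/iiif/f1/full/full/0/native.jpg"/>
      <zone xml:id="f1_z1" type="MainZone" points="100,200 900,200 900,400 100,400">
        <zone xml:id="f1_z1_l1" type="DefaultLine" points="100,200 900,200 900,240 100,240">
          <path points="100,230 900,232"/>
        </zone>
      </zone>
    </surface>
  </facsimile>
  <text>
    <body>
      <p><lb facs="#f1_z1_l1"/>first line of text</p>
    </body>
  </text>
</TEI>"""


def lines_with_zones(tei_xml: str) -> list[dict]:
    """Pair each transcribed line with the facsimile zone it points to."""
    root = ET.fromstring(tei_xml)
    # Index every zone (regions and lines alike) by its xml:id.
    zones = {z.get(XML_ID): z for z in root.find(f"{TEI}facsimile").iter(f"{TEI}zone")}
    rows = []
    for lb in root.find(f"{TEI}text").iter(f"{TEI}lb"):
        zone_id = (lb.get("facs") or "").lstrip("#")
        zone = zones.get(zone_id)
        rows.append({
            "zone_id": zone_id,
            "zone_type": zone.get("type") if zone is not None else None,
            "points": zone.get("points") if zone is not None else None,
            # With ElementTree, the text following an empty <lb/> is its tail.
            "text": (lb.tail or "").strip(),
        })
    return rows


for row in lines_with_zones(SAMPLE_TEI):
    print(row)
```

Indexing zones by `xml:id` first keeps the lookup simple: every pointer in the text is resolved with a single dictionary access, and a pointer with no matching zone yields `None` instead of raising an error.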

Credits

Documents have been encoded by Claire Jahan with the help of Simon Gabay, as part of the E-ditiones project.

Contact

Claire Jahan: claire.jahan[at]chartes.psl.eu

Simon Gabay: Simon.Gabay[at]unige.ch

Licence

This repository is licensed under CC-BY.

Cite this repository

Claire Jahan, Simon Gabay. 2021. CORPUS17+ - Corpus of TEI encoded 17th French prints. Paris/Geneva: ENS Paris/UniGE, 2021, https://github.com/Heresta/CORPUS17plus.
