GitHub - AniketGurav/bbz-ocr-train: Ground truth line annotations for the Berliner Börsen-Zeitung

This repository contains annotations for our mixed-typeface, mixed-context (body text and table text) OCR training corpus, built by randomly sampling over 1000 pages from the Berliner Börsen-Zeitung. An effort was made to balance the number of pages set in Fraktur against the number set in Antiqua, as well as the number of lines drawn from each year.

To view/edit annotations you need the following artefacts:

1. images of full pages
2. line segmentation for each page
3. annotations for selected lines (annotations.db)

To obtain 1. and 2., use one of these two:

Mini Version (1 GB). Intended for viewing text annotations and for trying things out.
Full Version (6.8 GB). Intended for editing annotations and exporting training data.

Now unzip annotations.db.zip from this repository, put the extracted database file (annotations.db) inside the images folder, then run:

python -m origami.tool.annotate /path/to/images
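
The unzip-and-copy step can also be scripted. The following is only a minimal sketch using the Python standard library; the function name `install_annotations` and both paths are hypothetical placeholders, not part of origami or of this repository. It extracts whatever the annotation archive contains directly into the images folder, so it makes no assumption about the name of the database file inside.

```python
import zipfile
from pathlib import Path


def install_annotations(zip_path, images_dir):
    """Extract the annotation archive into the images folder.

    Hypothetical helper, not part of origami. Returns the paths of the
    extracted files so the caller can check what was written.
    """
    images_dir = Path(images_dir)
    if not images_dir.is_dir():
        raise NotADirectoryError(f"images folder not found: {images_dir}")

    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        # Write all archive members directly into the images folder.
        archive.extractall(images_dir)

    return [images_dir / name for name in names]


# Example call (both paths are placeholders to adapt):
# install_annotations("/path/to/annotation-archive.zip", "/path/to/images")
```

After this has run, the annotation tool can be started on the same images folder as shown above.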

