
Ground truth line annotations for the Berliner Börsen-Zeitung


AniketGurav/bbz-ocr-train

 
 


This repository contains annotations for our mixed-typeface, mixed-context (body text and table text) OCR training corpus, built by randomly sampling over 1000 pages from the Berliner Börsen-Zeitung. An effort was made to balance the number of pages containing Fraktur and Antiqua, as well as the number of lines drawn from each year.

To view/edit annotations you need the following artefacts:

  1. images of full pages
  2. line segmentation for each page
  3. annotations for selected lines (annotations.db)

To obtain items 1 and 2, use one of these two versions:

  1. Mini Version (1 GB). Intended for viewing text annotations and for trying things out.
  2. Full Version (6.8 GB). Intended for editing annotations and exporting training data.

Now unzip annotate.db.zip from this repository, put annotate.db inside the images folder, then run:

python -m origami.tool.annotate /path/to/images
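The unzip-and-copy step can also be scripted. The following is a minimal sketch, not part of this repository: the helper names `install_annotations` and `annotate_command` are hypothetical, and it assumes that `annotate.db` sits at the top level of `annotate.db.zip` and that the images folder already exists.

```python
import sys
import zipfile
from pathlib import Path


def install_annotations(zip_path, images_dir, db_name="annotate.db"):
    """Extract db_name from zip_path into images_dir and return its path."""
    images_dir = Path(images_dir)
    if not images_dir.is_dir():
        raise FileNotFoundError(f"images folder not found: {images_dir}")
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the database is stored at the top level of the archive.
        if db_name not in archive.namelist():
            raise FileNotFoundError(f"{db_name} not found in {zip_path}")
        archive.extract(db_name, path=images_dir)
    return images_dir / db_name


def annotate_command(images_dir):
    """Build the command shown above for the given images folder."""
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", "origami.tool.annotate", str(images_dir)]
```

After `install_annotations("annotate.db.zip", "/path/to/images")` succeeds, the list returned by `annotate_command("/path/to/images")` can be passed to `subprocess.run(..., check=True)`, provided the origami package is installed in the same environment.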
