alto-ocr-confidence

Calculates the OCR confidence score per page in ALTO files.

The method used is really simple:

  • find all String elements
  • get the value of the "WC" (word confidence) attribute for each String
  • calculate sum of all "WC" values
  • divide the sum by the number of words (String elements) on the page
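
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code of alto_ocr_confidence.py itself: the function names `local_name` and `mean_word_confidence` are hypothetical, and the sketch assumes one ALTO file per page, matches String elements by local name so that it works across ALTO namespace versions, and skips String elements that carry no WC attribute. It returns the raw mean of the WC values; whether and how the script scales that value for display is not covered here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def mean_word_confidence(alto_xml):
    """Return the mean WC over all String elements, or None if there are none.

    Assumption: String elements without a WC attribute are ignored.
    """
    root = ET.fromstring(alto_xml)
    values = [
        float(el.attrib["WC"])
        for el in root.iter()
        if local_name(el.tag) == "String" and "WC" in el.attrib
    ]
    if not values:
        return None
    return sum(values) / len(values)


# Tiny made-up ALTO fragment, for illustration only.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" WC="0.50"/>
    <String CONTENT="world" WC="0.75"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(mean_word_confidence(sample))  # 0.625
```

Returning None for a page without any scored words avoids a division by zero and lets the caller decide how to report empty pages.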

Usage:

python alto_ocr_confidence.py <inputdir>

Example output:

File: alto\AZ_1926_04_25_0001.xml, Confidence: 54.13

Note that OCR confidence (which is a native output of the OCR engine) is NOT equal to the actual OCR accuracy, which can only be determined by evaluation against Ground Truth.

Read more about OCR evaluation here.