Skip to content
This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

[Enhancement]: Propagate ocr confidence to output hocr file #86

Merged
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 27 additions & 11 deletions src/pyocr/builders.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ class Box(object):
was used.
"""

def __init__(self, content, position):
def __init__(self, content, position, confidence=None):
"""
Arguments:
content --- a single string
Expand All @@ -53,15 +53,17 @@ def __init__(self, content, position):
content = to_unicode(content)
self.content = content
self.position = position
self.confidence = confidence

def get_unicode_string(self):
"""
Return the string corresponding to the box, in unicode (utf8).
This string can be stored in a file as-is (see write_box_file())
and reread using read_box_file().
"""
return to_unicode("%s %d %d %d %d") % (
return to_unicode("%s %s %d %d %d %d") % (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is, it will break tesseract.CharBoxBuilder.
CharBoxBuilder correspond to a file format specific to Tesseract (configuration 'makebox'). If you modify this function, the format won't be the same than Tesseract anymore, and CharBoxBuilder.read_file() won't be able to read files written by CharBoxBuilder.write_file() anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted changes

self.content,
self.confidence,
self.position[0][0],
self.position[0][1],
self.position[1][0],
Expand All @@ -71,9 +73,10 @@ def get_unicode_string(self):
def get_xml_tag(self, parent_doc):
span_tag = parent_doc.createElement("span")
span_tag.setAttribute("class", "ocrx_word")
span_tag.setAttribute("title", ("bbox %d %d %d %d" % (
span_tag.setAttribute("title", ("bbox %d %d %d %d; x_wconf %d" % (
(self.position[0][0], self.position[0][1],
self.position[1][0], self.position[1][1]))))
self.position[1][0], self.position[1][1],
self.confidence))))
txt = xml.dom.minidom.Text()
txt.data = self.content
span_tag.appendChild(txt)
Expand Down Expand Up @@ -268,7 +271,7 @@ def start_line(self, box):
"""
raise NotImplementedError("Implement in subclasses")

def add_word(self, word, box):
def add_word(self, word, box, confidence):
"""
Add a word to output.
"""
Expand Down Expand Up @@ -329,7 +332,7 @@ def write_file(file_descriptor, text):
def start_line(self, box):
self.built_text.append(u"")

def add_word(self, word, box):
def add_word(self, word, box, confidence=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Either make confidence optional for all add_word() or for none. But the API of the builders must be the same for all of them.
  2. I would recommend '0' as default value instead. Seems safer.

Copy link
Member

@jflesch jflesch Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you missed my point 1 :-)
Either you specify a default value for confidence on all the Builders.add_word() methods (making the argument optional), or on none of them. All builders must all have the same API (see the base class builders.BaseBuilder)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry I did miss you point! The parameter has now been made optional for all add_word methods

if self.built_text[-1] != u"":
self.built_text[-1] += u" "
self.built_text[-1] += word
Expand Down Expand Up @@ -381,12 +384,23 @@ def __init__(self):

self.__current_box_position = None
self.__current_box_text = None
self.__current_box_confidence = None
self.boxes = []

self.__current_line_position = None
self.__current_line_content = []
self.lines = []

@staticmethod
def __parse_confidence(title):
for piece in title.split("; "):
piece = piece.strip()
if not piece.startswith("x_wconf"):
continue
confidence = piece.split(" ")[1]
return int(confidence)
raise Exception("Invalid hocr confidence measure: %s" % title)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This will be a problem. Paperwork uses this code to write and read hOCR files. In other words, there are already a lot of people (me included) with a lot of documents/hOCR files written without the confidence.
    This code must not raise an exception if the confidence is not found. However it can display a trace instead (this is not really unexpected nor a problem --> not a warning --> logger.info).
    In other words, please do no break the compatibility (API or file formats) :-)

  2. The message is not correct. The problem here is not that the confidence is invalid. The problem is that it hasn't been found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected!


@staticmethod
def __parse_position(title):
for piece in title.split("; "):
Expand All @@ -413,7 +427,9 @@ def handle_starttag(self, tag, attrs):
return
if tag_type == 'ocr_word' or tag_type == 'ocrx_word':
try:
confidence = self.__parse_confidence(position)
position = self.__parse_position(position)
self.__current_box_confidence = confidence
self.__current_box_position = position
except Exception:
# invalid position --> old format --> we ignore this tag
Expand All @@ -439,7 +455,7 @@ def handle_endtag(self, tag):
if self.__current_box_text is None:
return
box_position = self.__current_box_position
box = Box(self.__current_box_text, box_position)
box = Box(self.__current_box_text, box_position, self.__current_box_confidence)
self.boxes.append(box)
self.__current_line_content.append(box)
self.__current_box_text = None
Expand Down Expand Up @@ -596,8 +612,8 @@ def write_file(file_descriptor, boxes):
def start_line(self, box):
pass

def add_word(self, word, box):
self.word_boxes.append(Box(word, box))
def add_word(self, word, box, confidence):
self.word_boxes.append(Box(word, box, confidence))
Copy link
Member

@jflesch jflesch Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_word() functions are called by pyocr.libtesseract.image_to_string(). Since you haven't updated libtesseract support, this change will break it.

I suggest you try running the tests. Make sure you have cuneiform, tesseract, libtesseract, tesseract-ocr-fra, and tesseract-ocr-jpn installed, and simply run ./run_tests.py.

Unfortunately, the outputs of the tests slightly vary based on the exact version of Tesseract and Cuneiform you're using (and wind direction I guess ....). So you will have to filter the failed tests manually: ignore those that failed just because the output has slightly changed, and just focus on the ones that failed because the API broke due to your changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point! Here is what I did:

  • I've implemented the changes in libtesseract/__init__.py and libtesseract/tesseract_raw.py to support the extraction of the confidence measure using the C-API
  • The confidence parameter for the Box constructor has a default value of 0

All in all, this basically means that the confidence measure is propagated to the output hocr files for the tesseract and libtesseract interfaces and a value of 0 is used for all the words when using cuneiform. Do you think that makes sense?

I've done some manual testing with Cuneiform, Tesseract and libtesseract to verify everything was working as expected (looking at the output hocr files). There are however still many failing unit-tests when running the test suite. I must say that it's quite hard to know if it's because I broke the API or just because tesseract feels like spitting out a different output 😅

Copy link
Member

@jflesch jflesch Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know, they are a pain. Unfortunately I haven't found a better way :/
Hint: if they fail on assertEqual, it's usually that the output differs from what was expected (aka the wind is blowing north now). For this change, you ignore those tests and focus on those where an uncatched Exception has been raised.


def end_line(self):
pass
Expand Down Expand Up @@ -680,8 +696,8 @@ def start_line(self, box):
return
self.lines.append(LineBox([], box))

def add_word(self, word, box):
self.lines[-1].word_boxes.append(Box(word, box))
def add_word(self, word, box, confidence):
self.lines[-1].word_boxes.append(Box(word, box, confidence))

def end_line(self):
pass
Expand Down