bug: `TesseractOcrModel` is sensitive to document orientation #1155

ClemDoum · 2025-03-13T10:14:21Z

Bug

When running `DocumentConverter` with `TesseractOcrOptions`, I noticed that documents that are not correctly oriented are not processed correctly.
I observed the same behavior with EasyOCR, but not with macOS OCR.

However, when using Tesseract we can detect the page orientation with `self.osd_reader.DetectOrientationScript()`, rotate the page upright, and then perform OCR, which would improve recognition accuracy.
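
A minimal sketch of the detection step, not the actual fix. It assumes `osd_reader` is a `tesserocr.PyTessBaseAPI`, whose `DetectOrientationScript()` returns either `None` or a dict with `orient_deg` and `orient_conf` keys. The helper name `detected_orientation` and the `min_confidence` threshold are hypothetical, introduced only for illustration.

```python
from typing import Any, Mapping, Optional

# Orientations Tesseract OSD is expected to report, in degrees.
VALID_ORIENTATIONS = (0, 90, 180, 270)


def detected_orientation(
    osd: Optional[Mapping[str, Any]],
    min_confidence: float = 1.0,  # hypothetical threshold, tune on real data
) -> int:
    """Return the page orientation reported by OSD, or 0 if it cannot be trusted."""
    if not osd:
        # OSD can fail (e.g. too little text on the page): leave the page as is.
        return 0
    try:
        degrees = int(osd["orient_deg"]) % 360
        confidence = float(osd.get("orient_conf", 0.0))
    except (KeyError, TypeError, ValueError):
        # Missing or malformed fields: treat the page as upright.
        return 0
    if degrees not in VALID_ORIENTATIONS or confidence < min_confidence:
        return 0
    return degrees


# Example with hand-written OSD results (no Tesseract call involved).
print(detected_orientation({"orient_deg": 180, "orient_conf": 12.3}))  # 180
print(detected_orientation(None))  # 0
```

With a PIL image, a non-zero result could then be undone with `image.rotate(...)` and `expand=True` before OCR. Whether the angle or its negative is the right argument depends on the orientation convention, so this is worth checking on a page rotated by 90°.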

I will try to propose a fix soon.

Steps to reproduce

correct_orientation.pdf:

correct_orientation.pdf

wrong_orientation.pdf:

wrong_orientation.pdf

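from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# The same text, once upright and once in the wrong orientation
paths = ["correct_orientation.pdf", "wrong_orientation.pdf"]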
ocr_options = TesseractOcrOptions(lang=["eng"], force_full_page_ocr=True)
pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)
format_options = {
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
}
converter = DocumentConverter(format_options=format_options)
markdowns = [
    res.document.export_to_markdown() for res in converter.convert_all(paths)
]
for m in markdowns:
    print(m + "\n\n")

Actual output:

## i should be correctly parsea

whnetner correctly oriented or not


pesied AjOesI09 8g PNOUS |

JOU JO P9JUSHO A}O9IIOO JOUJOUM

Expected output:

## i should be correctly parsea

whnetner correctly oriented or not


## i should be correctly parsea

whnetner correctly oriented or not

Docling version

Docling version: 2.26.0
Docling Core version: 2.21.2
Docling IBM Models version: 3.4.1
Docling Parse version: 3.4.0
Python: cpython-311 (3.11.9)
Platform: macOS-14.6-arm64-arm-64bit

Python version

Python 3.11.9

ClemDoum added the bug label Mar 13, 2025

ClemDoum mentioned this issue Mar 14, 2025

fix(ocr): tesseract support mis-oriented documents #1167

Open
