bug: TesseractOcrModel is sensitive to document orientation #1155

Open
ClemDoum opened this issue Mar 13, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@ClemDoum

Bug

When running DocumentConverter with TesseractOcrOptions, I noticed that documents that are not correctly oriented are not processed correctly.
I observed the same behavior with EasyOCR, but not with macOS OCR.

However, with Tesseract we can detect the page orientation using self.osd_reader.DetectOrientationScript(), rotate the page accordingly, and only then perform OCR, which would improve recognition performance.
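A minimal sketch of the detect-then-rotate step, not the actual fix. It assumes the OSD reader returns a dict with `orient_deg` and `orient_conf` keys (or `None` when detection fails), that `orient_deg` is the clockwise rotation of the input page as described in the Tesseract API header, and that the page image exposes a Pillow-style `rotate(angle, expand=True)` that rotates counter-clockwise. The helper names `correction_angle` and `upright` are hypothetical.

```python
from typing import Any, Mapping, Optional

# Orientations Tesseract OSD is expected to report (assumption, in degrees).
VALID_ORIENTATIONS = (0, 90, 180, 270)


def correction_angle(
    osd: Optional[Mapping[str, Any]], min_confidence: float = 0.0
) -> int:
    """Return the counter-clockwise angle expected to bring the page upright.

    `osd` is the value returned by DetectOrientationScript(); it is assumed
    to be None when detection fails.
    """
    if not osd:
        return 0  # no detection result: leave the page untouched
    angle = int(osd["orient_deg"])
    if angle not in VALID_ORIENTATIONS:
        raise ValueError(f"unexpected orientation: {angle}")
    if float(osd.get("orient_conf", 0.0)) < min_confidence:
        return 0  # detection too uncertain: do not rotate
    # Assumption: the page is rotated clockwise by `angle`, so rotating it
    # counter-clockwise by the same amount undoes the rotation.
    return angle


def upright(image: Any, osd: Optional[Mapping[str, Any]], min_confidence: float = 0.0) -> Any:
    """Rotate a Pillow-style image so that it can be passed to OCR upright."""
    angle = correction_angle(osd, min_confidence)
    if angle == 0:
        return image
    # expand=True keeps the whole page visible after a 90 or 270 degree turn.
    return image.rotate(angle, expand=True)
```

In the OCR model this would presumably be called right after `osd = self.osd_reader.DetectOrientationScript()`, with the returned image passed to the regular OCR reader. The sign convention is worth checking on a 90-degree sample, since it does not matter for a page rotated by 180 degrees. Bounding boxes found on the rotated image would also have to be mapped back to the original page coordinates.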

I will try to propose a fix soon.

Steps to reproduce

correct_orientation.pdf:

correct_orientation.pdf

wrong_orientation.pdf:

wrong_orientation.pdf

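from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Two PDFs with the same content; the second one is not correctly oriented.
paths = ["correct_orientation.pdf", "wrong_orientation.pdf"]
# Force OCR on the full page so the result depends only on the OCR engine.
ocr_options = TesseractOcrOptions(lang=["eng"], force_full_page_ocr=True)
pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)
format_options = {
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
}
converter = DocumentConverter(format_options=format_options)
# Convert both documents and export each result to Markdown.
markdowns = [
    res.document.export_to_markdown() for res in converter.convert_all(paths)
]
for m in markdowns:
    print(m + "\n\n")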

outputs:

## i should be correctly parsea

whnetner correctly oriented or not


pesied AjOesI09 8g PNOUS |

JOU JO P9JUSHO A}O9IIOO JOUJOUM

Expected output:

## i should be correctly parsea

whnetner correctly oriented or not


## i should be correctly parsea

whnetner correctly oriented or not

Docling version

Docling version: 2.26.0
Docling Core version: 2.21.2
Docling IBM Models version: 3.4.1
Docling Parse version: 3.4.0
Python: cpython-311 (3.11.9)
Platform: macOS-14.6-arm64-arm-64bit

Python version

Python 3.11.9

@ClemDoum ClemDoum added the bug Something isn't working label Mar 13, 2025