Bug: Searchable pdf comes out improperly rotated even though the images look fine [Exif rotation metadata problem] #16

Open
2 tasks
ElectricRCAircraftGuy opened this issue May 1, 2021 · 0 comments
Labels
bug Something isn't working

Comments

Owner

ElectricRCAircraftGuy commented May 1, 2021

Scenario:

I take some photos of documents with my phone and download them. They look properly rotated. I `cd` into their directory and run `pdf2searchablepdf .`, which produces the file `._searchable.pdf`.

The PDF pages are improperly rotated though!

Double-clicking an image in Ubuntu to open it in the Ubuntu Image Viewer shows it rotated properly, so what's wrong?

Well, it turns out the image contains "Exif orientation metadata" which tesseract is apparently ignoring! Open the image in GIMP and it will show the following:

This image contains Exif orientation metadata. Would you like to rotate the image?

[screenshot]
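To check whether a given photo actually carries this tag, something like the following could work. It is only a rough sketch, not anything from this repo: the function names are made up for illustration, it uses only the Python standard library, and it assumes a plain JPEG whose Exif data sits in an APP1 segment before the image data. The Orientation tag is `0x0112`; a value of 1 means "no rotation needed", and any other value in 2 to 8 means the viewer is expected to rotate or mirror the pixels at display time.

```python
import struct


def _orientation_from_tiff(tiff):
    """Read the Orientation tag (0x0112) from the first IFD of a TIFF blob."""
    if len(tiff) < 8:
        return None
    if tiff[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif tiff[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        return None
    (ifd_offset,) = struct.unpack(endian + "I", tiff[4:8])
    if ifd_offset + 2 > len(tiff):
        return None
    (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        entry = tiff[start:start + 12]
        if len(entry) < 12:
            return None
        tag, field_type = struct.unpack(endian + "HH", entry[:4])
        if tag == 0x0112 and field_type == 3:  # type 3 = SHORT
            (value,) = struct.unpack(endian + "H", entry[8:10])
            return value
    return None


def read_exif_orientation(data):
    """Return the Exif Orientation value from JPEG bytes, or None if absent."""
    if data[:2] != b"\xff\xd8":  # not a JPEG (no SOI marker)
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # start of scan / end of image
            return None
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + seg_len]
        if marker == 0xE1 and payload[:6] == b"Exif\x00\x00":
            return _orientation_from_tiff(payload[6:])
        pos += 2 + seg_len
    return None


def relies_on_exif_rotation(data):
    """True if a viewer must rotate/mirror the image to show it upright."""
    orientation = read_exif_orientation(data)
    return orientation is not None and orientation != 1
```

Usage would be along the lines of `relies_on_exif_rotation(open("photo.jpg", "rb").read())`, where `photo.jpg` is a placeholder name. A `True` result means the image only looks upright in viewers that honor the tag.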

So:

  1. Report this as a bug to tesseract.
  2. In the meantime, apply a workaround that forces a true rotation of the pixel data before running tesseract:
    sudo apt install exiftran
    cd path/to/dir_of_images
    exiftran -ai *.jpg
  See my answer here: https://superuser.com/a/1645862/425838.
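The workaround in item 2 could be scripted roughly as follows. This is a minimal sketch rather than part of the tool: `build_commands` and `run_workaround` are hypothetical helper names, and it assumes both `exiftran` and `pdf2searchablepdf` are installed and on the `PATH`. It reuses the same `exiftran -ai` invocation as above, then runs `pdf2searchablepdf` on the directory.

```python
import subprocess
from pathlib import Path


def build_commands(image_dir):
    """Return the command lines to run, in order, for one directory of images."""
    directory = Path(image_dir)
    # Same pattern as the shell glob `*.jpg` (case-sensitive, non-recursive).
    jpgs = sorted(str(p) for p in directory.glob("*.jpg"))
    commands = []
    if jpgs:
        # -a: rotate automatically based on the Exif orientation tag
        # -i: modify the files in place
        commands.append(["exiftran", "-ai", *jpgs])
    commands.append(["pdf2searchablepdf", str(directory)])
    return commands


def run_workaround(image_dir):
    """Physically rotate the JPEGs, then build the searchable PDF."""
    for command in build_commands(image_dir):
        # check=True raises CalledProcessError if a step fails.
        subprocess.run(command, check=True)
```

Calling `run_workaround("path/to/dir_of_images")` would modify the JPEGs in place, so it may be worth working on a copy of the photos.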

I should also auto-enhance (whiten) the images using the 2 Python algorithms from my answer here: https://stackoverflow.com/questions/48268068/how-do-i-do-the-equivalent-of-gimps-colors-auto-white-balance-in-python-fu/67343271#67343271. See also: https://superuser.com/questions/370920/auto-image-enhance-for-ubuntu.
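For a rough idea of what such a whitening step does, here is a simplified sketch of a per-channel histogram stretch. It is not the code from the linked answer: the function names are hypothetical, it works on plain lists of `(r, g, b)` tuples instead of real image arrays, and the default `clip_fraction` of 0.05% is an assumption based on how GIMP describes its auto white balance (discarding the rarely used extremes of each channel, then stretching what remains to the full range).

```python
def stretch_channel(values, clip_fraction=0.0005):
    """Stretch one 8-bit channel so its clipped min/max map to 0 and 255."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    k = int(n * clip_fraction)  # number of samples ignored at each end
    low = ordered[k]
    high = ordered[n - 1 - k]
    if high == low:
        return list(values)  # flat channel: nothing to stretch
    out = []
    for v in values:
        scaled = (v - low) * 255.0 / (high - low)
        out.append(int(round(min(255.0, max(0.0, scaled)))))
    return out


def auto_white_balance(pixels, clip_fraction=0.0005):
    """Apply stretch_channel independently to the R, G, and B channels."""
    if not pixels:
        return []
    channels = list(zip(*pixels))  # [(r...), (g...), (b...)]
    stretched = [stretch_channel(c, clip_fraction) for c in channels]
    return list(zip(*stretched))
```

On a photographed page this tends to push the paper toward white and the ink toward black, since each channel is stretched on its own. A real implementation would operate on NumPy arrays for speed.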

And I should compress them with jpegoptim as I explain in my readme here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF#image-size-notes.

@ElectricRCAircraftGuy ElectricRCAircraftGuy added the bug Something isn't working label May 1, 2021