Alternative pdf processing #20

Open
wants to merge 2 commits into
base: master

@schoeppe schoeppe commented Oct 18, 2019

Maybe this alternative implementation is useful:
Because Elasticsearch with the ingest-attachment plugin (Apache Tika) already does a good job of indexing the text parts of PDFs, we only OCR the image parts.

This implementation depends on the Linux command line tool pdfimages (sudo apt-get install poppler-utils) instead of ImageMagick and Ghostscript for PDF processing. ImageMagick is still needed for processing images, but the extra configuration for PDF processing (mentioned here) is no longer necessary.

This could lead to a significant speedup when processing "real" PDFs, i.e. PDFs that contain actual text and not only scanned images.
Tested with the Nextcloud Manual, the speedup was 350%.
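
For readers unfamiliar with pdfimages, the extraction step described above could look roughly like the following. This is only a minimal sketch in Python, not the code from this PR: the function names, the `img` file prefix and the output directory layout are assumptions made for illustration. It relies on the `-png` option of pdfimages, which writes each embedded image as a PNG file named `<prefix>-NNN.png`.

```python
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_prefix):
    """Build the pdfimages call (hypothetical helper, not part of the PR)."""
    # -png asks pdfimages to write every embedded image as a PNG file.
    return ["pdfimages", "-png", str(pdf_path), str(output_prefix)]


def extract_embedded_images(pdf_path, output_dir):
    """Extract all images embedded in a PDF and return their paths, sorted."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # "img" is an assumed prefix; pdfimages appends "-000", "-001", ...
    prefix = output_dir / "img"
    subprocess.run(build_pdfimages_command(pdf_path, prefix), check=True)
    return sorted(output_dir.glob("img-*.png"))
```

Only the extracted images would then be passed on to OCR; the text layer of the PDF is left to the search backend.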

@ArtificialOwl
Member

Hello @schoeppe and thanks for your patience.

I ran some tests of your solution on my side, but I get about the same result: 179s with your PR against 183s with master. Are you sure about your speedup?
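
To put the two timings side by side, the ratio can be computed as below. This is a small illustrative sketch; the `speedup` helper is a hypothetical name, and only the 183s and 179s figures come from the measurement above.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# master (baseline) against the PR (candidate), timings from the comment above
ratio = speedup(183, 179)
print(f"{ratio:.3f}x")
```

With these numbers the PR comes out roughly 2 percent faster than master, far from the reported 350%.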

@schoeppe
Author

Hi @daita,
if I remember correctly, this implementation is not faster for PDFs containing only images, e.g. scanned documents. For these PDFs the performance should be about the same as with the original implementation.

But for PDFs containing both text and images, e.g. the "Nextcloud Manual", there should be a significant performance gain. Did you get your results from indexing the "Nextcloud Manual"?

@ArtificialOwl
Member

ArtificialOwl commented Mar 12, 2020

Yes, I did a comparison using Nextcloud Manual.pdf.

The thing is that your edit only affects the way the image files are generated from the PDF, not the OCR itself, which is done via tesseract, right?

@schoeppe
Author

My version extracts all images embedded in the PDF. The OCR is the same.
I will do some more benchmarks when I have time and get back to you with the results!
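
To illustrate the point that the OCR step is unchanged, here is a rough sketch of running tesseract over a list of extracted image files. It is written in Python for illustration and is not the app's actual code: the function names and the default `eng` language are assumptions. It uses the tesseract command line form `tesseract <image> stdout -l <lang>`, which prints the recognized text to standard output.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the tesseract call (hypothetical helper, for illustration only)."""
    # "stdout" as output base makes tesseract print the text instead of writing a file.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_images(image_paths, lang="eng"):
    """Run OCR on each image and join the non-empty results with newlines."""
    texts = []
    for image_path in image_paths:
        result = subprocess.run(
            build_tesseract_command(image_path, lang),
            check=True,
            capture_output=True,
            text=True,
        )
        texts.append(result.stdout.strip())
    return "\n".join(text for text in texts if text)
```

In this picture, any difference in runtime between the two implementations would come from how many images are handed to this step and how large they are, not from the OCR call itself.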
