Alternative pdf processing #20

schoeppe · 2019-10-18T15:14:31Z

Maybe this alternative implementation is useful:
Because Elasticsearch with the ingest-attachment plugin/Apache Tika already does a good job of indexing the text parts of PDFs, we only OCR the image parts.

This implementation depends on the Linux command line tool pdfimages (sudo apt-get install poppler-utils) instead of ImageMagick and Ghostscript for PDF processing. ImageMagick is still needed for processing images, but the extra configuration for PDF processing (mentioned here) is no longer necessary.

This could lead to a significant speedup when processing "real" PDFs, i.e. PDFs that contain actual text rather than only scanned pages.
When tested with the Nextcloud Manual, the speedup was 350%.
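
For readers who want to try the extraction step on its own, here is a minimal sketch in Python. It is not the code from this PR: the function names and the "img" output prefix are hypothetical, and it assumes pdfimages from poppler-utils is on the PATH and that its -png option is available.

```python
import subprocess
from pathlib import Path


def build_pdfimages_cmd(pdf_path: str, out_prefix: str) -> list[str]:
    """Build a pdfimages command line that writes embedded images as PNG."""
    # -png changes the default output format from PPM/PBM to PNG.
    return ["pdfimages", "-png", pdf_path, out_prefix]


def extract_embedded_images(pdf_path: str, out_dir: str) -> list[Path]:
    """Run pdfimages and return the files it wrote, sorted by name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    prefix = out / "img"  # hypothetical prefix; pdfimages appends -000, -001, ...
    subprocess.run(build_pdfimages_cmd(pdf_path, str(prefix)), check=True)
    return sorted(out.glob("img-*"))
```

Only the embedded images come out of this step; the text layer is left to Elasticsearch/Tika, as described above.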

ArtificialOwl · 2020-03-10T12:05:47Z

Hello @schoeppe and thanks for your patience.

I ran some tests of your solution on my side, but I get about the same result: 179 s with your PR against 183 s with master. Are you sure about your speedup?

schoeppe · 2020-03-11T19:57:30Z

Hi @daita,
if I remember correctly this implementation is not faster for PDFs containing only images, e.g. scanned documents. For these PDFs the performance should be about the same as the original implementation.

But for PDFs containing text and images, e.g. the "Nextcloud Manual", there should be a significant performance gain. Did you get your results from indexing the "Nextcloud Manual"?

ArtificialOwl · 2020-03-12T10:54:10Z

Yes, I did some comparisons using Nextcloud Manual.pdf.

The thing is that your edit only affects the way the image files are generated from the PDF, not the OCR itself, which is done via tesseract, right?
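
The split between the two stages can be sketched roughly as follows, in Python. This is an illustration only, not the app's actual code: ocr_pdf and build_tesseract_cmd are hypothetical names, and the sketch assumes the tesseract CLI is invoked with "stdout" as its output base so that the recognised text is printed.

```python
from typing import Callable, Iterable


def build_tesseract_cmd(image_path: str, lang: str = "eng") -> list[str]:
    """Build a tesseract command line that prints the recognised text."""
    # "stdout" as the output base sends the text to standard output.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(
    pdf_path: str,
    extract: Callable[[str], Iterable[str]],
    ocr: Callable[[str], str],
) -> str:
    """Run OCR over every image that the extraction step yields."""
    # Swapping `extract` changes how images are produced from the PDF;
    # the `ocr` stage stays the same either way.
    return "\n".join(ocr(image) for image in extract(pdf_path))
```

Under this view, any difference in total runtime would come from how many images the extraction step produces and how large they are, since the OCR call itself does not change.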

schoeppe · 2020-03-12T19:33:09Z

My version extracts all images embedded in the PDF. The OCR is the same.
I will do some more benchmarks when I have time and get back to you with the results!
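
A small timing harness could make such benchmarks easier to compare. The Python sketch below uses hypothetical helper names and reports the speedup as the ratio of baseline time to candidate time, which is one possible convention.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds that a single call to fn takes."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio of baseline to candidate runtime; values above 1 mean faster."""
    return baseline_seconds / candidate_seconds
```

With the timings reported earlier in this thread (183 s on master, 179 s with the PR), speedup(183, 179) comes out at roughly 1.02, i.e. about a 2% improvement.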

schoeppe added 2 commits October 18, 2019 16:08

Alternative implementation extracting only the images from the PDFs

b6954d0

Removed dependencies not used by alternative implementation

a5f73b8
