PDF recognition #38

dstillman · 2018-09-04T06:50:42Z

Tentative plan:

When downloading a URL, either make a HEAD request first to see if the URL is a PDF or, if possible, gracefully handle PDF downloads in Zotero.HTTP.request() with a maximum download size.
Add another endpoint that accepts PDF data.
Once we have the PDF data, upload that to a new recognizer-server endpoint.
recognizer-server might send the PDF data to a Lambda for pdftotext processing, or it might be in Lambda itself if we move the DB from SQLite to MySQL
translation-server gets back identifiers from recognizer-server, runs translation on them, and returns metadata
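
A rough sketch of the HEAD-request option from the first step, in case it helps the discussion. `isPDFURL` and the injectable `fetchFn` parameter are hypothetical names, not existing translation-server code, and the sketch assumes a fetch-style client (global `fetch` in Node 18+) rather than whatever Zotero.HTTP.request uses internally.

```javascript
// Hypothetical helper: issue a HEAD request and report whether the server
// labels the URL as a PDF. fetchFn is injectable so the sketch does not
// depend on one particular HTTP client.
async function isPDFURL(url, fetchFn = fetch) {
  const res = await fetchFn(url, { method: 'HEAD', redirect: 'follow' });
  if (!res.ok) {
    // Some servers reject HEAD requests; a caller would then have to fall
    // back to a size-limited GET instead of trusting this answer.
    return false;
  }
  const contentType = (res.headers.get('content-type') || '').toLowerCase();
  return contentType.startsWith('application/pdf');
}
```

A HEAD check costs an extra round trip and relies on the server reporting the type honestly, which is why the plan lists handling the PDF inside the download itself as the alternative.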


mrtcode · 2018-11-16T14:16:49Z

So a PDF must be recognized in two cases:

1) The user enters a PDF URL
2) The user uploads a PDF file

For 1), we have to take over the request if the URL turns out to point to a PDF and, instead of processing it with translators, upload the file to S3 and trigger further processing. Currently, if Zotero.HTTP.request is set to return a document, it treats every response as a document, regardless of whether it is PDF or HTML. That's the first thing we need to fix. I think we should check the response's Content-Type and process the content with JSDOM only if it is text/html. Otherwise, the request function should just return the raw data.
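
To make the content-type check concrete, here is a minimal sketch. `classifyContentType` and `looksLikePDF` are hypothetical helpers, not part of Zotero.HTTP.request, and the magic-byte fallback is an extra assumption on top of the header check described above.

```javascript
// Hypothetical helper: decide how to treat a response from its
// Content-Type header value.
function classifyContentType(headerValue) {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mediaType = String(headerValue || '').split(';')[0].trim().toLowerCase();
  if (mediaType === 'text/html') {
    return 'html'; // safe to parse with JSDOM and hand to translators
  }
  if (mediaType === 'application/pdf') {
    return 'pdf'; // return raw bytes for the PDF pipeline
  }
  return 'other'; // return raw data untouched
}

// Assumption: some servers mislabel PDFs, so the leading bytes of the body
// can serve as a fallback check. PDF files start with "%PDF-".
function looksLikePDF(buffer) {
  return buffer.length >= 5 && buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}
```

With something like this, the request function would only build a document for `'html'` and would hand back the raw buffer in every other case.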

Then we'll need to update processDocuments and the functions that call it. There should be a condition that checks whether Zotero.HTTP.request returned a document, which should be passed to translators, or a PDF file, which should be uploaded to S3.

Next, we should limit the download size, but with request.js we can only do that by manually listening on the stream and counting bytes. I think files should be limited to 50 MB.
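
The byte counting could look roughly like the sketch below. `readWithLimit` and `MAX_PDF_BYTES` are hypothetical names, and the sketch works on any Node readable stream that emits Buffer chunks. With request.js, the request would presumably be aborted at the point where the stream is destroyed here.

```javascript
const { Readable } = require('node:stream');

// 50 MB cap, as suggested above (interpreted here as 50 * 1024 * 1024 bytes).
const MAX_PDF_BYTES = 50 * 1024 * 1024;

// Hypothetical helper: collect a readable stream into a Buffer and give up
// as soon as more than maxBytes have been received.
function readWithLimit(stream, maxBytes) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    let received = 0;
    stream.on('data', (chunk) => {
      received += chunk.length;
      if (received > maxBytes) {
        // Stop reading immediately instead of downloading the rest.
        stream.destroy();
        reject(new Error(`Response exceeded ${maxBytes} bytes`));
        return;
      }
      chunks.push(chunk);
    });
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });
}

// Usage with an in-memory stream standing in for an HTTP response body.
readWithLimit(Readable.from([Buffer.from('%PDF-1.7')]), MAX_PDF_BYTES)
  .then((body) => console.log(`received ${body.length} bytes`));
```

Checking Content-Length up front can reject oversized files earlier, but the header is optional and can be wrong, so the running count is still needed.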

Now for option 2), the client should first get a signed URL from translation-server (t-s), then upload the file, and then query t-s again.
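
From the client's side, the three steps might be sketched as follows. Everything specific here is a placeholder: the endpoint paths (`/pdf/upload-url`, `/pdf/recognize`), the JSON fields (`uploadURL`, `uploadKey`) and the function name are hypothetical, since no such API has been defined in this thread.

```javascript
// Hypothetical client flow for an uploaded PDF. All endpoint paths and JSON
// field names are placeholders for illustration only.
async function recognizeUploadedPDF(baseURL, pdfBytes, fetchFn = fetch) {
  // 1. Ask translation-server for a signed upload URL.
  const signedRes = await fetchFn(`${baseURL}/pdf/upload-url`, { method: 'POST' });
  if (!signedRes.ok) {
    throw new Error(`Could not get upload URL: ${signedRes.status}`);
  }
  const { uploadURL, uploadKey } = await signedRes.json();

  // 2. Upload the file directly to storage using the signed URL.
  const putRes = await fetchFn(uploadURL, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/pdf' },
    body: pdfBytes
  });
  if (!putRes.ok) {
    throw new Error(`Upload failed: ${putRes.status}`);
  }

  // 3. Query translation-server again, referencing the uploaded file.
  const metaRes = await fetchFn(`${baseURL}/pdf/recognize`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ uploadKey })
  });
  if (!metaRes.ok) {
    throw new Error(`Recognition failed: ${metaRes.status}`);
  }
  return metaRes.json();
}
```

The point of the signed URL is that the file bytes go straight to storage and never pass through t-s itself.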

theFool32 · 2020-05-14T15:10:11Z

Hey, I wonder whether there has been any progress on this?
I believe this feature makes sense.

monperrus · 2020-07-19T09:04:31Z

FYI, a notable software library to extract metadata from PDFs is grobid: https://github.com/kermitt2/grobid

alexkreidler · 2023-08-04T06:32:55Z

The https://github.com/zotero/recognizer-server repo is not publicly available, apparently because it isn't self-contained: https://forums.zotero.org/discussion/80101/zotero-service-for-metadata-extraction. What external APIs does the service rely on? Stuff like AWS/GCP/Azure OCR services? Then we could figure out how to make it modular so users could use open source alternatives locally.

dstillman added the enhancement label Sep 4, 2018

dstillman assigned mrtcode Sep 4, 2018

mrtcode mentioned this issue Nov 28, 2018

Add PDF handling #59

Open

dstillman mentioned this issue Dec 23, 2018

Try translating PDF URLs based on URL #70

Open
