PDF recognition #38

Open
dstillman opened this issue Sep 4, 2018 · 4 comments

@dstillman
Member

Tentative plan:

  1. When downloading a URL, either make a HEAD request first to see if the URL is a PDF or, if possible, gracefully handle PDF downloads in Zotero.HTTP.request() with a maximum download size.

  2. Add another endpoint that accepts PDF data.

  3. Once we have the PDF data, upload that to a new recognizer-server endpoint.

  4. recognizer-server might send the PDF data to a Lambda for pdftotext processing, or it might run in Lambda itself if we move the DB from SQLite to MySQL.

  5. translation-server gets back identifiers from recognizer-server, runs translation on them, and returns metadata.
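
A rough sketch of how the steps above could fit together, written against injected dependencies so that it makes no claims about the real code. Every name here (`Deps`, `handleUrl`, `headContentType`, and so on) is hypothetical and not taken from translation-server or recognizer-server; the maximum size is passed in because the plan does not fix a value.

```typescript
// Hypothetical sketch of the tentative plan; none of these names exist in the real projects.
type Metadata = Record<string, unknown>;

interface Deps {
  // Step 1: HEAD request, returns the Content-Type header value (or null if absent).
  headContentType(url: string): Promise<string | null>;
  // Step 1: download with a hard size cap.
  downloadPdf(url: string, maxBytes: number): Promise<Uint8Array>;
  // Steps 3-4: send the PDF to the recognizer and get identifiers back.
  recognize(pdf: Uint8Array): Promise<string[]>;
  // Step 5: run translation on the identifiers.
  translateIdentifiers(ids: string[]): Promise<Metadata[]>;
  // Existing behavior for non-PDF URLs.
  translateWebPage(url: string): Promise<Metadata[]>;
}

async function handleUrl(url: string, maxBytes: number, deps: Deps): Promise<Metadata[]> {
  const contentType = await deps.headContentType(url);
  const isPdf = (contentType ?? "").trim().toLowerCase().startsWith("application/pdf");
  if (!isPdf) {
    // Not a PDF: fall back to normal web translation.
    return deps.translateWebPage(url);
  }
  const pdf = await deps.downloadPdf(url, maxBytes);
  const identifiers = await deps.recognize(pdf);
  return deps.translateIdentifiers(identifiers);
}
```

Step 2 (a separate endpoint that accepts PDF data directly) would enter this flow at the `recognize` call, skipping the HEAD request and the download.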

@dstillman dstillman added the enhancement New feature or request label Sep 4, 2018
@mrtcode
Member

mrtcode commented Nov 16, 2018

So a PDF must be recognized when:

  1. User enters a PDF URL
  2. User uploads a PDF file

For 1), we have to take over the URL if it turns out to be a PDF URL and, instead of processing it with translators, upload it to S3 and trigger further processing. Currently, if Zotero.HTTP.request is set to return a document, it treats every response as a document, regardless of whether it is PDF or HTML. That's the first thing we need to fix. I think we should check the response's content type and process the content with JSDOM only if it is text/html. Otherwise, the request function should just return the raw data.
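
To illustrate the content-type check, here is a minimal sketch, assuming the value being inspected is the response's `Content-Type` header. `mediaType` and `bodyKind` are hypothetical helpers, not part of Zotero.HTTP.

```typescript
// Hypothetical helpers; not part of Zotero.HTTP.
type BodyKind = "html" | "raw";

// Strip parameters such as "; charset=utf-8" and normalize case.
function mediaType(contentType: string | undefined): string {
  const value = contentType ?? "";
  const semi = value.indexOf(";");
  const base = semi === -1 ? value : value.slice(0, semi);
  return base.trim().toLowerCase();
}

// Only text/html should be parsed (e.g. with JSDOM); everything else stays raw.
function bodyKind(contentType: string | undefined): BodyKind {
  return mediaType(contentType) === "text/html" ? "html" : "raw";
}
```

A missing header is treated as raw data here, which is the conservative choice: nothing gets parsed as a document unless the server says it is HTML.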

Then we'll need to update processDocuments and the functions that call it. There should be a condition that checks whether Zotero.HTTP.request returned a document, which should be passed to translators, or a PDF file, which should be uploaded to S3.
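
One way that condition could look, as a sketch only: the request function returns a tagged result and the caller branches on the tag. `RequestResult`, `Handlers`, and `dispatch` are hypothetical names and do not reflect the current signature of processDocuments.

```typescript
// Hypothetical result type; the real Zotero.HTTP.request returns something different.
type RequestResult =
  | { kind: "document"; document: unknown }
  | { kind: "pdf"; data: Uint8Array };

interface Handlers<T> {
  translate(document: unknown): T; // pass the parsed document to translators
  uploadPdf(data: Uint8Array): T;  // upload the raw PDF (e.g. to S3)
}

function dispatch<T>(result: RequestResult, handlers: Handlers<T>): T {
  if (result.kind === "document") {
    return handlers.translate(result.document);
  }
  // Narrowed to the "pdf" variant here.
  return handlers.uploadPdf(result.data);
}
```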

Next, we should limit the download size, but with request.js we can only do that by manually listening on the stream and counting bytes. I think files should be limited to 50 MB.
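
A minimal sketch of the byte counting, kept independent of request.js so it can be called from whatever handler receives the chunks. `ByteLimiter` is a hypothetical helper, and treating 50 MB as 50 * 1024 * 1024 bytes is an assumption.

```typescript
// Assumption: "50MB" means 50 MiB.
const MAX_PDF_BYTES = 50 * 1024 * 1024;

// Hypothetical helper: accumulates chunks and reports when the cap is exceeded.
class ByteLimiter {
  private readonly maxBytes: number;
  private readonly chunks: Uint8Array[] = [];
  private accepted = 0;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  // Returns false if this chunk would push the total over the limit.
  // The caller should then abort the request and discard the data.
  push(chunk: Uint8Array): boolean {
    if (this.accepted + chunk.byteLength > this.maxBytes) {
      return false;
    }
    this.chunks.push(chunk);
    this.accepted += chunk.byteLength;
    return true;
  }

  // Join all accepted chunks into one buffer.
  concat(): Uint8Array {
    const out = new Uint8Array(this.accepted);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.byteLength;
    }
    return out;
  }
}
```

The stream's data handler would call `push` for each chunk and abort the request as soon as it returns `false`, so an oversized file is never fully downloaded.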

Now for option 2), the client should first get a signed URL from t-s, then upload the file, and then query t-s again.
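
The three client steps could be sketched like this. The `Transport` interface, its method names, and the idea of an upload token tying the steps together are all assumptions for illustration; no such t-s endpoints exist yet.

```typescript
// Hypothetical client-side flow; endpoint shapes are assumptions.
interface UploadTicket {
  uploadUrl: string; // signed URL to upload the file to
  token: string;     // assumed handle used to ask t-s about this upload later
}

interface Transport {
  requestUploadUrl(): Promise<UploadTicket>;                    // 1. ask t-s for a signed URL
  putFile(uploadUrl: string, data: Uint8Array): Promise<void>;  // 2. upload the file
  queryResult(token: string): Promise<unknown>;                 // 3. query t-s again
}

async function recognizeUploadedPdf(data: Uint8Array, transport: Transport): Promise<unknown> {
  const ticket = await transport.requestUploadUrl();
  await transport.putFile(ticket.uploadUrl, data);
  return transport.queryResult(ticket.token);
}
```

With this shape the file itself never passes through t-s, which only hands out the signed URL and answers the final query.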

@theFool32

Hey, is there any progress on this?
I believe this feature does make sense.

@monperrus

FYI, a notable software library to extract metadata from PDFs is grobid: https://github.com/kermitt2/grobid

@alexkreidler

The https://github.com/zotero/recognizer-server repo is not publicly available, apparently because it isn't self-contained: https://forums.zotero.org/discussion/80101/zotero-service-for-metadata-extraction. What external APIs does the service rely on? Stuff like AWS/GCP/Azure OCR services? Then we could figure out how to make it modular so users could use open source alternatives locally.
