This repository has been archived by the owner on Nov 8, 2023. It is now read-only.

fgrehm/serenata-ocr

Serenata OCR

A Serverless API for OCRing Serenata de Amor's documents (currently limited to Chamber of Deputies receipts). Powered by Claudia.JS and Google Cloud Vision.

From zero to an OCR API in minutes

Demo recording: https://asciinema.org/a/149404

Initial setup

A Docker environment is in the works. Until then, this is what you'll need in terms of tools and setup:

  • git clone git@github.com:fgrehm/serenata-ocr.git
  • cp config.json{.example,}
  • Node.js 6.10 (Warning: this is important, as it is the version executed in AWS Lambda).
  • yarn install or npm install
  • Claudia.JS CLI (npm install -g claudia)
  • AWS credentials configured for claudia as outlined in this tutorial

For OCRing with Google Cloud Vision you'll need:

Deployment

As mentioned above, make sure your AWS credentials are configured as outlined in this tutorial. Once you have that done, proceed to your first deployment of the API:

claudia create --region us-east-1 \
               --api-module app \
               --timeout 60 \
               --memory 512 \
               --set-env-from-json config.json

When claudia create finishes, it prints a URL. To test it, run:

API="https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/latest/chamber-of-deputies/receipt"
# One liner if you have `jq` installed
API="https://$(jq -r '.api.id' claudia.json).execute-api.us-east-1.amazonaws.com/latest/chamber-of-deputies/receipt"

# OCR a receipt and get the full text of the PDF
curl "${API}/1789/2015/5631380" > 5631380.json

# Play with the data
jq '.config + .extra' 5631380.json
jq '.ocrResponse.fullTextAnnotation.text' 5631380.json
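
The same steps can be scripted from Node. The sketch below is a hypothetical illustration, not code from this repository: `receiptUrl` and `summarize` are made-up names, and the only things taken from the commands above are the URL layout and the `config`, `extra`, and `ocrResponse.fullTextAnnotation.text` fields of the response.

```javascript
// Hypothetical helpers, for illustration only (not part of serenata-ocr).

// Build the endpoint URL using the same layout as the curl example above.
// `pathSegments` are the three values after /receipt, e.g. [1789, 2015, 5631380].
function receiptUrl(apiId, pathSegments, region) {
  const host = `${apiId}.execute-api.${region || 'us-east-1'}.amazonaws.com`;
  const path = pathSegments.map(s => encodeURIComponent(String(s))).join('/');
  return `https://${host}/latest/chamber-of-deputies/receipt/${path}`;
}

// Mirror the two jq queries: merge .config with .extra, and pull the full text.
function summarize(response) {
  const annotation = (response.ocrResponse || {}).fullTextAnnotation || {};
  return {
    settings: Object.assign({}, response.config, response.extra),
    text: annotation.text || '',
  };
}
```

For example, `summarize(JSON.parse(fs.readFileSync('5631380.json', 'utf8')))` would give the merged settings and the OCR text of the file saved by the curl command, assuming the response has the shape the jq queries imply.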

Documentation

🚧 Proper documentation is in the works 🚧

  • From a high level, this is what gets done under the hood:
    • The receipt PDF associated with the reimbursement is downloaded from the Chamber of Deputies website.
    • ImageMagick is used to convert the PDF to a PNG image with: convert -density <density> receipt.pdf -quality 100 -deskew 40% -append receipt.png
    • The PNG is uploaded to Google Cloud Vision and the results are sent back to the client.
  • For custom parameters supported by the API, see app.js for now.
  • For local execution, see local.js for now (run with node local.js).
  • Example responses at examples/
  • Some useful utilities at Deskfile
  • More info? Please read the code for now; it is super small.
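
The ImageMagick step in the list above can be sketched in Node roughly as follows. This is an illustration under stated assumptions rather than the code in app.js: `convertArgs` and `pdfToPng` are hypothetical names, only the `convert` flags come from the command quoted above, and the `convert` binary is assumed to be on the PATH.

```javascript
const { execFile } = require('child_process');

// Build the argument list for the `convert` command quoted above.
// (Hypothetical helper; the density check is an added safeguard.)
function convertArgs(density, pdfPath, pngPath) {
  if (!Number.isFinite(density) || density <= 0) {
    throw new Error('density must be a positive number');
  }
  return [
    '-density', String(density),
    pdfPath,
    '-quality', '100',
    '-deskew', '40%',
    '-append', // stack all pages vertically into a single image
    pngPath,
  ];
}

// Run ImageMagick; resolves with the PNG path once the conversion succeeds.
function pdfToPng(density, pdfPath, pngPath) {
  return new Promise((resolve, reject) => {
    execFile('convert', convertArgs(density, pdfPath, pngPath), err =>
      (err ? reject(err) : resolve(pngPath)));
  });
}
```

Passing the arguments as an array to `execFile` avoids shell quoting problems with file paths.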

Wanna help?

See the issue tracker for inspiration.

Troubleshooting

Feel free to create an issue.

Function times out

The document may be too big for your function to handle, so give it a longer timeout and more memory:

claudia update --timeout 90 --memory 1024
