
Use computers to extract structured data from PDFs that were originally generated with structured data.


matthinz/fed-pay-stub-extractor


fed-pay-stub-extractor

This repo contains a small utility for extracting structured data from pay stub PDFs generated by Employee Express.

Requirements

  • Node.js (see .nvmrc for recommended version)
  • Yarn

Getting started

First, go to Employee Express and download PDFs of your pay stubs. If you have a lot of them, this will take a while.

Then, install the dependencies and build the project:

$ yarn && yarn build

Then run the script using yarn start, passing in paths to PDF files:

$ yarn --silent start path/to/your/pay-stub.pdf

You can pass multiple PDF files this way; just add them to the command line:

$ yarn --silent start path/to/your/pay-stubs/*.pdf

By default, the output is CSV, suitable for copying and pasting into an actual spreadsheet. Pass --json to get JSON instead:

$ yarn --silent start path/to/your/pay-stub.pdf --json

How it works

First, this tool uses pdf2json to extract text tokens from the PDF file. Then it uses some bespoke and extremely fragile parsing logic to extract structured information.
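
To give a feel for what "text tokens" means here, the following is a minimal sketch, not this tool's actual code. It assumes the parsed data follows pdf2json's documented JSON shape (Pages, each with Texts, each with positioned runs R whose text T is URI-encoded); check that against the pdf2json version you have installed. The Token type and extractTokens function are hypothetical names invented for this example.

```typescript
// Assumed subset of the pdf2json output shape. Field names follow its
// documented JSON format, but verify them against your installed version.
interface PdfTextRun { T: string }                 // text, assumed URI-encoded
interface PdfText { x: number; y: number; R: PdfTextRun[] }
interface PdfPage { Texts: PdfText[] }
interface PdfData { Pages: PdfPage[] }

// Hypothetical token type: decoded text plus where it sits on the page.
interface Token { page: number; x: number; y: number; text: string }

// Flatten every page into one list of positioned tokens.
function extractTokens(data: PdfData): Token[] {
  const tokens: Token[] = [];
  data.Pages.forEach((page, pageIndex) => {
    for (const t of page.Texts) {
      // Join the runs of one text element and undo the URI encoding.
      const text = t.R.map((run) => decodeURIComponent(run.T)).join("");
      tokens.push({ page: pageIndex, x: t.x, y: t.y, text });
    }
  });
  // Order by page, then top-to-bottom, then left-to-right.
  return tokens.sort((a, b) => a.page - b.page || a.y - b.y || a.x - b.x);
}

// Made-up sample input, not taken from a real pay stub.
const sample: PdfData = {
  Pages: [{
    Texts: [
      { x: 10, y: 2, R: [{ T: "1%2C234.56" }] },
      { x: 1, y: 2, R: [{ T: "Gross%20Pay" }] },
    ],
  }],
};
console.log(extractTokens(sample).map((t) => t.text));
```

A label like "Gross Pay" and its amount usually arrive as separate tokens, so any parser built on top of this has to pair them up by position, which is one reason the logic is fragile.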

FAQ

Why don't you just download CSV files from Employee Express?

I couldn't figure out how.

This is a bad idea. How do I know the numbers this thing generates match reality?

fed-pay-stub-extractor attempts to sum up all your deductions, subtract them from your gross pay, and check that the net pay it calculates matches what it found in the PDF. This hopefully gives you a little confidence? I don't know, man. Don't use this.
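
If you want to picture that check, here is a rough sketch under some assumptions. The PayStub shape, toCents, and checkNet are hypothetical names made up for this example, and the real tool may structure things differently. Amounts are kept as integer cents so the comparison is exact.

```typescript
// Hypothetical record shape. Amounts are integer cents to avoid float drift.
interface Deduction { label: string; cents: number }
interface PayStub {
  grossCents: number;
  deductions: Deduction[];
  statedNetCents: number;   // the net pay printed on the PDF
}

// Convert a printed amount such as "1,234.56" or "$1,234.56" to cents.
function toCents(amount: string): number {
  return Math.round(parseFloat(amount.replace(/[$,]/g, "")) * 100);
}

// Recompute net pay and compare it with the stated figure.
function checkNet(stub: PayStub): void {
  const totalDeductions = stub.deductions.reduce((sum, d) => sum + d.cents, 0);
  const calculatedNet = stub.grossCents - totalDeductions;
  if (calculatedNet !== stub.statedNetCents) {
    throw new Error(
      "calculated net differs from statement net (" +
        calculatedNet + " vs " + stub.statedNetCents + " cents)"
    );
  }
}

// Made-up numbers, for illustration only.
const stub: PayStub = {
  grossCents: toCents("2,500.00"),
  deductions: [
    { label: "Federal tax", cents: toCents("300.00") },
    { label: "Health insurance", cents: toCents("120.00") },
    { label: "Retirement", cents: toCents("125.00") },
  ],
  statedNetCents: toCents("1,955.00"),
};
checkNet(stub); // passes: 2500.00 - 545.00 = 1955.00
```

Note that a check like this only catches arithmetic mismatches. If a deduction is parsed under the wrong label but with the right amount, the totals still agree.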

You should simply use a large language model to extract all this information.

That's not a question. Also, I found in my testing that local LLMs weren't quite good enough to get "correct" data out of these PDFs and I didn't really feel like turning my pay stubs into training data for API-based models.

I got an error. It says "calculated net differs from statement net"

There are a couple of things this could be:

  1. Your pay stub might have fields on it that mine doesn't. This means that fed-pay-stub-extractor is probably not parsing out all of your deductions.
  2. The stated "net" numbers on your pay stub might include factors that are not documented elsewhere on your pay stub. This can happen if an HR snafu leads to your pay stubs being incorrect for some reason. Double-check the numbers and :fingers_crossed: everything just works out.
