This repo contains a small utility for extracting structured data from pay stub PDFs generated by Employee Express.
- Node.js (see .nvmrc for recommended version)
- Yarn
First of all, you need to go to Employee Express and download PDFs of your pay stubs. If you have a lot, this will take you a while.
Then, install the dependencies and build the project:
$ yarn && yarn build
Then run the script using yarn start, passing in paths to PDF files:
$ yarn --silent start path/to/your/pay-stub.pdf
You can pass multiple PDF files by adding them to the command line:
$ yarn --silent start path/to/your/pay-stubs/*.pdf
By default, the output is CSV, suitable for copying and pasting into an actual spreadsheet. You can also get JSON if you want:
$ yarn --silent start path/to/your/pay-stub.pdf --json
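To make the difference between the two output modes concrete, here is a minimal sketch of rendering the same rows as either CSV or JSON. The `Row` type, the `toCsv` and `render` functions, and the field names in the example are all hypothetical and only for illustration; the actual columns and code in this tool may differ.

```typescript
// Hypothetical row shape: one object per pay stub, keyed by column name.
type Row = Record<string, string | number>;

// Quote a value only when it contains a comma, a double quote, or a newline.
function escapeCsv(value: string | number): string {
  const s = String(value);
  return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Render rows as CSV, taking the column order from the first row.
function toCsv(rows: Row[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = [headers.map(escapeCsv).join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => escapeCsv(row[h] ?? "")).join(","));
  }
  return lines.join("\n");
}

// Pick the output format, mirroring the idea of a --json switch.
function render(rows: Row[], asJson: boolean): string {
  return asJson ? JSON.stringify(rows, null, 2) : toCsv(rows);
}

// Example data (made up).
const rows: Row[] = [{ payDate: "2024-01-12", grossPay: 3000, note: "regular, biweekly" }];
console.log(render(rows, false));
console.log(render(rows, true));
```

CSV pastes straight into a spreadsheet, while JSON is easier to feed into another script.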
First, this tool uses pdf2json to extract text tokens from the PDF file. Then it uses some bespoke and extremely fragile parsing logic to extract structured information.
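As a rough illustration of the first step, the sketch below flattens parsed PDF output into positioned text tokens. It assumes the parsed data has the `Pages` / `Texts` / `R` / `T` shape that pdf2json has commonly produced, with `T` URI-encoded; check this against the pdf2json version you have installed. The interfaces and the `extractTokens` function are hypothetical and are not this tool's actual code.

```typescript
// Assumed subset of pdf2json's parsed output; verify against your installed version.
interface TextRun { T: string }                        // text, assumed URI-encoded
interface TextItem { x: number; y: number; R: TextRun[] }
interface Page { Texts: TextItem[] }
interface ParsedPdf { Pages: Page[] }

// Hypothetical token type: a piece of text plus where it sits on the page.
interface Token { page: number; x: number; y: number; text: string }

function extractTokens(pdf: ParsedPdf): Token[] {
  const tokens: Token[] = [];
  pdf.Pages.forEach((page, pageIndex) => {
    for (const item of page.Texts) {
      // Join the runs of one text item and undo the assumed URI encoding.
      const text = item.R.map((run) => decodeURIComponent(run.T)).join("");
      tokens.push({ page: pageIndex, x: item.x, y: item.y, text });
    }
  });
  // Reading order: by page, then top to bottom, then left to right.
  return tokens.sort((a, b) => a.page - b.page || a.y - b.y || a.x - b.x);
}

// Made-up sample input in the assumed shape.
const sample: ParsedPdf = {
  Pages: [{
    Texts: [
      { x: 20, y: 5, R: [{ T: "3000.00" }] },
      { x: 2, y: 5, R: [{ T: "Gross%20Pay" }] },
    ],
  }],
};
console.log(extractTokens(sample).map((t) => t.text));
```

Once tokens are in reading order, a parser can pair a label such as "Gross Pay" with the number to its right, which is also why small layout changes can break this kind of parsing.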
I couldn't figure out how.
fed-pay-stub-extractor attempts to sum up all your deductions, subtract them from your gross pay, and check that the net pay it calculates matches what it found in the PDF. This hopefully gives you a little confidence? I don't know, man. Don't use this.
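The check described above can be sketched roughly as follows. The `PayStub` shape, the field names, and the `checkNetPay` function are hypothetical; the real tool's data model may look different. The sketch compares amounts in whole cents to sidestep floating-point rounding, which is a choice made here and not necessarily what the tool does.

```typescript
// Hypothetical parsed pay stub; amounts are in dollars.
interface PayStub {
  grossPay: number;
  deductions: Record<string, number>;
  netPay: number; // net pay as printed on the stub
}

// Convert dollars to integer cents so the comparison is exact.
const toCents = (dollars: number): number => Math.round(dollars * 100);

// Sum the deductions, subtract them from gross pay, and compare with the stated net.
function checkNetPay(stub: PayStub): { computedNet: number; matches: boolean } {
  const deductionCents = Object.values(stub.deductions)
    .reduce((sum, amount) => sum + toCents(amount), 0);
  const computedCents = toCents(stub.grossPay) - deductionCents;
  return {
    computedNet: computedCents / 100,
    matches: computedCents === toCents(stub.netPay),
  };
}

// Made-up numbers for illustration.
const example: PayStub = {
  grossPay: 3000.0,
  deductions: { federalTax: 412.35, retirement: 150.1, healthInsurance: 98.76 },
  netPay: 2338.79,
};
console.log(checkNetPay(example));
```

If `matches` comes back false, either a deduction was missed during parsing or the stub's stated net pay includes something not itemized on it.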
That's not a question. Also, I found in my testing that local LLMs weren't quite good enough to get "correct" data out of these PDFs and I didn't really feel like turning my pay stubs into training data for API-based models.
There are a couple of things this could be:
- Your pay stub might have fields on it that mine doesn't. This means that fed-pay-stub-extractor is probably not parsing out all of your deductions.
- The stated "net" numbers on your pay stub might include factors that are not documented elsewhere on your pay stub. This can happen if an HR snafu leads to your pay stubs being incorrect for some reason. Double-check the numbers and :fingers_crossed: everything just works out.