Pabst is named after the legendary American beer 'Pabst Blue Ribbon', because PBR = Parsing Bank Records
- uses the Dash framework
- pdf to text (exp_text_parser.py) is currently better than OCR and is used as the default by main_parser.py
- Put all pdf files inside the ./parsing directory to convert them to csv. The csv files are output in ./data
- /data holds input csv data
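
The directory convention above can be sketched in Python as follows. This is only an illustration: the helper name `planned_outputs` and the rule that each csv keeps the stem of its source pdf are assumptions, not taken from main_parser.py.

```python
from pathlib import Path


def planned_outputs(pdf_dir="./parsing", csv_dir="./data"):
    """Map each pdf in pdf_dir to a csv path in csv_dir.

    The "same stem, .csv extension" naming rule is an assumption
    made for illustration; the real parser may name files differently.
    """
    pdf_dir, csv_dir = Path(pdf_dir), Path(csv_dir)
    if not pdf_dir.is_dir():
        # Nothing to convert if the input directory is missing.
        return {}
    # Match the extension case-insensitively and skip non-pdf files.
    pdfs = sorted(
        p for p in pdf_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return {p: csv_dir / (p.stem + ".csv") for p in pdfs}


if __name__ == "__main__":
    for pdf, csv in planned_outputs().items():
        print(f"{pdf} -> {csv}")
```

Running it from the repository root lists which pdf files would be picked up before starting a real parse.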
For Windows, first install Miniconda (follow its setup to create an env as well), then install Dash with conda:
conda install -c conda-forge pandas dash dash-html-components dash-core-components
Other:
pip install -r requirements.txt
pip install pandas
Running:
activate [name-of-env] # switch to conda env
python app.py # Full main app
Tips: use Python 3. Match dependency builds to your interpreter: install 64-bit dependency versions for 64-bit Python and 32-bit versions for 32-bit Python.
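
If you are unsure which build of Python is active, a short check like the one below reports it. The function name `interpreter_bits` is only illustrative; the check relies on the pointer size reported by the standard `struct` module.

```python
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    # "P" is the struct format code for a native pointer (void *).
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version}, {interpreter_bits()}-bit")
```

Run it inside the activated env so that it reports on the interpreter the app will actually use.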
source venv/bin/activate # Remember to activate your virtualenv
pip install -r requirements.txt
Install imagemagick and libimagemagickdev (names differ by platform). On Ubuntu they are available via apt.
Install tesseract (see above.)
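
A quick way to confirm that the OCR dependencies ended up on your PATH is sketched below. The executable names are assumptions: tesseract normally installs a `tesseract` binary, and ImageMagick installs `magick` (version 7) or `convert` (version 6). On Windows, `convert` can also resolve to an unrelated system utility, so treat a hit on that name with care.

```python
import shutil

# Candidate executable names per tool (assumed, see the note above).
TOOLS = {
    "tesseract": ("tesseract",),
    "imagemagick": ("magick", "convert"),
}


def find_tools(tools=TOOLS, which=shutil.which):
    """Return {tool: path or None}, trying each candidate name in order."""
    found = {}
    for tool, names in tools.items():
        # Keep the first candidate that resolves to a path, else None.
        found[tool] = next((path for path in map(which, names) if path), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```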
- Install Docker (and VirtualBox if applicable). Older machines may need Docker Legacy and Docker Toolbox.
- Run ./setup_pabst to set up the docker container. Warning: this will take a while.
- Run ./pabst <FILENAME> to run OCR parsing on a file.
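
Since ./pabst takes one file at a time, several files can be processed with a small loop. The sketch below is hypothetical: `pabst_commands` and `run_all` are illustrative names, and it assumes the script exits with a non-zero status on failure, which this README does not confirm.

```python
import subprocess


def pabst_commands(filenames, script="./pabst"):
    """Build one command per file, because the script takes a single filename."""
    return [[script, str(name)] for name in filenames]


def run_all(filenames):
    """Run OCR parsing on each file in turn (requires ./setup_pabst to have been run)."""
    for cmd in pabst_commands(filenames):
        # check=True raises CalledProcessError if a run exits with a non-zero status.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    import sys

    run_all(sys.argv[1:])
```

Run it from the directory that contains the ./pabst script, passing the files to parse as arguments.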