saites/smart_extractor

SMART Extractor

Extracts data from GSF PDFs for SMART facilitators. Data is extracted using Tabula (via its Python bindings, tabula-py) and then cleaned up. Because PDF data extraction is imperfect, the script also collects emails (strings containing '@') and outputs the set of emails for which no entry was extracted; these entries will have to be extracted by hand.
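The missed-email check described above can be sketched as a set difference. This is a minimal illustration, not the script's actual code: the function name find_missed_emails, the token-stripping rule, and the sample data are all assumptions made for the example.

```python
def find_missed_emails(raw_tokens, extracted_rows):
    """Return emails seen in the raw text but absent from every extracted row.

    Hypothetical helper: the real script may collect and compare emails differently.
    """
    # Treat any token containing '@' as an email, as the README describes.
    seen = {tok.strip(".,;:()<>").lower() for tok in raw_tokens if "@" in tok}
    # Gather every '@'-bearing cell from the rows that extraction did recover.
    found = {
        str(cell).strip().lower()
        for row in extracted_rows
        for cell in row
        if "@" in str(cell)
    }
    return seen - found


# Made-up sample data: two emails appear in the text, one row was extracted.
raw = "Ana ana@example.org 12 Bo bo@example.org 7".split()
rows = [["Ana", "ana@example.org", 12]]
missed = find_missed_emails(raw, rows)
print(sorted(missed))  # ['bo@example.org']
```

Emails left in the result are the ones whose rows would need to be extracted by hand.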

Dependencies

  • python (tested in 3.6)
  • java (7 or 8)
  • tabula-py

How to install dependencies

  1. Download and install python and Java, if you don't have them
  2. (optional) Install virtualenv and configure an environment
    1. pip install virtualenv
    2. virtualenv pdf
    3. pdf\Scripts\activate on Windows, or source pdf/bin/activate on Linux/macOS (use deactivate to exit the virtualenv)
  3. Install tabula-py: pip install tabula-py
     NOTE: tabula-py relies on pandas; it is strongly recommended that you install this within a virtualenv
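After installing, a quick check along these lines may help confirm that the three dependencies are visible from the active environment. It is only a sketch: check_dependencies is a hypothetical helper that is not part of this repository, and it tests for presence only, not for the Java version.

```python
import importlib.util
import shutil
import sys


def check_dependencies():
    """Report which of the README's three dependencies appear to be available."""
    return {
        "python>=3.6": sys.version_info >= (3, 6),
        # Tabula runs on the JVM, so a `java` executable must be on PATH.
        "java": shutil.which("java") is not None,
        # tabula-py is imported under the module name `tabula`.
        "tabula-py": importlib.util.find_spec("tabula") is not None,
    }


for name, ok in check_dependencies().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it inside the virtualenv (if you created one) so that the tabula-py lookup reflects that environment.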

How to use this script

Basic usage

python smart_extractor.py input.pdf

This will extract data from input.pdf, saving the extracted data in output.csv and the missed emails in missed.csv. If output.csv or missed.csv already exists, you can supply different output names with -o and -m, or use -f to overwrite the existing files.
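The overwrite rule can be pictured roughly as follows. This is a sketch under the assumption that the script refuses to write over an existing file unless -f is given; blocking_files is a hypothetical name, and the script's real check may differ.

```python
import os


def blocking_files(paths, force=False):
    """Return the output paths that already exist and would block a run."""
    if force:
        return []  # with -f, existing files are simply overwritten
    return [p for p in paths if os.path.exists(p)]


blocked = blocking_files(["output.csv", "missed.csv"], force=False)
if blocked:
    print("Refusing to overwrite:", ", ".join(blocked))
```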

Batch convert

python smart_extractor.py input1.pdf path/to/input2.pdf directory/of/pdfs/

You can supply multiple PDF files, as well as directories containing PDFs. The results are collected and written to output.csv and missed.csv.
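One way to expand a mixed list of files and directories into a flat list of PDFs is sketched below. collect_pdfs is a hypothetical helper, and the choice to scan directories non-recursively and match on the .pdf extension is an assumption; the README does not say how the script handles either.

```python
import os


def collect_pdfs(inputs):
    """Expand a mix of PDF paths and directories into a flat list of PDF paths."""
    pdfs = []
    for item in inputs:
        if os.path.isdir(item):
            # Assumption: non-recursive, only PDFs directly inside the directory.
            for name in sorted(os.listdir(item)):
                if name.lower().endswith(".pdf"):
                    pdfs.append(os.path.join(item, name))
        else:
            # Anything that is not a directory is passed through as a file path.
            pdfs.append(item)
    return pdfs


print(collect_pdfs(["input1.pdf", "path/to/input2.pdf"]))
```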

All options

usage: smart_extractor.py [-h] [-o OUTPUT] [-m MISSED] [-f] [-q]
                          input [input ...]

Extract GSF data from PDF. Good entries are saved in output.csv; entries that
couldn't be processed are saved in output-missed.csv.

positional arguments:
  input                 a PDF file, or directory of PDF files, to process

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        filename for extracted entries
  -m MISSED, --missed MISSED
                        filename for emails not found in extracted entries
  -f, --force           force overwrite of existing output files
  -q, --quiet           don't print processing info to the command line
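For reference, an argparse parser that accepts the same options as the usage text above could look like this. It is a reconstruction from the help output, not the script's source; the defaults output.csv and missed.csv are taken from the Basic usage section and are an assumption here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="smart_extractor.py",
        description="Extract GSF data from PDF.",
    )
    parser.add_argument("input", nargs="+",
                        help="a PDF file, or directory of PDF files, to process")
    # Default filenames are assumed from the Basic usage section.
    parser.add_argument("-o", "--output", default="output.csv",
                        help="filename for extracted entries")
    parser.add_argument("-m", "--missed", default="missed.csv",
                        help="filename for emails not found in extracted entries")
    parser.add_argument("-f", "--force", action="store_true",
                        help="force overwrite of existing output files")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="don't print processing info to the command line")
    return parser


args = build_parser().parse_args(["-o", "results.csv", "-f", "a.pdf", "pdfs/"])
print(args.output, args.missed, args.force, args.quiet, args.input)
# results.csv missed.csv True False ['a.pdf', 'pdfs/']
```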
