SMART Extractor

Extracts data from GSF PDFs for SMART facilitators. Data is extracted using Tabula (via its Python bindings, tabula-py), then cleaned up. Because PDF data extraction is imperfect, the script also collects emails (strings containing '@') and outputs the set of emails for which no entry was extracted; those entries have to be extracted by hand.
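
The missed-email check described above can be sketched roughly as follows. This is only an illustration of the idea, not the script's actual code: the function names `find_emails` and `missed_emails`, the `email` field name, and the punctuation-stripping rule are all assumptions.

```python
def find_emails(raw_text):
    """Return every whitespace-separated token containing '@', lightly normalized."""
    # Assumption: trailing punctuation is stripped and case is ignored.
    return {tok.strip(".,;:()<>").lower() for tok in raw_text.split() if "@" in tok}


def missed_emails(raw_text, extracted_rows, email_field="email"):
    """Return emails seen in the raw PDF text that have no extracted entry."""
    # Hypothetical row format: each extracted entry is a dict with an email field.
    extracted = {
        row[email_field].strip().lower()
        for row in extracted_rows
        if row.get(email_field)
    }
    return find_emails(raw_text) - extracted
```

A set difference keeps the check simple: anything that looks like an email in the PDF but is not attached to an extracted entry is flagged for manual follow-up.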

Dependencies

Python (tested with 3.6)
Java (7 or 8)
tabula-py

How to install dependencies

Download and install Python and Java, if you don't have them
(optional) Install virtualenv and configure an environment
1. pip install virtualenv
2. virtualenv pdf
3. pdf\Scripts\activate on Windows, or source pdf/bin/activate on Linux/macOS (use deactivate to exit the virtualenv)
Install tabula-py: pip install tabula-py
NOTE: tabula-py relies on pandas; it is strongly recommended that you install it within a virtualenv

How to use this script

Basic usage

python smart_extractor.py input.pdf

This extracts data from input.pdf, saving the extracted entries in output.csv and the missed emails in missed.csv. If output.csv or missed.csv already exists, supply different output names with -o and -m, or use -f to overwrite the existing files.
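
The overwrite rule above might be implemented along these lines. This is a minimal sketch: `check_outputs` is a hypothetical helper, and raising an exception (rather than printing a message and exiting) is an assumption about how the script refuses to overwrite.

```python
import os


def check_outputs(paths, force=False):
    """Raise if any output file already exists and overwriting was not requested."""
    existing = [p for p in paths if os.path.exists(p)]
    if existing and not force:
        # Assumption: without -f, existing output files are never touched.
        raise FileExistsError("refusing to overwrite: " + ", ".join(existing))
```

Running the check before any extraction starts means a long batch does not fail at the very end because an output file was already present.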

Batch convert

python smart_extractor.py input1.pdf path/to/input2.pdf directory/of/pdfs/

You can supply multiple PDF files, as well as directories containing PDFs. The results are collected and written to output.csv and missed.csv.
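
Expanding a mix of files and directories into a flat list of PDFs could look like the sketch below. The helper name `collect_pdfs` is hypothetical, and it assumes directories are scanned one level deep (no recursion) for names ending in .pdf; the actual script may behave differently.

```python
import os


def collect_pdfs(inputs):
    """Expand file and directory arguments into a flat list of PDF paths."""
    pdfs = []
    for path in inputs:
        if os.path.isdir(path):
            # Assumption: only the directory's immediate children are considered.
            for name in sorted(os.listdir(path)):
                if name.lower().endswith(".pdf"):
                    pdfs.append(os.path.join(path, name))
        else:
            # Plain file arguments are passed through unchanged.
            pdfs.append(path)
    return pdfs
```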

All options

usage: smart_extractor.py [-h] [-o OUTPUT] [-m MISSED] [-f] [-q]
                          input [input ...]

Extract GSF data from PDF. Good entries are saved in output.csv; entries that
couldn't be procssed are saved in output-missed.csv.

positional arguments:
  input                 a PDF file, or directory of PDF files, to process

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        filename for extracted entries
  -m MISSED, --missed MISSED
                        filename for emailed not found in extracted entries
  -f, --force           force overwrite of existing output files
  -q, --quiet           don't print processing info to the command line
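
For reference, an argparse definition that would accept the same arguments as the usage message above might look like this. It is a reconstruction from the help text, not the script's source; the default filenames output.csv and missed.csv are taken from the "Basic usage" description and are an assumption here.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the usage message."""
    parser = argparse.ArgumentParser(
        prog="smart_extractor.py",
        description="Extract GSF data from PDF.",
    )
    parser.add_argument("input", nargs="+",
                        help="a PDF file, or directory of PDF files, to process")
    # Assumed defaults, based on the filenames named under "Basic usage".
    parser.add_argument("-o", "--output", default="output.csv",
                        help="filename for extracted entries")
    parser.add_argument("-m", "--missed", default="missed.csv",
                        help="filename for emails not found in extracted entries")
    parser.add_argument("-f", "--force", action="store_true",
                        help="force overwrite of existing output files")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="don't print processing info to the command line")
    return parser
```

For example, `build_parser().parse_args(["input.pdf", "-o", "results.csv", "-f"])` yields a namespace with `input == ["input.pdf"]`, `output == "results.csv"`, and `force == True`.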
