FULL OCR PREPROCESSING TOOL

This tool covers the full pipeline from PDF files to data prepared for machine learning.

Pipeline

Classification problem:

We have a set of (pdf, target) pairs, for example (receipt.pdf, 1 if fraud else 0).

Pipeline for that problem:

pdf documents -> png images (using the pdf_to_png module)
png files -> text recognized by the OCR tool (using the image_to_txt module)
text from png files -> GloVe vectors (using the text_to_vector module)
collect a (vectors, target) pair for every (pdf, target) pair
now we have numerical data -> train your model to classify fraud!
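
The text-to-vector step can be sketched without any dependencies. The snippet below is a minimal illustration, not the code of the text_to_vector module: the helper names load_glove and text_to_vectors, the zero vector for unknown words, and the zero padding are all assumptions. It relies only on the standard GloVe text format, where each line holds a word followed by its space-separated vector components.

```python
import re


def load_glove(lines):
    """Parse GloVe text-format lines ("word v1 v2 ...") into a dict."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(v) for v in parts[1:]]
    return table


def text_to_vectors(text, table, words_num):
    """Map the first `words_num` words of `text` to vectors.

    Unknown words and missing positions become zero vectors (an assumption),
    so the result always has shape [words_num, dim].
    """
    dim = len(next(iter(table.values())))
    zero = [0.0] * dim
    words = re.findall(r"[a-z0-9']+", text.lower())[:words_num]
    vectors = [list(table.get(w, zero)) for w in words]
    vectors += [list(zero) for _ in range(words_num - len(vectors))]
    return vectors


# Toy 3-dimensional "embeddings" standing in for a real GloVe file.
toy_glove = load_glove(["total 0.1 0.2 0.3", "paid 0.4 0.5 0.6"])
page = text_to_vectors("Total paid: 42 EUR", toy_glove, words_num=5)
```

Padding every page to the same number of words is what would let the per-page results be stacked into one fixed-size array later.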

Installation

Download this repo as a zip or with git clone.
In the repository root (the directory containing ocr_pipe and setup.py), run pip install .

Usage

Prepare your dataset as a csv file like the one below (or use labeled folders):

files.csv

pdf, target
data_dir\good_ticket1.pdf, 0
data_dir\bad_ticket1.pdf, 1
data_dir\bad_ticket2.pdf, 1
....
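
A minimal sketch of how such a file could be read into (pdf, target) pairs with the standard library. How ocr_pipe.run parses the file is not shown here, so the helper name read_pairs is hypothetical; skipinitialspace=True handles the space after each comma in the example above.

```python
import csv
import io


def read_pairs(handle):
    """Return a list of (pdf_path, target) tuples from an open CSV file."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [(row["pdf"], int(row["target"])) for row in reader]


# In-memory stand-in for files.csv, using rows from the example above.
sample = io.StringIO(
    "pdf, target\n"
    "data_dir\\good_ticket1.pdf, 0\n"
    "data_dir\\bad_ticket1.pdf, 1\n"
)
pairs = read_pairs(sample)
```

For a file on disk, the same helper can be given the handle returned by open("files.csv", newline="").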

Write a script

import ocr_pipe

preprocessor = ocr_pipe.PDFToVectors(words_num=50)
processed_frame = ocr_pipe.run("files.csv", preprocessor)
print(processed_frame.head())

# pdf, target, partition, words, features (word embeddings)
# ticket1.pdf, 0, 0, "first page text..", [0.4, 0.5, 0.7 ...]
# ticket1.pdf, 0, 1, "second page text..", [0.4, 0.6, 0.9 ...]
# ticket2.pdf, 1, 0, "fraud text in another document", [0.4, 0.6, 0.9 ...]

X = processed_frame.features.values
print(X.shape) # with 50 words and embeddings of length 25 this is [dataset size, 50, 25]

y = processed_frame.target.values
print(y.shape) # same length as dataset size

X_train, X_test, y_train, y_test = ...
model.fit(X_train, y_train)
....

# Train your model!
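
Because each PDF contributes one row per page (the partition column), a random row-level split can place pages of the same document in both the training and the test set. The following is a sketch of a document-level split, assuming the rows are available as (pdf, target, features) tuples; split_by_document and its test_fraction parameter are hypothetical names, not part of ocr_pipe.

```python
import random


def split_by_document(rows, test_fraction=0.25, seed=0):
    """Split rows so that all pages of one PDF land on the same side."""
    docs = sorted({pdf for pdf, _, _ in rows})
    random.Random(seed).shuffle(docs)
    n_test = max(1, round(len(docs) * test_fraction))
    test_docs = set(docs[:n_test])
    train = [r for r in rows if r[0] not in test_docs]
    test = [r for r in rows if r[0] in test_docs]
    return train, test


# Toy rows shaped like the frame above: (pdf, target, features).
rows = [
    ("ticket1.pdf", 0, [0.4, 0.5]),  # page 0
    ("ticket1.pdf", 0, [0.4, 0.6]),  # page 1
    ("ticket2.pdf", 1, [0.4, 0.6]),
    ("ticket3.pdf", 0, [0.1, 0.2]),
    ("ticket4.pdf", 1, [0.9, 0.8]),
]
train, test = split_by_document(rows)
X_train = [features for _, _, features in train]
y_train = [target for _, target, _ in train]
```

Splitting on the pdf name rather than on rows keeps the test score from being inflated by near-duplicate pages of documents the model has already seen.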

sokolegg/ocr_project
