Mini-IronRipper

Mini-IronRipper is a Python application that extracts text and metadata from PDFs, Word and PowerPoint documents, Excel files, CSVs, images, and Outlook MSG emails. It uses OCR to handle scanned documents and images.

Features

Extract text from PDFs using pdfminer and OCR
Convert Word and PowerPoint docs to PDF for processing
Extract text, metadata, and sheet data from Excel files
Read CSV files into Pandas DataFrames
Perform OCR on images to extract text
Extract text and metadata from Outlook MSG files
Multithreading support for faster processing
Save extracted info to CSV and text files
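
The features above amount to choosing an extractor by file type and running the extractors in parallel. The following is a minimal sketch of that idea, not the project's actual code: the HANDLERS mapping, the classify and process_all names, and the extension list are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical mapping from file extension to a handler label; the real
# project may organize its extractors differently.
HANDLERS = {
    ".pdf": "pdf",
    ".docx": "office",
    ".pptx": "office",
    ".xlsx": "excel",
    ".csv": "csv",
    ".png": "image",
    ".jpg": "image",
    ".msg": "email",
}


def classify(path):
    """Return the handler label for a file, or None if it is unsupported."""
    return HANDLERS.get(Path(path).suffix.lower())


def process_all(paths, extract, threads=4):
    """Call extract(path, kind) for every supported file using a thread pool.

    Returns a dict mapping each supported path to the extractor's result.
    """
    supported = [(p, classify(p)) for p in paths]
    supported = [(p, kind) for p, kind in supported if kind is not None]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with `supported`.
        results = pool.map(lambda item: extract(*item), supported)
        return {p: r for (p, _), r in zip(supported, results)}


# Stand-in extractor used only for demonstration; it does no real parsing.
demo = process_all(
    ["a.pdf", "b.txt", "c.MSG"],
    lambda path, kind: f"{kind}:{path}",
    threads=2,
)
```

Here b.txt is skipped because its extension is not in the mapping, and c.MSG is matched case-insensitively. Threads suit this workload when most of the time is spent waiting on file I/O or external OCR processes.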

Usage

Mini-IronRipper can be run as a CLI application:

python mini-iron-ripper.py <directory> <question> <threads>

<directory> - Path to the folder containing files for processing
<question> - Question to ask about the documents
<threads> - Number of threads to use for parallel processing
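
As an illustration of how the three positional arguments could be handled, here is a small sketch using the standard argparse module. It is an assumption about the interface, not a copy of mini-iron-ripper.py, which may read its arguments differently; the parse_args name and the help strings are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional arguments described above."""
    parser = argparse.ArgumentParser(
        prog="mini-iron-ripper.py",
        description="Extract text and metadata from the files in a folder.",
    )
    parser.add_argument("directory", help="Path to the folder containing files")
    parser.add_argument("question", help="Question to ask about the documents")
    parser.add_argument("threads", type=int, help="Number of worker threads")
    return parser.parse_args(argv)


# Example invocation with an explicit argument list (values are made up).
args = parse_args(["./documents", "What is the invoice total?", "4"])
```

Passing type=int means a non-numeric thread count is rejected with a usage message before any file is processed.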

Running this command extracts text and metadata from the supported files in the given folder and saves the results to CSV and text files.

The text file contains the extracted text formatted for asking a question to an AI assistant.
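
The two output files described above could be produced along the following lines. This is a sketch under stated assumptions: the save_results function, the file names extracted.csv and prompt.txt, the "file" and "text" columns, and the prompt layout are all hypothetical and may not match what the tool actually writes.

```python
import csv
from pathlib import Path


def save_results(records, question, out_dir):
    """Write extracted records to a CSV file and a question-ready text file.

    records is a list of dicts with "file" and "text" keys (assumed layout).
    Returns the paths of the CSV file and the text file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "extracted.csv"  # hypothetical file name
    txt_path = out_dir / "prompt.txt"     # hypothetical file name

    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "text"])
        writer.writeheader()
        writer.writerows(records)

    # Concatenate each document under a header, then append the question.
    parts = [f"--- {r['file']} ---\n{r['text']}" for r in records]
    prompt = "\n\n".join(parts) + f"\n\nQuestion: {question}\n"
    txt_path.write_text(prompt, encoding="utf-8")
    return csv_path, txt_path
```

The text file places the question after the document contents so that the whole file can be pasted into an AI assistant as a single prompt.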

Requirements

Python 3.6+
Requirements listed in requirements.txt:
- pdfminer
- pytesseract
- pandas
- openpyxl
- extract_msg
- pdf2image
- Pillow
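
Before a first run it can help to confirm that the packages above are importable. The sketch below uses only the standard library; the missing_modules helper is a hypothetical name, and the list uses import names rather than package names (Pillow is imported as PIL). It checks Python packages only: pytesseract and pdf2image also depend on the Tesseract and Poppler system binaries, which this check does not cover.

```python
from importlib.util import find_spec

# Import names for the packages listed above. Pillow is imported as "PIL".
REQUIRED_MODULES = [
    "pdfminer",
    "pytesseract",
    "pandas",
    "openpyxl",
    "extract_msg",
    "pdf2image",
    "PIL",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from the list that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# An empty list means every required package was found.
missing = missing_modules()
```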

Docker Container

A Dockerfile is provided to build an image with the required dependencies.

Build:

docker build -t mini-iron-ripper .

Run:

docker run -it --rm -v "$(pwd)":/app mini-iron-ripper <args>

Credits

Mini-IronRipper was created by Josh Yorko. It uses the following open source libraries:

pdfminer for PDF parsing
pytesseract for OCR
Pandas for data processing
OpenPyXL for Excel files
extract_msg for Outlook MSG files
