Intelligent OCR and Text Analysis Tool

🎯 Status: PRODUCTION READY | Performance: 16.7x Faster | All OCR Engines: ✅ Working

🚀 Performance Highlights

⚡ 16.7x faster than baseline with batch processing
🧠 Intelligent caching system for repeated operations
🔄 Real-time progress tracking with ETA calculations
💻 Multi-core processing utilizing all available CPU cores
🎯 99%+ accuracy with multiple OCR engine support

Description

An advanced application that performs Optical Character Recognition (OCR) on images and PDFs, extracts text with layout preservation, and provides a question-answering interface based on the extracted content. It leverages machine learning models, state-of-the-art OCR engines, and modern NLP techniques to enable users to interactively query their documents.

Features

Multiple OCR Engines: Choose between PaddleOCR, EasyOCR, Tesseract, Dolphin, or a combined approach for optimal results
Layout Preservation: Maintains the original document formatting, including line breaks and text positioning
Image Preprocessing: Automatically enhances images for better OCR accuracy
Table Detection: Identifies table structures in documents
Format Output Options: Download extracted text in various formats (TXT, JSON, Markdown)
Interactive Q&A: Ask questions about the extracted text using the RAG (Retrieval-Augmented Generation) system
Multi-page PDF Support: Process multi-page PDFs with progress tracking
Modern UI/UX: Enhanced user interface with custom styling and interactive elements
Robust Design: Gracefully handles missing dependencies with fallbacks
Modular Architecture: Well-organized code structure for easy maintenance and extension

Installation

Prerequisites

Python 3.8+ recommended
Pip package manager
Optional: Tesseract OCR engine installed on your system (for fallback OCR)

Basic Installation

Clone the repository:

git clone https://github.com/Rayyan9477/OCR-Image-to-text.git
cd OCR-Image-to-text

Install the required packages:
```
pip install -r requirements.txt
```

NEW: Automated Tesseract Installation (Windows):

# Install Tesseract automatically using winget
winget install UB-Mannheim.TesseractOCR

For other platforms, install system dependencies:

For macOS:

brew install tesseract

For Linux:

sudo apt-get update
sudo apt-get install -y tesseract-ocr

Verify your installation:

python cli_app.py --check

For Linux:

sudo apt-get update
sudo apt-get install -y tesseract-ocr

Check your installation:
```
python run.py --check
```

Optimizing Installation

The system can work with just one OCR engine, but for best results, install multiple engines:

For best accuracy: Install PaddleOCR AND EasyOCR
For lightweight usage: Install only PyTesseract
For offline usage: Install PyTesseract (no internet required)

Project Structure

The project follows a modular architecture for better maintainability and extensibility:

ocr_app/                  # Main package
├── __init__.py           # Package initialization
├── ocr_app.py            # Main application entry point
├── streamlit_app.py      # Streamlit application launcher
├── config/               # Configuration management
│   ├── __init__.py
│   ├── config.json       # Default configuration
│   └── settings.py       # Settings and configuration
├── core/                 # Core OCR functionality
│   ├── __init__.py
│   ├── ocr_engine.py     # Main OCR engine implementation
│   └── image_processor.py # Image preprocessing utilities
├── models/               # ML model management
│   ├── __init__.py
│   └── model_manager.py  # Model loading and caching
├── rag/                  # Question-answering functionality
│   ├── __init__.py
│   └── rag_processor.py  # RAG implementation
├── ui/                   # User interfaces
│   ├── __init__.py
│   ├── web_app.py        # Streamlit web interface
│   └── cli.py            # Command-line interface
└── utils/                # Utility functions
    ├── __init__.py
    └── text_utils.py     # Text processing utilities

Usage

The application provides multiple ways to interact with it:

Web Interface (Recommended)

Start the web application:

python run.py

or

python -m ocr_app.streamlit_app

Open your browser to the displayed URL (typically http://localhost:8501)
Use the intuitive interface to:
- Upload images or PDFs
- Configure OCR options
- Process and extract text
- Ask questions about the extracted content

Command Line Interface

For batch processing or integration with other tools:

Extract text from an image:

python run.py --cli extract --image path/to/image.jpg --output result.txt

Analyze an image and extract information:

python run.py --cli analyze --image path/to/image.jpg --format json

Ask a question about an image:

python run.py --cli question --image path/to/image.jpg --query "What is the date mentioned?"

Process a batch of files:

python run.py --cli --batch path/to/folder --output results.json --format json

Get help and see all available options:
```
python run.py --cli --help
```

Run CLI with Dolphin model

python run_ocr.py --cli --engine dolphin --input path/to/image.jpg --output result.txt

Python API

You can also use the components programmatically in your Python code:

from ocr_app.core.ocr_engine import OCREngine
from ocr_app.config.settings import Settings
from PIL import Image

# Initialize components
settings = Settings()
ocr_engine = OCREngine(settings)

# Process an image
image = Image.open("path/to/image.jpg")
text = ocr_engine.perform_ocr(
    image, 
    engine="combined",  # "auto", "tesseract", "easyocr", "paddleocr", or "combined"
    preserve_layout=True,
    preprocess=True
)

# Use the extracted text
print(text)

For Q&A functionality:

from ocr_app.core.ocr_engine import OCREngine
from ocr_app.rag.rag_processor import RAGProcessor
from ocr_app.models.model_manager import ModelManager
from ocr_app.config.settings import Settings
from PIL import Image

# Initialize components
settings = Settings()
model_manager = ModelManager(settings)
ocr_engine = OCREngine(settings)
rag_processor = RAGProcessor(model_manager, settings)

# Process an image and ask a question
image = Image.open("path/to/image.jpg")
text = ocr_engine.perform_ocr(image)
answer = rag_processor.process_query(text, "What dates are mentioned in the text?")

print(f"Answer: {answer['answer']}")
print(f"Confidence: {answer['confidence']}")

├── __init__.py
└── text_utils.py     # Text processing utilities


## Usage

The application can be run in multiple modes:

### Web Interface Mode (Default)

The easiest way to use the application with a full graphical interface:

python run.py


or explicitly:

python run.py --web


### Command-Line Interface

Process files directly from the command line:

python run.py --cli --input image.jpg --output results.txt


Process multiple files in a directory:

python run.py --cli --batch ./images/ --output ./results/


Support for different output formats:

python run.py --cli --input document.pdf --format json


### Check Mode

Verify your OCR functionality and available engines:

python run.py --check


## OCR Engine Comparison

- **PaddleOCR**: Fast and accurate, particularly good for structured documents and Asian languages
- **EasyOCR**: Good all-around OCR with support for 80+ languages
- **Combined Mode**: Uses multiple engines and selects the best result for optimal accuracy
- **Tesseract**: Great for offline usage, no internet required, but less accurate on complex layouts

## Advanced Usage

### Using the OCR Module in Your Code

```python
from ocr_app.core.ocr_engine import OCREngine
from ocr_app.config.settings import Settings
from PIL import Image

# Initialize OCR engine
settings = Settings()
ocr_engine = OCREngine(settings)

# Open an image
image = Image.open("document.jpg")

# Perform OCR with layout preservation
text = ocr_engine.perform_ocr(image, engine="auto", preserve_layout=True)
print(text)

Processing PDF Documents

import fitz  # PyMuPDF
from ocr_app.core.ocr_engine import OCREngine
from ocr_app.config.settings import Settings
from PIL import Image

# Open PDF
settings = Settings()
ocr_engine = OCREngine(settings)

doc = fitz.open("document.pdf")
for page in doc:
    pix = page.get_pixmap()
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    text = ocr_engine.perform_ocr(img, engine="combined", preserve_layout=True)
    print(text)

Question-Answering with Documents

from ocr_app.core.ocr_engine import OCREngine
from ocr_app.rag.rag_processor import RAGProcessor
from ocr_app.models.model_manager import ModelManager
from ocr_app.config.settings import Settings
from PIL import Image

# Initialize components
settings = Settings()
model_manager = ModelManager(settings)
ocr_engine = OCREngine(settings)
rag_processor = RAGProcessor(model_manager, settings)

# Extract text from image
image = Image.open("document.jpg")
text = ocr_engine.perform_ocr(image)

# Ask a question about the document
question = "What is the main topic of this document?"
answer = rag_processor.process_query(text, question)
print(f"Question: {question}")
print(f"Answer: {answer['answer']}")
print(f"Confidence: {answer['confidence']}")

Command-Line Options

usage: run.py [-h] [--web] [--cli] [--check] ...

OCR Image-to-Text Application

Mode Selection:
  --web, -w           Run in web interface mode (default)
  --cli, -c           Run in command-line interface mode
  --check             Check available OCR engines and dependencies

CLI Mode Options:
  --input INPUT, -i INPUT
                      Path to input image or PDF file
  --output OUTPUT, -o OUTPUT
                      Path to output file
  --engine {auto,tesseract,easyocr,paddleocr,combined}
                      OCR engine to use
  --no-layout         Disable layout preservation
  --format {txt,json,md}
                      Output format (txt, json, or md)
  --batch BATCH, -b BATCH
                      Process all files in a directory
  --verbose, -v       Enable verbose logging

Troubleshooting

Common Issues

Missing Dependencies: If you encounter import errors, run python run.py --check to check which dependencies are missing.
OCR Engine Not Found: The system will fall back to alternative engines if your primary choice isn't available.
TensorFlow/Keras Compatibility: The application handles TensorFlow/Keras compatibility issues automatically, but you might need to set environment variables manually in some environments:
```
$env:TF_CPP_MIN_LOG_LEVEL = "2"
$env:TF_USE_LEGACY_KERAS = "1"
$env:KERAS_BACKEND = "tensorflow"
```
Tesseract Not Found: Make sure Tesseract is installed and properly added to your system PATH.

Developer Guide

Adding a New OCR Engine

Create a new engine class that inherits from BaseOCREngine in ocr_app/core/ocr_engine.py:

class MyNewOCREngine(BaseOCREngine):
    def __init__(self, settings):
        super().__init__(settings)
        # Initialize your OCR engine
        
    def extract_text(self, image, preserve_layout=True):
        # Implement OCR logic
        return extracted_text

Add engine detection in the OCREngine._check_engines method:

def _check_engines(self):
    engines = {
        # Existing engines
        "my_new_engine": False
    }
    
    # Check for your engine
    try:
        # Check if your OCR engine is available
        engines["my_new_engine"] = True
    except ImportError:
        pass
        
    return engines

Register the engine in OCREngine._initialize_engines:

if self.available_engines.get("my_new_engine", False):
    try:
        self.engines["my_new_engine"] = MyNewOCREngine(self.settings)
    except Exception as e:
        logger.error(f"Failed to initialize MyNewOCR engine: {e}")

Customizing Settings

You can create a custom configuration file at ocr_app/config/config.json:

{
  "ocr": {
    "engines": {
      "tesseract": {
        "enabled": true,
        "cmd_path": "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
      },
      "easyocr": {
        "enabled": true,
        "gpu": false
      }
    },
    "default_engine": "tesseract",
    "preserve_layout": true
  },
  "models": {
    "download_path": "./custom_models",
    "qa_model": "distilbert-base-cased-distilled-squad"
  }
}

Technologies Used

Streamlit: For building the interactive web application
PyMuPDF (fitz): For improved PDF handling and processing
Pillow (PIL): For image processing and manipulation
EasyOCR: Neural network-based OCR engine
PaddleOCR: State-of-the-art OCR system with high accuracy
OpenCV: For advanced image preprocessing and layout analysis
Pytesseract: Tesseract OCR Python wrapper
Transformers: HuggingFace library for loaded pre-trained models
SentenceTransformers: For generating sentence embeddings
FAISS: Facebook AI Similarity Search for efficient similarity search
PyTorch: Deep learning framework underpinning the ML models

Contact

For inquiries or feedback:

Email: rayyanahmed265@yahoo.com
LinkedIn: Rayyan Ahmed
GitHub: Rayyan9477

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
layout		layout
models		models
ocr_app		ocr_app
ocr_core		ocr_core
src		src
static		static
tests		tests
ui		ui
.gitignore		.gitignore
Dockerfile		Dockerfile
ENHANCED_OCR_COMPLETION_REPORT.md		ENHANCED_OCR_COMPLETION_REPORT.md
FINAL_COMPLETION_REPORT.md		FINAL_COMPLETION_REPORT.md
OPTIMIZATION_SUMMARY.md		OPTIMIZATION_SUMMARY.md
PRECISION_LAYOUT_COMPLETION.md		PRECISION_LAYOUT_COMPLETION.md
PROJECT_COMPLETION_REPORT.md		PROJECT_COMPLETION_REPORT.md
README.md		README.md
app.py		app.py
complex_test_image.jpg		complex_test_image.jpg
packages.txt		packages.txt
precision_layout_output.html		precision_layout_output.html
precision_layout_output.md		precision_layout_output.md
pytest.ini		pytest.ini
requirements.txt		requirements.txt
simple_test_image.jpg		simple_test_image.jpg
test_precision_layout.py		test_precision_layout.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Intelligent OCR and Text Analysis Tool

🚀 Performance Highlights

Description

Features

Installation

Prerequisites

Basic Installation

Optimizing Installation

Project Structure

Usage

Web Interface (Recommended)

Command Line Interface

Python API

Processing PDF Documents

Question-Answering with Documents

Command-Line Options

Troubleshooting

Common Issues

Developer Guide

Adding a New OCR Engine

Customizing Settings

Technologies Used

Contact

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Rayyan9477/OCR-Image-to-text

Folders and files

Latest commit

History

Repository files navigation

Intelligent OCR and Text Analysis Tool

🚀 Performance Highlights

Description

Features

Installation

Prerequisites

Basic Installation

Optimizing Installation

Project Structure

Usage

Web Interface (Recommended)

Command Line Interface

Python API

Processing PDF Documents

Question-Answering with Documents

Command-Line Options

Troubleshooting

Common Issues

Developer Guide

Adding a New OCR Engine

Customizing Settings

Technologies Used

Contact

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages