A Python utility for comparing expected text against OCR-extracted text from images. Perfect for verifying OCR outputs, debugging text recognition, or automatically validating image and text file pairs.
- Image Preprocessing: Automatic grayscale conversion, contrast enhancement, and binarization for improved OCR accuracy
- Tesseract Integration: Leverages Tesseract OCR for robust text extraction
- Smart Comparison: Normalizes, tokenizes, and compares text with multiple similarity metrics
- Detailed Analytics: Computes character ratio, partial ratio, token set ratio, and more
- Word-Level Diff: Shows precise differences and fuzzy correction suggestions
- Watch Mode: Continuously monitors directories for new image/text pairs
- Flexible Output: Exit on mismatch or continue logging for batch processing
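As a rough illustration of the "normalize, tokenize, compare" step, here is a minimal sketch. It uses the standard library's `difflib` as a stand-in for the RapidFuzz metrics the tool reports, so scores may differ slightly from the real output. The function names `normalize`, `tokenize`, `char_ratio`, and `is_consistent` are hypothetical and not taken from `ocr_validation.py`.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def tokenize(text: str) -> List[str]:
    """Split normalized text into word tokens."""
    return normalize(text).split()


def char_ratio(expected: str, found: str) -> int:
    """Character-level similarity on a 0-100 scale (difflib stand-in)."""
    a, b = normalize(expected), normalize(found)
    return round(100 * SequenceMatcher(None, a, b).ratio())


def is_consistent(expected: str, found: str, threshold: int = 80) -> bool:
    """True when the similarity score reaches the threshold."""
    return char_ratio(expected, found) >= threshold
```

Normalizing before scoring means that differences in case, punctuation, or spacing do not count against the OCR result.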
- Python 3.7+ (tested with Python 3.9–3.13)
- Tesseract OCR Engine
1. Install Python Dependencies
pip install pillow pytesseract rapidfuzz
2. Install Tesseract OCR
Windows:
- Download from Tesseract-OCR
- Add Tesseract to your system PATH, or set `pytesseract.pytesseract.tesseract_cmd` to the full path of the Tesseract executable
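If Tesseract is not on your PATH, a small helper can locate it before you set `tesseract_cmd`. The sketch below is one possible approach: `find_tesseract` is a hypothetical helper, and the fallback path is only an assumed install location that you should adjust for your machine.

```python
import os
import shutil
from typing import Iterable, Optional

# Assumed install location; change this to match your own installation.
DEFAULT_WINDOWS_PATHS = (r"C:\Program Files\Tesseract-OCR\tesseract.exe",)


def find_tesseract(extra_paths: Iterable[str] = DEFAULT_WINDOWS_PATHS) -> Optional[str]:
    """Return the Tesseract executable path, or None if it cannot be found."""
    on_path = shutil.which("tesseract")  # honors the system PATH
    if on_path:
        return on_path
    for candidate in extra_paths:
        if os.path.isfile(candidate):
            return candidate
    return None


# Usage (requires pytesseract to be installed):
#   import pytesseract
#   cmd = find_tesseract()
#   if cmd is not None:
#       pytesseract.pytesseract.tesseract_cmd = cmd
```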
Linux (Ubuntu/Debian):
sudo apt update
sudo apt install tesseract-ocr
macOS:
brew install tesseract
3. Clone Repository (Optional)
git clone https://github.com/Lavish-code/OCR-Validator.git
cd OCR-Validator
Validate a single image against expected text:
python ocr_validation.py --image path/to/image.png --text "Expected Text Here"
Common Options:
| Option | Shorthand | Default | Description |
|---|---|---|---|
| `--threshold` | `-th` | `80` | Minimum similarity score (0-100) |
| `--lang` | | `eng` | Tesseract language code |
| `--psm` | | `6` | Page segmentation mode |
| `--oem` | | `3` | OCR engine mode |
| `--no-preprocess` | | `False` | Skip image preprocessing |
| `--debug` | | `False` | Enable debug output |
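For reference, the options above could be declared with `argparse` roughly as follows. This is a sketch based only on the table and the examples in this README, not the script's actual source; `build_parser` and `tesseract_config` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the single-image options with the defaults listed in the table."""
    parser = argparse.ArgumentParser(prog="ocr_validation.py")
    parser.add_argument("-i", "--image", help="Path to the image file")
    parser.add_argument("-t", "--text", help="Expected text")
    parser.add_argument("-th", "--threshold", type=int, default=80,
                        help="Minimum similarity score (0-100)")
    parser.add_argument("--lang", default="eng", help="Tesseract language code")
    parser.add_argument("--psm", type=int, default=6, help="Page segmentation mode")
    parser.add_argument("--oem", type=int, default=3, help="OCR engine mode")
    parser.add_argument("--no-preprocess", action="store_true",
                        help="Skip image preprocessing")
    parser.add_argument("--debug", action="store_true", help="Enable debug output")
    return parser


def tesseract_config(args: argparse.Namespace) -> str:
    """Build the config string that pytesseract forwards to Tesseract."""
    return f"--oem {args.oem} --psm {args.psm}"
```

The resulting string can be passed to `pytesseract.image_to_string(image, lang=args.lang, config=tesseract_config(args))`.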
Monitor a directory for new images and automatically validate them:
python ocr_validation.py --watch --watch-dir path/to/watch_folder
Watch Mode Options:
| Option | Default | Description |
|---|---|---|
| `--image-glob` | `*.png,*.jpg,*.jpeg` | Image file patterns (comma-separated) |
| `--text-exts` | `.txt,.caption,.json` | Sidecar file extensions |
| `--json-key` | `text` | JSON field containing text |
| `--interval` | `1.0` | Polling interval (seconds) |
| `--fail-on-mismatch` | `False` | Exit immediately on first mismatch |
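To make the sidecar options concrete, the sketch below shows one way an image could be matched to its expected text. It assumes a sidecar shares the image's base name and that `--json-key` accepts a dotted path for nested fields; `lookup` and `read_expected_text` are hypothetical helpers, not functions from the script.

```python
import json
from pathlib import Path
from typing import Any, Optional, Sequence


def lookup(data: Any, dotted_key: str) -> Any:
    """Follow a dotted key such as 'data.description' through nested dicts."""
    for part in dotted_key.split("."):
        data = data[part]
    return data


def read_expected_text(image: Path,
                       text_exts: Sequence[str] = (".txt", ".caption", ".json"),
                       json_key: str = "text") -> Optional[str]:
    """Return the text from the first sidecar found next to `image`, or None."""
    for ext in text_exts:
        sidecar = image.with_suffix(ext)
        if not sidecar.is_file():
            continue
        raw = sidecar.read_text(encoding="utf-8")
        if ext == ".json":
            return str(lookup(json.loads(raw), json_key))
        return raw.strip()
    return None
```

Extensions are tried in the order given, so with the defaults a `.txt` file takes precedence over a `.json` file of the same name.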
python ocr_validation.py -i screenshots/output1.png -t "Hello, world!"
Output:
[RESULT] Similarity Metrics (0-100):
- char_ratio: 92
- partial_ratio: 94
- token_sort_ratio: 88
- token_set_ratio: 89
✅ Text & Image look consistent.
python ocr_validation.py -i sample.png -t "Nike Air Shoes"
Output:
[RESULT] Similarity Metrics (0-100):
- char_ratio: 65
❌ Potential mismatch detected!
[DIFFERENCES]
- REPLACE | Expected: 'nike' | Found: 'nikee'
- DELETE | Expected: 'air' | Found: ''
[SUGGESTIONS]
- 'air' → 'ar' (score: 80)
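A word-level diff with suggestions like the one above can be approximated with the standard library. This sketch uses `difflib` rather than RapidFuzz, and it may group adjacent changes differently from the tool (for example, reporting one REPLACE where the tool reports a REPLACE and a DELETE). `word_diff` and `suggest` are hypothetical names.

```python
from difflib import SequenceMatcher, get_close_matches
from typing import List, Tuple


def word_diff(expected: str, found: str) -> List[Tuple[str, str, str]]:
    """Return (operation, expected words, found words) for each changed span."""
    exp, got = expected.lower().split(), found.lower().split()
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, exp, got).get_opcodes():
        if tag != "equal":
            ops.append((tag.upper(), " ".join(exp[i1:i2]), " ".join(got[j1:j2])))
    return ops


def suggest(word: str, candidates: List[str], cutoff: float = 0.6) -> List[Tuple[str, int]]:
    """Return close candidates for `word` with a 0-100 similarity score."""
    matches = get_close_matches(word, candidates, n=3, cutoff=cutoff)
    return [(m, round(100 * SequenceMatcher(None, word, m).ratio())) for m in matches]
```

Here `word_diff` compares whole words, while `suggest` scores a single expected word against the words the OCR actually produced.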
Monitor a folder with image/text pairs:
python ocr_validation.py --watch --watch-dir outputs --fail-on-mismatch
Expected structure:
outputs/
├── img1.png
├── img1.txt
├── img2.png
└── img2.json
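Watching a folder like this comes down to polling for image files that have not been seen yet. The sketch below shows the general idea under that assumption; it is not the script's actual loop, and `new_images`, `watch`, and the `max_polls` parameter are hypothetical.

```python
import time
from pathlib import Path
from typing import Iterable, Iterator, List, Optional, Set


def new_images(watch_dir: Path, patterns: Iterable[str], seen: Set[Path]) -> List[Path]:
    """Return images matching `patterns` that have not been reported before."""
    fresh = []
    for pattern in patterns:
        for path in sorted(watch_dir.glob(pattern)):
            if path not in seen:
                seen.add(path)
                fresh.append(path)
    return fresh


def watch(watch_dir: Path,
          patterns: Iterable[str] = ("*.png", "*.jpg", "*.jpeg"),
          interval: float = 1.0,
          max_polls: Optional[int] = None) -> Iterator[Path]:
    """Yield new images as they appear; `max_polls` bounds the loop for testing."""
    patterns = tuple(patterns)
    seen: Set[Path] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for image in new_images(Path(watch_dir), patterns, seen):
            yield image
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

Each yielded image would then be paired with its sidecar file and validated; a larger `interval` reduces disk activity at the cost of slower detection.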
OCR-Validator/
│
├── ocr_validation.py # Main validation script
├── README.md # Documentation
└── tests/ # Test files (optional)
- Clean images: Use `--no-preprocess` if your images are already optimized
- Poor quality: Keep preprocessing enabled for scanned or low-quality images
- Strict matching: Use threshold ≥ 90 for critical applications
- Fuzzy matching: Use threshold 70-80 for more lenient validation
- Very loose: Use threshold < 70 for experimental setups
For nested JSON structures, use --json-key to specify the field:
python ocr_validation.py -i image.png --text-exts .json --json-key data.description
Use watch mode with logging for unattended batch validation:
python ocr_validation.py --watch --watch-dir batch_folder > validation.log 2>&1
Contributions are welcome! Here are some areas for improvement:
- Add unit tests for normalization and fuzzy matching
- Support additional OCR engines (EasyOCR, PaddleOCR)
- Export results in CSV/JSON/HTML formats
- Multi-language support and custom dictionaries
- Graphical interface for easier operation
- Parallel processing for batch operations
To contribute:
- Fork the repository
- Create a feature branch
- Submit a pull request with tests
This project is open source. Check the repository for license details.
Issue: TesseractNotFoundError
- Solution: Ensure Tesseract is installed and in your PATH
Issue: Low accuracy scores
- Solution: Try adjusting `--psm` values (3 for fully automatic, 6 for uniform block)
Issue: Slow processing
- Solution: Use `--no-preprocess` or increase `--interval` in watch mode
For issues, questions, or feature requests, please visit the GitHub repository.
**Made with ❤️ for the OCR community**