This repository provides a processing pipeline for assessing OCR quality in digitized newspaper collections within the Impresso project ecosystem. It uses Bloom filters and statistical methods to evaluate text recognition accuracy and generate comprehensive quality metrics.
- Overview
- Quick Start
- Configuration
- Build System
- Quality Assessment Methods
- Advanced Usage
- Contributing
- About Impresso
## Overview

This pipeline provides a complete framework for OCR quality assessment that:
- Evaluates OCR Accuracy: Uses Bloom filter-based lexical matching to assess text recognition quality
- Supports Multiple Languages: Process multilingual collections with language-specific dictionaries
- Scales Horizontally: Process data across multiple machines without conflicts
- Handles Large Datasets: Efficiently process large collections using S3 and local stamp files
- Maintains Consistency: Ensure reproducible results with proper dependency management
- Integrates with S3: Seamlessly work with both local files and S3 storage
The repository is organized as follows:

```
├── README.md                 # This file
├── Makefile                  # Main build configuration
├── .env                      # Environment variables (to be created manually from dotenv.sample)
├── dotenv.sample             # Sample environment configuration
├── Pipfile                   # Python dependencies
├── lib/
│   └── ocrqa_bloom.py        # OCR quality assessment script
├── cookbook/                 # Build system components
│   ├── README.md             # Detailed cookbook documentation
│   ├── setup_ocrqa.mk        # OCR QA-specific setup
│   ├── paths_ocrqa.mk        # Path definitions
│   ├── sync_ocrqa.mk         # Data synchronization
│   ├── processing_ocrqa.mk   # Processing targets
│   └── ...                   # Other cookbook components
├── cookbook-repo-addons/     # Repository-specific extensions
│   ├── config-lb-unknowns.mk # Luxembourgish unknown words config
│   └── ...                   # Other config files
├── configs/                  # Version-specific configurations
│   ├── config-ocrqa-ocrqa-wp_v1.0.6_v1-0-0.mk
│   └── config-ocrqa-ocrqa-wp_v1.0.6_v1-0-1.mk
└── build.d/                  # Local build directory (auto-created)
```
Note on Versioning: The version of this repository should reflect the latest configuration found in the `configs/` directory. This ensures alignment between the codebase and the processing configurations used for production runs.
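As a sketch (not a documented invocation), the `CONFIG_LOCAL_MAKE` mechanism shown under Advanced Usage below can presumably also point at one of these versioned files:

```bash
# Assumption: CONFIG_LOCAL_MAKE accepts the versioned files in configs/ as well
CONFIG_LOCAL_MAKE=configs/config-ocrqa-ocrqa-wp_v1.0.6_v1-0-1.mk make all
```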
## Quick Start

Follow these steps to get started with OCR quality assessment:
Ensure you have the required system dependencies installed:
- Python 3.11+
- Make (GNU Make recommended)
- Git with Git LFS
- GNU Parallel
- jq (for aggregations)
Ubuntu/Debian:

```bash
sudo apt-get update
sudo apt-get install -y make git git-lfs parallel coreutils python3 python3-pip jq
```

macOS:

```bash
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install dependencies
brew install make git git-lfs parallel coreutils python3 jq
```
1. Clone the repository:

   ```bash
   git clone --recursive https://github.com/impresso/impresso-ocrqa-unigram-cookbook.git
   cd impresso-ocrqa-unigram-cookbook
   ```

2. Install system-level dependencies:

   ```bash
   # Ubuntu or Debian
   bash cookbook/install_apt.sh

   # macOS (with Homebrew)
   bash cookbook/install_brew.sh
   ```

3. Configure environment:

   Before running any processing, configure your environment (see Configuration):

   ```bash
   cp dotenv.sample .env
   # Edit .env with your S3 credentials and settings
   ```

4. Install Python dependencies:

   ```bash
   # Using pipenv (recommended)
   export PIPENV_VENV_IN_PROJECT=enabled
   pipenv install
   ```

5. Initialize the environment:

   ```bash
   pipenv shell
   make setup
   ```
The following steps assume that you have activated the pipenv shell.
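To verify the setup, list the available targets and the current configuration:

```bash
make help
```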
Process a small newspaper to verify everything works:

```bash
# Test with a smaller newspaper first
make newspaper NEWSPAPER=actionfem

# Process entire collection
make collection
```

You can also run individual steps:

1. Sync data:

   ```bash
   make sync NEWSPAPER=actionfem
   ```

2. Run processing:

   ```bash
   make processing-target NEWSPAPER=actionfem
   ```

3. Upload results:

   ```bash
   make sync-output NEWSPAPER=actionfem
   ```
## Configuration

Edit your .env file with these required settings:
```bash
# S3 Configuration (required)
SE_ACCESS_KEY=your_s3_access_key
SE_SECRET_KEY=your_s3_secret_key
SE_HOST_URL=https://os.zhdk.cloud.switch.ch/

# Logging Configuration (optional)
LOGGING_LEVEL=INFO
```

Or provide these variables in your shell environment by other means.
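For instance, exporting them directly in your shell session works just as well; this minimal sketch reuses the variable names from dotenv.sample:

```bash
# Alternative to .env: export the variables in the current shell
export SE_ACCESS_KEY=your_s3_access_key
export SE_SECRET_KEY=your_s3_secret_key
export SE_HOST_URL=https://os.zhdk.cloud.switch.ch/
```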
These can be set in .env or passed as command arguments:

- `NEWSPAPER`: Target newspaper to process
- `BUILD_DIR`: Local build directory (default: `build.d`)
- `PARALLEL_JOBS`: Maximum number of parallel years of a newspaper to process
- `COLLECTION_JOBS`: Number of newspaper titles to be run in parallel
- `NEWSPAPER_YEAR_SORTING`: Processing order of years (`shuf` for random, `cat` for chronological)
Configure S3 buckets in your paths file:

- `S3_BUCKET_REBUILT`: Input data bucket (default: `22-rebuilt-final`)
- `S3_BUCKET_OCRQA`: Output data bucket (default: `140-processed-data-sandbox`)
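For example, the processing variables can be overridden per invocation (the values here are illustrative):

```bash
# Process one title with up to four years in parallel, in random year order
make newspaper NEWSPAPER=actionfem PARALLEL_JOBS=4 NEWSPAPER_YEAR_SORTING=shuf
```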
## Build System

After installation, these are the main commands you'll use:

- `make help`: Show available targets and current configuration
- `make setup`: Initialize environment (run once after installation)
- `make newspaper`: Process a single newspaper
- `make collection`: Process multiple newspapers in parallel
- `make all`: Complete processing pipeline with data sync

Data synchronization and cleanup:

- `make sync`: Sync input and output data
- `make sync-input`: Download input data from S3
- `make sync-output`: Upload results to S3 (will never overwrite existing data)
- `make clean-build`: Remove build directory
The system automatically detects CPU cores and configures parallel processing:

```bash
# Process collection with custom parallelization
make collection COLLECTION_JOBS=4 MAX_LOAD=8
```

The build system uses:
- Stamp Files: Track processing state without downloading full datasets
- S3 Integration: Direct processing from/to S3 storage
- Distributed Processing: Multiple machines can work independently (see the sketch below)
- Dependency Management: Automatic dependency resolution via Make
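As a sketch of the distributed mode: each machine targets a different title, and the stamp files together with the S3 existence checks keep independent runs from overwriting each other (the second title name is illustrative):

```bash
# Machine A
make newspaper NEWSPAPER=actionfem

# Machine B, independently processing another title (name is illustrative)
make newspaper NEWSPAPER=another_title
```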
For detailed build system documentation, see cookbook/README.md.
## Quality Assessment Methods

The OCR quality assessment script supports multiple evaluation methods:

- `unk_type_ratio` (default): Calculates the ratio of known unique subtoken types to all unique subtoken types. This measures how many unique words in the text are recognized by the Bloom filter, serving as an indicator of OCR quality. For example, if a document contains 200 unique subtoken types and 180 of them are found in the Bloom filter, the known-type ratio is 180/200 = 0.9.
- `unk_ratio`: Measures the overall ratio of unknown tokens to total tokens in the document.
The system uses Bloom filters for efficient lexical matching across multiple languages:
- Support for multiple language-specific dictionaries
- Efficient memory usage for large lexicons
- Fast lookup operations
- Configurable via Hugging Face Hub references or local files (see the example below)
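For instance, the same dictionary can be passed either as a local file or as a Hub reference of the form `hf://model_id/filename`; the file and model names below are placeholders:

```bash
# Local bloom filter file (placeholder path)
python lib/ocrqa_bloom.py --input input.jsonl --languages de \
  --bloomdicts bloom_de.bloom --output results.jsonl

# The same call with a Hugging Face Hub reference (placeholder model id)
python lib/ocrqa_bloom.py --input input.jsonl --languages de \
  --bloomdicts hf://model_id/bloom_de.bloom --output results.jsonl
```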
## Advanced Usage

To run the OCR quality assessment script directly, use the following command:
```bash
python lib/ocrqa_bloom.py \
  --input input.jsonl \
  --bloomdicts bloom1.bloom bloom2.bloom \
  --languages en fr \
  --methods slc unk_ratio \
  --output results.jsonl \
  --lid langident.json
```

Options:

- `--input`: Input JSONL files (default: stdin)
- `--output`: Output file (default: stdout)
- `--log-file FILE`: Write log to FILE
- `-l, --languages`: Language ISO 2-letter codes (must match the sequence of provided bloom dictionaries)
- `--bloomdicts`: Paths to JSON files containing bloom dictionaries or Hugging Face Hub references (e.g., `hf://model_id/bloom.bloom`)
- `--lid`: Path to language identification file
- `--methods`: OCR QA methods to use (default: `unk_type_ratio`)
  - Available: `unk_type_ratio`, `unk_ratio`
- `--keep-best`: Keep only the highest OCR value for a given content item using the first method in `--methods`
- `--unicode-normalization`: Unicode normalization form to apply to input text (default: NFKC)
- `-C, --single_letter_cost`: Cost for an infrequent single char (default: 0.7)
- `-S, --single_symbol_cost`: Cost for an infrequent symbol char (default: 0.3)
- `--log-level`: Logging level (default: INFO)
- `-q, --quiet`: Do not print status messages to stderr
- `-v, --verbose-output`: Print verbose output information
- `--s3-output-path`: S3 path to upload the output file after processing, or to check if it already exists
- `--quit-if-s3-output-exists`: Quit if the output file already exists in the specified S3 bucket
- `--keep-timestamp-only`: After uploading to S3, keep only the timestamp of the local output file for data efficiency
- `--s3-output-dry-run`: Dry run which suppresses all write operations to S3 and checks whether output files exist
Basic processing with Hugging Face dictionaries:

```bash
python lib/ocrqa_bloom.py \
  --input input.jsonl \
  --bloomdicts hf://impresso/bloom-en hf://impresso/bloom-fr \
  --languages en fr \
  --methods unk_type_ratio slc \
  --output results.jsonl \
  --lid langident.json
```

Processing with S3 integration:
```bash
python lib/ocrqa_bloom.py \
  --input input.jsonl \
  --bloomdicts hf://impresso/bloom-en \
  --languages en \
  --s3-output-path s3://bucket/path/output.jsonl \
  --quit-if-s3-output-exists
```

To produce extended output identifying unknown words (useful for dictionary improvement):

```bash
# For a newspaper in Luxembourgish
CONFIG_LOCAL_MAKE=cookbook-repo-addons/config-lb-unknowns.mk make all NEWSPAPER=your_newspaper
```

The following background provides context for understanding OCR challenges in historical Luxembourgish texts.
In older Luxembourgish orthography, apostrophes after vowels served several purposes:

- Indicating long or stressed vowels
  - gro'ss → modern grouss
  - se'er → modern seier
- Marking elision or glottalization
  - ge'nt, go'f, go'w (possible sound loss or separation)
- Clarifying pronunciation in loanwords
  - Unio'n, situatio'n, millio'nen
- Separating prefixes or morphemes
  - ne'deg → modern néideg
  - we'neg → modern wéineg
The orthography evolved through successive reforms:

- Pre-1946: Apostrophes were common after vowels, often inconsistently
- 1946 Reform: Reduced apostrophe use, favoring phonetic spelling
- 1975 Reform: Further simplification, removing unnecessary markers
- 1999 Reform: Apostrophes after vowels were eliminated, except in contractions (e.g., d'Kanner remains, but se'er → seier)
The historical use of apostrophes after vowels served as a pronunciation guide for vowel length, stress, and borrowed words. Over time, Luxembourgish orthography standardized and simplified, leading to the apostrophe's removal in these contexts. This historical variation presents unique challenges for OCR quality assessment of historical Luxembourgish newspapers.
## Contributing

We welcome contributions to improve this OCR quality assessment pipeline:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/your-feature`)
3. Make your changes
4. Test with `make newspaper NEWSPAPER=actionfem`
5. Commit your changes (`git commit -am 'Add new feature'`)
6. Push to the branch (`git push origin feature/your-feature`)
7. Submit a pull request
For any questions or issues, please contact simon.clematide@uzh.ch.
## About Impresso
### Impresso Project
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders.
The project is funded by:
- Swiss National Science Foundation (grants CRSII5_173719 and CRSII5_213585)
- Luxembourg National Research Fund (grant 17498891)
Copyright (C) 2018-2025 The Impresso team.
Contributors to this program include: Maud Ehrmann, Simon Clematide
This program is provided as open source under the GNU Affero General Public License v3 or later.
