Best practice for batch OCR processing with memory optimization? #10
Hi, I am working on processing large batches of scanned documents (500+ pages) with german-ocr and running into memory issues. Current setup:
Questions:
Any best practices for production deployments would be greatly appreciated! Thanks!
Answered by Keyvanhardani, Jan 1, 2026
Great questions! Here are my recommendations based on production experience:

**1. Batch Size Optimization**

For 16GB RAM, I recommend:

```python
from german_ocr import OCREngine
import gc

# Create the engine once and reuse it across batches
engine = OCREngine()

def process_in_batches(pages, batch_size=15):
    results = []
    for i in range(0, len(pages), batch_size):
        batch = pages[i:i + batch_size]
        results.extend(engine.process(batch))
        gc.collect()  # Force garbage collection between batches
    return results
```
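As a quick usage sketch (the glob pattern and the assumption that `engine.process` accepts whatever page objects you load are illustrative only; check the german-ocr docs for the exact input type it expects):

```python
from pathlib import Path

# Hypothetical example: collect scanned pages and run them through the
# batched helper above.
pages = sorted(Path("scans/").glob("*.png"))
results = process_in_batches(pages, batch_size=15)
print(f"Processed {len(results)} pages")
```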
**2. Multiprocessing vs Async**

Use multiprocessing for CPU-bound OCR tasks:

```python
from concurrent.futures import ProcessPoolExecutor
import os

workers = max(1, os.cpu_count() - 1)  # Leave one core free

# process_page is your single-page OCR function; it must be a picklable,
# top-level function for ProcessPoolExecutor to dispatch it to workers.
with ProcessPoolExecutor(max_workers=workers) as executor:
    results = list(executor.map(process_page, pages))
```

Async is better for I/O-bound tasks (file loading, API calls).
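For the I/O-bound side, here is a minimal sketch of bounded-concurrency loading with asyncio (aiohttp and the idea of fetching scans over HTTP are assumptions for illustration, not part of german-ocr):

```python
import asyncio
import aiohttp

async def fetch_page_bytes(session, url):
    # I/O-bound step: download a scanned page before handing it to OCR
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()

async def load_all(urls, concurrency=10):
    sem = asyncio.Semaphore(concurrency)  # cap concurrent downloads
    async with aiohttp.ClientSession() as session:
        async def bounded(url):
            async with sem:
                return await fetch_page_bytes(session, url)
        return await asyncio.gather(*(bounded(u) for u in urls))

# page_bytes = asyncio.run(load_all(scan_urls))
```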
**3. Caching Strategy**

For repeated document types, use template caching:

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=100)
def get_layout_template(doc_hash):
    # Cache detected layouts for similar documents
    # (detect_layout is the layout-detection step, defined elsewhere)
    return detect_layout(doc_hash)

def get_doc_hash(image):
    # Hash based on structure, not pixel values
    return hashlib.md5(image.tobytes()[:1000]).hexdigest()
```
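A small usage sketch of the two helpers together (assuming `pages` is a list of PIL-style images exposing `.tobytes()`, and `detect_layout` is defined as above):

```python
# Pages that hash to the same value reuse the cached layout,
# so detect_layout runs at most once per distinct hash.
layouts = [get_layout_template(get_doc_hash(img)) for img in pages]
print(get_layout_template.cache_info())  # hits/misses from functools.lru_cache
```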
**Production Tips**

Hope this helps! Let me know if you need more details.
Answer selected by DefcoGit