tagay1n/ocr-benchmark-pipeline

OCR Benchmark Pipeline

FastAPI + SQLite application for building an OCR benchmark dataset from document page images.

Goal

Prepare high-quality, reviewer-validated OCR data with this workflow:

  1. Discover images in input/ and index them.
  2. Detect document layouts (DocLayNet YOLO model).
  3. Review and fix layouts manually.
  4. Extract OCR content from reviewed layouts (Gemini).
  5. Review and fix extracted OCR content manually.

Current Product Surface

  • Dashboard (/):
    • Pipeline actions with live counters: Scan(total) -> Review layouts(done/total) -> Review OCR(done/total) -> Export.
    • Batch OCR action to queue/stop global OCR extraction for all eligible pages (layout_reviewed/ocr_failed) that still have missing layout outputs.
    • Benchmark action opens dedicated benchmark page.
    • Live backend activity panel (SSE stream).
    • Duplicate-file warnings.
    • Sortable + paginated indexed-images table (default: Added time newest first).
    • Pagination controls with page size 25/50/100.
    • Per-row actions: open Layout/OCR review and remove an image (with confirmation).
  • Layout benchmark (/static/layout_benchmark.html):
    • Start/stop benchmark run.
    • Recalculate score action to recompute scores from stored benchmark predictions without rerunning detection.
    • Leaderboard + explorer matrix views, with the currently running params highlighted and the best-so-far config shown.
    • Hard-case subset reporting per config (hard_case_score, page count).
  • Layout review (/static/layouts.html?page_id=<id>):
    • Editable class, reading order, bbox.
    • Drag-and-drop reading order.
    • Per-page reading-order mode selector: Auto, Single, Multi-column, Two-page.
    • Reorder action recomputes reading order from the selected mode.
    • Bbox editing from table and by canvas handles.
    • Overlapping bbox borders are highlighted with striped warning segments.
    • Quick source magnifier (M, hold Alt, or toolbar button) with layout overlays.
    • Caption binding mode from caption bbox (Bind), with visible arrows to table/picture/formula targets and explicit unbind controls.
    • Detect modal with model params, top-3 benchmark suggestions for model+imgsz, and in-flight busy state.
  • OCR review (/static/ocr_review.html?page_id=<id>):
    • Source + reconstructed preview panels with synchronized scrolling.
    • Review modes: Two panels and Line by line (slot-style line approval rail).
    • Draft editing and per-layout restore.
    • Quick source magnifier (M, hold Alt, or toolbar button) with OCR bbox overlays.
    • Detect modal with layout selection, model picker, and generation params.
    • OCR extraction is retried per bbox and then marked failed if still unsuccessful; failed bboxes stay editable and can be re-detected per-layout.
    • Marking OCR reviewed requires resolving failed/missing required bboxes (re-detect or manual text entry).
    • All pipeline steps are triggered manually by reviewer action.

Configuration

Defaults are loaded from config.yaml (or APP_CONFIG_PATH).

source_dir: input
db_path: data/ocr_dataset.db
result_dir: result
allowed_image_extensions:
  - .jpg
  - .jpeg
  - .png
  - .tif
  - .tiff
  - .webp
enable_background_jobs: true
supported_ocr_models:
  - gemini-3-flash-preview
  - gemini-2.5-flash
gemini_keys: []

Environment overrides:

  • SOURCE_DIR
  • DB_PATH
  • RESULT_DIR
  • ALLOWED_IMAGE_EXTENSIONS (comma-separated)
  • APP_CONFIG_PATH
  • ENABLE_BACKGROUND_JOBS
  • SUPPORTED_OCR_MODELS (comma-separated)
  • GEMINI_KEYS (comma-separated)
  • GEMINI_USAGE_PATH
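
As a rough illustration of how the overrides above could be layered on top of the defaults, here is a minimal sketch. It is not the application's actual loader: the helper names (resolve_settings, split_csv, parse_bool) and the accepted boolean spellings are assumptions, and APP_CONFIG_PATH and GEMINI_USAGE_PATH are left out because they have no matching config.yaml key in the listing above.

```python
import os
from typing import Any, Dict, List, Mapping

# Defaults mirror the config.yaml values listed above.
DEFAULTS: Dict[str, Any] = {
    "source_dir": "input",
    "db_path": "data/ocr_dataset.db",
    "result_dir": "result",
    "allowed_image_extensions": [".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"],
    "enable_background_jobs": True,
    "supported_ocr_models": ["gemini-3-flash-preview", "gemini-2.5-flash"],
    "gemini_keys": [],
}

# Keys whose environment value is a comma-separated list.
LIST_KEYS = {"allowed_image_extensions", "supported_ocr_models", "gemini_keys"}
BOOL_KEYS = {"enable_background_jobs"}


def split_csv(raw: str) -> List[str]:
    """Split a comma-separated value, dropping blanks and surrounding spaces."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def parse_bool(raw: str) -> bool:
    """Assumed truthy spellings; the real application may accept others."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def resolve_settings(env: Mapping[str, str],
                     defaults: Mapping[str, Any] = DEFAULTS) -> Dict[str, Any]:
    """Return the defaults with any matching upper-case environment variable applied."""
    settings = dict(defaults)
    for key in defaults:
        raw = env.get(key.upper())
        if raw is None:
            continue  # no override for this key
        if key in LIST_KEYS:
            settings[key] = split_csv(raw)
        elif key in BOOL_KEYS:
            settings[key] = parse_bool(raw)
        else:
            settings[key] = raw
    return settings


if __name__ == "__main__":
    print(resolve_settings(os.environ))
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the override logic easy to exercise in a unit test.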

Run

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Open http://127.0.0.1:8000.

Tests

Backend:

.venv/bin/python -m unittest discover -s tests -p "test_*.py"

Frontend:

node --test frontend_tests/*.test.mjs

API Quick Reference

  • POST /api/discovery/scan
  • POST /api/state/wipe
  • GET /api/pages (supports limit, cursor, sort, dir)
  • GET /api/pages/summary
  • DELETE /api/pages/{page_id}
  • GET /api/pages/{page_id}/layouts
  • PATCH /api/pages/{page_id}/layout-order-mode
  • POST /api/pages/{page_id}/layouts/reorder
  • POST /api/pages/{page_id}/layouts/detect
  • POST /api/pages/{page_id}/layouts/review-complete
  • GET /api/pages/{page_id}/ocr-outputs
  • POST /api/pages/{page_id}/ocr/reextract
  • POST /api/pages/{page_id}/ocr/review-complete
  • GET /api/pipeline/activity
  • GET /api/pipeline/activity/stream
  • GET /api/layout-benchmark/status
  • GET /api/layout-benchmark/grid
  • POST /api/layout-benchmark/run
  • POST /api/layout-benchmark/stop
  • GET /api/ocr-batch/status
  • POST /api/ocr-batch/run
  • POST /api/ocr-batch/stop
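
For example, the query parameters on GET /api/pages could be assembled as in the sketch below. The helper name pages_url is hypothetical, the base URL is the default address from the Run section, and the accepted values for sort and dir are not documented here, so any value passed (such as "desc") is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8000"  # default uvicorn address from the Run section


def pages_url(limit: Optional[int] = None,
              cursor: Optional[str] = None,
              sort: Optional[str] = None,
              direction: Optional[str] = None,
              base_url: str = BASE_URL) -> str:
    """Build a GET /api/pages URL, including only the parameters that were given."""
    params = {"limit": limit, "cursor": cursor, "sort": sort, "dir": direction}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}/api/pages" + (f"?{query}" if query else "")


if __name__ == "__main__":
    # "desc" is an assumed value for the dir parameter.
    print(pages_url(limit=25, direction="desc"))
```

With the server running, the resulting URL can be fetched with urllib.request.urlopen or any HTTP client; the response shape is not described in this README.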

OCR Prompt Debug Artifacts

Prompt source-of-truth (editable):

  • app/ocr_prompts.py
  • tests/fixtures/ocr_prompt_snapshots.json (golden prompt snapshots used by tests)

Generate prompt reference markdown deterministically:

  • .venv/bin/python scripts/generate_prompt_reference.py
  • Output: OCR_PROMPTS_REFERENCE.md

Gemini OCR response contract:

  • Gemini must return JSON with exactly one key: {"content":"..."}
  • The backend validates the JSON shape and, on an invalid response, retries according to the existing retry policy.
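
A shape check for this contract might look like the following minimal sketch. The function name parse_ocr_response is hypothetical, and requiring the value to be a string is an assumption drawn from the {"content":"..."} example; the backend's real validator and retry handling may differ.

```python
import json


def parse_ocr_response(raw: str) -> str:
    """Return the text under "content", or raise ValueError if the shape is wrong."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    # The contract allows exactly one key, so extra or missing keys are rejected.
    if not isinstance(payload, dict) or set(payload) != {"content"}:
        raise ValueError('response must be an object with exactly one key: "content"')
    if not isinstance(payload["content"], str):
        raise ValueError('"content" must be a string')  # assumption, see lead-in
    return payload["content"]
```

A caller would treat ValueError as the signal to retry the request.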

Each OCR extraction run writes resolved text prompts (without image clip bytes) to:

  • _artifacts/ocr_prompts/<timestamp>_page_<page_id>.jsonl

Each JSONL row includes page/layout identifiers, class, output format, and the exact prompt sent to Gemini.
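
To inspect these artifacts, a small reader along the following lines may be enough. The exact field names in each row are not listed in this README, so the sketch only prints the keys it finds; iter_jsonl is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed object per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    # Walk every prompt artifact and show which fields each row carries.
    for path in sorted(Path("_artifacts/ocr_prompts").glob("*_page_*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for row in iter_jsonl(handle):
                print(path.name, sorted(row))
```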

OCR Formatting Decisions Log

This section is a living log of OCR normalization decisions for dataset consistency. Add new items as rules are agreed.

  • Multiline emphasis in source text:

    • If text is visually italic, bold, or bold+italic across multiple lines, apply Markdown markers per line (not once for the full block).
    • Use *line* for italic, **line** for bold, and ***line*** for bold+italic.
    • Start and end every affected line with its corresponding marker.
  • Diacritics policy (strict):

    • Preserve diacritics exactly as visible in source text; do not simplify.
    • Ground-truth text keeps stressed/diacritic forms (not stripped variants).
    • Treat script lookalikes as different characters (Cyrillic vs Latin are not interchangeable).
    • Store text in NFC form for consistency, but keep diacritic meaning unchanged.
  • Diacritics examples:

    • Keep А́ (U+0410 + U+0301, Cyrillic А + combining stress), not plain А (U+0410).
    • Keep ё (U+0451), not е (U+0435).
    • Keep ә (U+04D9), not Latin a (U+0061).
    • Do not replace Cyrillic А́ (U+0410 + U+0301) with precomposed Latin Á (U+00C1) or decomposed Latin Á (U+0041 + U+0301).
  • Header hierarchy policy:

    • Default all extracted headers to level 4 (#### ).
    • If page structure clearly shows hierarchy, adjust header levels.
    • Use fewer # for higher-level headers: ### for higher, ## for top-level on the page.
    • Keep #### for lower/subordinate headers when they are visually less prominent.
    • Keep header levels consistent within the same page/document section.
  • Header hierarchy example:

    • Top header: ## Chapter title
    • Subheader: ### Section title
    • Lower subheader: #### Subsection title
  • Multiline headers policy:

    • Do not split one semantic header across multiple Markdown lines with only the first line marked as #.
    • Do not encode continuation lines as bold paragraphs to imitate header continuation.
    • Keep one semantic header as one Markdown header line by joining wrapped source lines into a single heading text.
    • If a visual line break must be preserved inside a heading, use an explicit HTML break inside the same header (for example: ## First line<br>Second line).
  • Error preservation policy:

    • Dataset policy is to keep source text as printed, including typos, spelling mistakes, and grammar irregularities.
    • OCR/review normalization does not correct linguistic errors automatically.
    • When intended wording seems obvious, the original printed form is still retained.
    • Punctuation or casing anomalies are retained when they are part of the source.
    • Corrections are limited to clear OCR character misreads, without changing wording/style.
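
Three of the rules above (per-line emphasis, NFC storage, and single-line headers) are mechanical enough to sketch in code. The helpers below are illustrative only and are not taken from the repository; their names and signatures are assumptions.

```python
import unicodedata
from typing import List

MARKERS = {"italic": "*", "bold": "**", "bold_italic": "***"}


def emphasize_lines(text: str, style: str) -> str:
    """Wrap every non-empty line in the marker for the given style."""
    marker = MARKERS[style]
    # Surrounding whitespace is stripped so the markers sit directly on the text.
    return "\n".join(
        f"{marker}{line.strip()}{marker}" if line.strip() else line
        for line in text.split("\n")
    )


def to_nfc(text: str) -> str:
    """Normalize to NFC: composes where Unicode defines a composed form, never strips marks."""
    return unicodedata.normalize("NFC", text)


def join_header(source_lines: List[str], level: int = 4, keep_breaks: bool = False) -> str:
    """Render wrapped header lines as one Markdown header line (default level 4)."""
    parts = [line.strip() for line in source_lines if line.strip()]
    separator = "<br>" if keep_breaks else " "
    return "#" * level + " " + separator.join(parts)


if __name__ == "__main__":
    print(emphasize_lines("first line\nsecond line", "bold"))
    # Cyrillic А + U+0301 has no precomposed form, so NFC leaves it as two code points.
    print([hex(ord(ch)) for ch in to_nfc("\u0410\u0301")])
    print(join_header(["First line", "Second line"], level=2, keep_breaks=True))
```

NFC keeps the Cyrillic and Latin forms distinct: Latin A + U+0301 composes to U+00C1, while the Cyrillic sequence stays as U+0410 + U+0301, so normalization alone never turns one script into the other.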

Documentation Policy

This repository keeps active project documentation in only two files:

  • README.md (product + usage)
  • AGENTS.md (engineering collaboration rules)

About

Language-agnostic OCR benchmark pipeline to discover document images, review/edit layouts in a web UI, run OCR extraction, and build high-quality evaluation datasets.
