OCR Benchmark Pipeline

FastAPI + SQLite application for building an OCR benchmark dataset from document page images.

Goal

Prepare high-quality, reviewer-validated OCR data with this workflow:

Discover images in input/ and index them.
Detect document layouts (DocLayNet YOLO model).
Review and fix layouts manually.
Extract OCR content from reviewed layouts (Gemini).
Review and fix extracted OCR content manually.

Current Product Surface

Dashboard (/):
- Pipeline actions with live counters: Scan(total) -> Review layouts(done/total) -> Review OCR(done/total) -> Export.
- Batch OCR action to queue or stop global OCR extraction for all eligible pages (status layout_reviewed or ocr_failed) that still have layouts without OCR output.
- Benchmark action opens dedicated benchmark page.
- Live backend activity panel (SSE stream).
- Duplicate-file warnings.
- Sortable + paginated indexed-images table (default: Added time newest first).
- Pagination controls with page size 25/50/100.
- Per-row actions: open Layout/OCR review and remove an image (with confirmation).
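
The live activity panel reads a Server-Sent Events stream (GET /api/pipeline/activity/stream in the API reference). A minimal sketch of pulling the data payloads out of such a stream, assuming standard SSE framing; the helper name iter_sse_data is hypothetical and the payload contents are not specified here:

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE text stream.

    Standard framing: "data:" lines accumulate, a blank line ends the event,
    and lines starting with ":" are comments (often used as keep-alives).
    Other fields such as "event:" or "id:" are ignored in this sketch.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the event collected so far, if any.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event not terminated by a blank line is incomplete and is dropped.
```

Any line iterator works as input, for example the decoded lines of a streaming HTTP response.
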
Layout benchmark (/static/layout_benchmark.html):
- Start/stop benchmark run.
- Recalculate score action to recompute scores from stored benchmark predictions without rerunning detection.
- Leaderboard + explorer matrix views, with the currently running parameters highlighted and the best configuration so far.
- Hard-case subset reporting per config (hard_case_score, page count).
Layout review (/static/layouts.html?page_id=<id>):
- Editable class, reading order, bbox.
- Drag-and-drop reading order.
- Per-page reading-order mode selector: Auto, Single, Multi-column, Two-page.
- Reorder action recomputes reading order from the selected mode.
- Bbox editing from table and by canvas handles.
- Overlapping bbox borders are highlighted with striped warning segments.
- Quick source magnifier (M, hold Alt, or toolbar button) with layout overlays.
- Caption binding mode from caption bbox (Bind), with visible arrows to table/picture/formula targets and explicit unbind controls.
- Detect modal with model params, top-3 benchmark suggestions for model+imgsz, and in-flight busy state.
OCR review (/static/ocr_review.html?page_id=<id>):
- Source + reconstructed preview panels with synchronized scrolling.
- Review modes: Two panels and Line by line (slot-style line approval rail).
- Draft editing and per-layout restore.
- Quick source magnifier (M, hold Alt, or toolbar button) with OCR bbox overlays.
- Detect modal with layout selection, model picker, and generation params.
- OCR extraction is retried per bbox and then marked failed if still unsuccessful; failed bboxes stay editable and can be re-detected per-layout.
- Marking OCR reviewed requires resolving failed/missing required bboxes (re-detect or manual text entry).
- All pipeline steps run only when a reviewer triggers them manually.
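
The per-bbox retry behaviour described in the OCR review list can be sketched as follows. The function name, the attempt limit of 3, and the status strings are assumptions for illustration only; the actual retry policy is defined in the backend.

```python
from typing import Callable, Optional, Tuple


def extract_with_retries(
    extract: Callable[[], str],
    max_attempts: int = 3,  # assumed limit, not the application's real value
) -> Tuple[Optional[str], str]:
    """Call extract() until it succeeds or the attempts run out.

    Returns (text, "ok") on success, or (None, "failed") once every attempt
    has raised. A failed bbox can then be edited by hand or re-detected.
    """
    for _ in range(max_attempts):
        try:
            return extract(), "ok"
        except Exception:
            continue  # retry; a real implementation would log the error
    return None, "failed"
```
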

Configuration

Defaults are loaded from config.yaml, or from the file named by APP_CONFIG_PATH when that variable is set.

source_dir: input
db_path: data/ocr_dataset.db
result_dir: result
allowed_image_extensions:
  - .jpg
  - .jpeg
  - .png
  - .tif
  - .tiff
  - .webp
enable_background_jobs: true
supported_ocr_models:
  - gemini-3-flash-preview
  - gemini-2.5-flash
gemini_keys: []

Environment overrides:

SOURCE_DIR
DB_PATH
RESULT_DIR
ALLOWED_IMAGE_EXTENSIONS (comma-separated)
APP_CONFIG_PATH
ENABLE_BACKGROUND_JOBS
SUPPORTED_OCR_MODELS (comma-separated)
GEMINI_KEYS (comma-separated)
GEMINI_USAGE_PATH
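
A minimal sketch of how these overrides could be layered on top of the file defaults. The function names, the mapping of variables to config keys, and the accepted boolean spellings are assumptions, not the application's actual loader. APP_CONFIG_PATH and GEMINI_USAGE_PATH are left out because they have no matching key in the defaults shown above.

```python
import os
from typing import Any, Dict, List, Mapping, Optional

# Assumed mapping from environment variable to config key.
STRING_VARS = {
    "SOURCE_DIR": "source_dir",
    "DB_PATH": "db_path",
    "RESULT_DIR": "result_dir",
}
# Variables documented as comma-separated lists.
LIST_VARS = {
    "ALLOWED_IMAGE_EXTENSIONS": "allowed_image_extensions",
    "SUPPORTED_OCR_MODELS": "supported_ocr_models",
    "GEMINI_KEYS": "gemini_keys",
}


def split_csv(value: str) -> List[str]:
    """Split a comma-separated value, dropping blanks and extra spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def apply_env_overrides(
    defaults: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Return a copy of defaults with any set environment overrides applied."""
    env = os.environ if env is None else env
    config = dict(defaults)
    for var, key in STRING_VARS.items():
        if var in env:
            config[key] = env[var]
    for var, key in LIST_VARS.items():
        if var in env:
            config[key] = split_csv(env[var])
    if "ENABLE_BACKGROUND_JOBS" in env:
        # The set of truthy spellings is an assumption.
        flag = env["ENABLE_BACKGROUND_JOBS"].strip().lower()
        config["enable_background_jobs"] = flag in {"1", "true", "yes", "on"}
    return config
```
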

Run

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Open http://127.0.0.1:8000.

Tests

Backend:

.venv/bin/python -m unittest discover -s tests -p "test_*.py"

Frontend:

node --test frontend_tests/*.test.mjs

API Quick Reference

POST /api/discovery/scan
POST /api/state/wipe
GET /api/pages (supports limit, cursor, sort, dir)
GET /api/pages/summary
DELETE /api/pages/{page_id}
GET /api/pages/{page_id}/layouts
PATCH /api/pages/{page_id}/layout-order-mode
POST /api/pages/{page_id}/layouts/reorder
POST /api/pages/{page_id}/layouts/detect
POST /api/pages/{page_id}/layouts/review-complete
GET /api/pages/{page_id}/ocr-outputs
POST /api/pages/{page_id}/ocr/reextract
POST /api/pages/{page_id}/ocr/review-complete
GET /api/pipeline/activity
GET /api/pipeline/activity/stream
GET /api/layout-benchmark/status
GET /api/layout-benchmark/grid
POST /api/layout-benchmark/run
POST /api/layout-benchmark/stop
GET /api/ocr-batch/status
POST /api/ocr-batch/run
POST /api/ocr-batch/stop
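
As an illustration of the query parameters on GET /api/pages, the sketch below builds a request URL. The helper name is hypothetical, and the sort and dir values in the usage note are guesses; check the backend for the values it accepts.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode


def build_pages_url(
    base_url: str,
    limit: int = 25,
    cursor: Optional[str] = None,
    sort: Optional[str] = None,
    direction: Optional[str] = None,
) -> str:
    """Build the GET /api/pages URL, omitting parameters that are not set."""
    params: Dict[str, Union[int, str]] = {"limit": limit}
    if cursor is not None:
        params["cursor"] = cursor
    if sort is not None:
        params["sort"] = sort
    if direction is not None:
        params["dir"] = direction  # the query parameter is named "dir"
    return f"{base_url.rstrip('/')}/api/pages?{urlencode(params)}"
```

For example, build_pages_url("http://127.0.0.1:8000", limit=50, sort="added_at", direction="desc") yields a URL ending in /api/pages?limit=50&sort=added_at&dir=desc, where added_at and desc are placeholder values.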

OCR Prompt Debug Artifacts

Prompt source-of-truth (editable):

app/ocr_prompts.py
tests/fixtures/ocr_prompt_snapshots.json (golden prompt snapshots used by tests)

Generate prompt reference markdown deterministically:

.venv/bin/python scripts/generate_prompt_reference.py
Output: OCR_PROMPTS_REFERENCE.md

Gemini OCR response contract:

Gemini must return JSON with exactly one key: {"content":"..."}
The backend validates the JSON shape and, on an invalid response, retries according to the existing retry policy.
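
A minimal sketch of a shape check for this contract. The function name is hypothetical, and treating a non-string content value as invalid is an assumption based on the example payload; the backend's own validator may differ.

```python
import json


def parse_ocr_response(raw: str) -> str:
    """Return the "content" string or raise ValueError for any other shape."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    # Exactly one key, and that key must be "content".
    if not isinstance(payload, dict) or set(payload) != {"content"}:
        raise ValueError('response must be an object with exactly one key, "content"')
    if not isinstance(payload["content"], str):
        raise ValueError('"content" must be a string')
    return payload["content"]
```

A caller would treat ValueError as the signal to retry the request.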

Each OCR extraction run writes resolved text prompts (without image clip bytes) to:

_artifacts/ocr_prompts/<timestamp>_page_<page_id>.jsonl

Each JSONL row includes page/layout identifiers, class, output format, and the exact prompt sent to Gemini.
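
To inspect one of these artifact files, a reader along these lines is enough. This is a sketch; the function name is hypothetical, and the exact field names inside each row are not listed here, so check a real file before relying on any key.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_prompt_rows(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Each yielded row is a plain dict, so the prompt sent for a given layout can be printed or diffed against the golden snapshots.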

OCR Formatting Decisions Log

This section is a living log of OCR normalization decisions for dataset consistency. Add new items as rules are agreed.

Multiline emphasis in source text:
- If text is visually italic, bold, or bold+italic across multiple lines, apply Markdown markers per line (not once for the full block).
- Use *line* for italic, **line** for bold, and ***line*** for bold+italic.
- Start and end every affected line with its corresponding marker.
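
The per-line marker rule can be sketched as a small helper. The function name is hypothetical, and skipping blank lines is an assumption the rule itself does not state.

```python
def emphasize_lines(text: str, bold: bool = False, italic: bool = False) -> str:
    """Wrap every non-blank line in its own Markdown emphasis markers."""
    # "*" for italic, "**" for bold, "***" for bold+italic.
    marker = "*" * ((2 if bold else 0) + (1 if italic else 0))
    if not marker:
        return text
    return "\n".join(
        # Strip so the markers touch the text, which Markdown emphasis requires.
        f"{marker}{line.strip()}{marker}" if line.strip() else line
        for line in text.split("\n")
    )
```
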
Diacritics policy (strict):
- Preserve diacritics exactly as visible in source text; do not simplify.
- Ground-truth text keeps stressed/diacritic forms (not stripped variants).
- Treat script lookalikes as different characters (Cyrillic vs Latin are not interchangeable).
- Store text in NFC form for consistency, but keep diacritic meaning unchanged.
Diacritics examples:
- Keep А́ (U+0410 + U+0301, Cyrillic А + combining stress), not plain А (U+0410).
- Keep ё (U+0451), not е (U+0435).
- Keep ә (U+04D9), not Latin a (U+0061).
- Do not replace Cyrillic А́ (U+0410 + U+0301) with Latin Á (U+00C1) or Latin Á (U+0041 + U+0301).
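
The NFC rule and the examples above can be checked with Python's standard unicodedata module. The helper names are hypothetical; the sketch only shows that NFC composes where a precomposed character exists and never merges Cyrillic with Latin.

```python
import unicodedata


def normalize_ground_truth(text: str) -> str:
    """Store text in NFC; this never strips or swaps diacritics."""
    return unicodedata.normalize("NFC", text)


def codepoints(text: str) -> str:
    """Render a string as U+XXXX code points for review."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Cyrillic A + combining acute has no precomposed form, so NFC keeps both.
stressed_cyrillic = normalize_ground_truth("\u0410\u0301")
# Cyrillic e + combining diaeresis composes to U+0451.
composed_yo = normalize_ground_truth("\u0435\u0308")
# Latin A + combining acute composes to U+00C1, a different character.
latin_acute = normalize_ground_truth("A\u0301")
```

Printing codepoints(...) for a suspicious string is a quick way to tell a Cyrillic character from its Latin lookalike during review.
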
Header hierarchy policy:
- Default all extracted headers to level 4 (#### ).
- If page structure clearly shows hierarchy, adjust header levels.
- Use fewer # for higher-level headers: ### for higher, ## for top-level on the page.
- Keep #### for lower/subordinate headers when they are visually less prominent.
- Keep header levels consistent within the same page/document section.
Header hierarchy example:
- Top header: ## Chapter title
- Subheader: ### Section title
- Lower subheader: #### Subsection title
Multiline headers policy:
- Do not split one semantic header across multiple Markdown lines with only the first line marked as #.
- Do not encode continuation lines as bold paragraphs to imitate header continuation.
- Keep one semantic header as one Markdown header line by joining wrapped source lines into a single heading text.
- If visual line break must be preserved inside a heading, use an explicit HTML break inside the same header (for example: ## First line<br>Second line).
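
A minimal sketch of the joining rule, assuming the wrapped source lines are already available as a list. The function name and parameters are hypothetical.

```python
from typing import Iterable


def join_heading(lines: Iterable[str], level: int = 4, keep_breaks: bool = False) -> str:
    """Join wrapped source lines into a single Markdown heading line.

    level defaults to 4, matching the default header level in this log.
    With keep_breaks=True the visual line breaks are kept as <br>.
    """
    parts = [line.strip() for line in lines if line.strip()]
    separator = "<br>" if keep_breaks else " "
    return f"{'#' * level} {separator.join(parts)}"
```

With keep_breaks=True and level=2, the lines "First line" and "Second line" produce ## First line<br>Second line, matching the example above.
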
Error preservation policy:
- Dataset policy is to keep source text as printed, including typos, spelling mistakes, and grammar irregularities.
- OCR/review normalization does not correct linguistic errors automatically.
- When intended wording seems obvious, the original printed form is still retained.
- Punctuation or casing anomalies are retained when they are part of the source.
- Corrections are limited to clear OCR character misreads, without changing wording/style.

Documentation Policy

This repository keeps active project documentation in only two files:

README.md (product + usage)
AGENTS.md (engineering collaboration rules)
