FastAPI + SQLite application for building an OCR benchmark dataset from document page images.
Prepare high-quality, reviewer-validated OCR data with this workflow:
- Discover images in `input/` and index them.
- Detect document layouts (DocLayNet YOLO model).
- Review and fix layouts manually.
- Extract OCR content from reviewed layouts (Gemini).
- Review and fix extracted OCR content manually.
- Dashboard (`/`):
  - Pipeline actions with live counters: `Scan` (total) -> `Review layouts` (done/total) -> `Review OCR` (done/total) -> `Export`.
  - `Batch OCR` action to queue/stop global OCR extraction for all eligible pages (`layout_reviewed`/`ocr_failed`) that still have missing layout outputs.
  - `Benchmark` action opens the dedicated benchmark page.
  - Live backend activity panel (SSE stream).
  - Duplicate-file warnings.
  - Sortable + paginated indexed-images table (default: `Added time`, newest first).
  - Pagination controls with page size `25/50/100`.
  - Per-row actions: open Layout/OCR review and remove an image (with confirmation).
- Layout benchmark (`/static/layout_benchmark.html`):
  - Start/stop benchmark run.
  - `Recalculate score` action to recompute scores from stored benchmark predictions without rerunning detection.
  - Leaderboard + explorer matrix views with the currently running params highlighted and the best-so-far config.
  - Hard-case subset reporting per config (`hard_case_score`, page count).
- Layout review (`/static/layouts.html?page_id=<id>`):
  - Editable class, reading order, and bbox.
  - Drag-and-drop reading order.
  - Per-page reading-order mode selector: `Auto`, `Single`, `Multi-column`, `Two-page`. The `Reorder` action recomputes reading order from the selected mode.
  - Bbox editing from the table and by canvas handles.
  - Overlapping bbox borders are highlighted with striped warning segments.
  - Quick source magnifier (`M`, hold `Alt`, or toolbar button) with layout overlays.
  - Caption binding mode from a caption bbox (`Bind`), with visible arrows to table/picture/formula targets and explicit unbind controls.
  - `Detect` modal with model params, top-3 benchmark suggestions for `model`+`imgsz`, and in-flight busy state.
- OCR review (`/static/ocr_review.html?page_id=<id>`):
  - Source + reconstructed preview panels with synchronized scrolling.
  - Review modes: `Two panels` and `Line by line` (slot-style line approval rail).
  - Draft editing and per-layout restore.
  - Quick source magnifier (`M`, hold `Alt`, or toolbar button) with OCR bbox overlays.
  - `Detect` modal with layout selection, model picker, and generation params.
  - OCR extraction is retried per bbox and marked failed if still unsuccessful; failed bboxes stay editable and can be re-detected per layout.
  - Marking OCR reviewed requires resolving failed/missing required bboxes (re-detect or manual text entry).
- All pipeline steps are triggered manually by reviewer action.
Defaults are loaded from `config.yaml` (or the file pointed to by `APP_CONFIG_PATH`).
```yaml
source_dir: input
db_path: data/ocr_dataset.db
result_dir: result
allowed_image_extensions:
  - .jpg
  - .jpeg
  - .png
  - .tif
  - .tiff
  - .webp
enable_background_jobs: true
supported_ocr_models:
  - gemini-3-flash-preview
  - gemini-2.5-flash
gemini_keys: []
```

Environment overrides:

- `SOURCE_DIR`
- `DB_PATH`
- `RESULT_DIR`
- `ALLOWED_IMAGE_EXTENSIONS` (comma-separated)
- `APP_CONFIG_PATH`
- `ENABLE_BACKGROUND_JOBS`
- `SUPPORTED_OCR_MODELS` (comma-separated)
- `GEMINI_KEYS` (comma-separated)
- `GEMINI_USAGE_PATH`
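The override precedence (environment variable over config file over built-in default) for the comma-separated settings can be sketched like this. `resolve_list` is an illustrative helper, not the app's actual loader:

```python
import os

def resolve_list(env_name: str, config: dict, config_key: str, default: list[str]) -> list[str]:
    """Env var (comma-separated) wins over the config file, which wins over the default."""
    raw = os.environ.get(env_name)
    if raw is not None:
        # split on commas and drop empty fragments / surrounding whitespace
        return [item.strip() for item in raw.split(",") if item.strip()]
    return config.get(config_key, default)
```

For example, setting `SUPPORTED_OCR_MODELS=gemini-2.5-flash` would override the `supported_ocr_models` list from `config.yaml`.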
```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload
```

Open http://127.0.0.1:8000.
Backend:

```shell
.venv/bin/python -m unittest discover -s tests -p "test_*.py"
```

Frontend:

```shell
node --test frontend_tests/*.test.mjs
```

API endpoints:

- `POST /api/discovery/scan`
- `POST /api/state/wipe`
- `GET /api/pages` (supports `limit`, `cursor`, `sort`, `dir`)
- `GET /api/pages/summary`
- `DELETE /api/pages/{page_id}`
- `GET /api/pages/{page_id}/layouts`
- `PATCH /api/pages/{page_id}/layout-order-mode`
- `POST /api/pages/{page_id}/layouts/reorder`
- `POST /api/pages/{page_id}/layouts/detect`
- `POST /api/pages/{page_id}/layouts/review-complete`
- `GET /api/pages/{page_id}/ocr-outputs`
- `POST /api/pages/{page_id}/ocr/reextract`
- `POST /api/pages/{page_id}/ocr/review-complete`
- `GET /api/pipeline/activity`
- `GET /api/pipeline/activity/stream`
- `GET /api/layout-benchmark/status`
- `GET /api/layout-benchmark/grid`
- `POST /api/layout-benchmark/run`
- `POST /api/layout-benchmark/stop`
- `GET /api/ocr-batch/status`
- `POST /api/ocr-batch/run`
- `POST /api/ocr-batch/stop`
Prompt source-of-truth (editable):

- `app/ocr_prompts.py`
- `tests/fixtures/ocr_prompt_snapshots.json` (golden prompt snapshots used by tests)

Generate prompt reference markdown deterministically:

```shell
.venv/bin/python scripts/generate_prompt_reference.py
```

- Output: `OCR_PROMPTS_REFERENCE.md`
Gemini OCR response contract:

- Gemini must return JSON with exactly one key: `{"content": "..."}`
- Backend validates the JSON shape and retries per the existing retry policy on invalid responses.
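The contract above can be checked with a small validator. This is a sketch of the rule, not the backend's actual validation code; `parse_ocr_response` is a hypothetical name:

```python
import json

def parse_ocr_response(raw: str) -> str:
    """Accept only a JSON object with exactly one key, "content", holding a string."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or set(payload) != {"content"}:
        raise ValueError("expected a JSON object with exactly one key: 'content'")
    if not isinstance(payload["content"], str):
        raise ValueError("'content' must be a string")
    return payload["content"]
```

On `ValueError` the caller would re-prompt Gemini per the retry policy instead of storing the response.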
Each OCR extraction run writes resolved text prompts (without image clip bytes) to:
`_artifacts/ocr_prompts/<timestamp>_page_<page_id>.jsonl`
Each JSONL row includes page/layout identifiers, class, output format, and the exact prompt sent to Gemini.
This section is a living log of OCR normalization decisions for dataset consistency. Add new items as rules are agreed.
- Multiline emphasis in source text:
  - If text is visually italic, bold, or bold+italic across multiple lines, apply Markdown markers per line (not once for the full block).
  - Use `*line*` for italic, `**line**` for bold, and `***line***` for bold+italic.
  - Start and end every affected line with its corresponding marker.
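The per-line marker rule can be expressed as a small helper (illustrative only; `emphasize_lines` is not part of the app):

```python
MARKERS = {"italic": "*", "bold": "**", "bold_italic": "***"}

def emphasize_lines(text: str, style: str) -> str:
    """Wrap every non-empty line with the Markdown marker for the given style."""
    marker = MARKERS[style]
    return "\n".join(
        f"{marker}{line}{marker}" if line.strip() else line
        for line in text.split("\n")
    )
```

So a two-line bold block becomes `**first**` / `**second**`, never a single `**` pair spanning both lines.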
- Diacritics policy (strict):
  - Preserve diacritics exactly as visible in source text; do not simplify.
  - Ground-truth text keeps stressed/diacritic forms (not stripped variants).
  - Treat script lookalikes as different characters (Cyrillic vs Latin are not interchangeable).
  - Store text in NFC form for consistency, but keep diacritic meaning unchanged.
- Diacritics examples:
  - Keep `А́` (U+0410 + U+0301, Cyrillic `А` + combining stress), not plain `А` (U+0410).
  - Keep `ё` (U+0451), not `е` (U+0435).
  - Keep `ә` (U+04D9), not Latin `a` (U+0061).
  - Do not replace Cyrillic `А́` (U+0410 + U+0301) with Latin `Á` (U+00C1) or Latin `Á` (U+0041 + U+0301).
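The NFC rule and the script-lookalike distinction can be verified with the standard library:

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Normalize to NFC; combining marks compose only where a precomposed form exists."""
    return unicodedata.normalize("NFC", text)

# Cyrillic А + combining acute has no precomposed form, so NFC keeps both codepoints:
stressed = "\u0410\u0301"
assert to_nfc(stressed) == stressed

# Latin A + combining acute composes to precomposed Á (U+00C1) under NFC:
assert to_nfc("\u0041\u0301") == "\u00c1"

# Cyrillic А and Latin A look identical but are distinct characters:
assert "\u0410" != "\u0041"
```

This is why storing NFC does not conflict with the preservation rules: composition changes only the codepoint representation, never the diacritic meaning or the script.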
- Header hierarchy policy:
  - Default all extracted headers to level 4 (`####`).
  - If page structure clearly shows hierarchy, adjust header levels.
  - Use fewer `#` for higher-level headers: `###` for higher, `##` for top-level on the page.
  - Keep `####` for lower/subordinate headers when they are visually less prominent.
  - Keep header levels consistent within the same page/document section.
- Header hierarchy example:
  - Top header: `## Chapter title`
  - Subheader: `### Section title`
  - Lower subheader: `#### Subsection title`
- Multiline headers policy:
  - Do not split one semantic header across multiple Markdown lines with only the first line marked as `#`.
  - Do not encode continuation lines as bold paragraphs to imitate header continuation.
  - Keep one semantic header as one Markdown header line by joining wrapped source lines into a single heading text.
  - If a visual line break must be preserved inside a heading, use an explicit HTML break inside the same header (for example: `## First line<br>Second line`).
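The joining rule can be sketched as a helper (illustrative; `join_wrapped_header` is not part of the app):

```python
def join_wrapped_header(lines: list[str], level: int = 2, keep_break: bool = False) -> str:
    """Merge visually wrapped source lines into a single Markdown heading.

    With keep_break=True the visual break survives as <br> inside the same
    heading, instead of splitting the header across Markdown lines.
    """
    joiner = "<br>" if keep_break else " "
    text = joiner.join(line.strip() for line in lines if line.strip())
    return f"{'#' * level} {text}"
```

Either way the output stays a single `#`-prefixed line, so the semantic header is never split.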
- Error preservation policy:
  - Dataset policy is to keep source text as printed, including typos, spelling mistakes, and grammar irregularities.
  - OCR/review normalization does not correct linguistic errors automatically.
  - When the intended wording seems obvious, the original printed form is still retained.
  - Punctuation or casing anomalies are retained when they are part of the source.
  - Corrections are limited to clear OCR character misreads, without changing wording/style.
This repository keeps active project documentation in only two files:

- `README.md` (product + usage)
- `AGENTS.md` (engineering collaboration rules)