codex-forge

AI-first, modular pipeline for turning scanned books into structured JSON with full traceability.

Pipeline Architecture

The pipeline follows a 5-stage model:

  1. Intake → IR (generic): PDF/images → structured elements (Unstructured library provides rich IR with text, types, coordinates, tables)
  2. Verify IR (generic): QA checks on completeness, page coverage, element quality
  3. Portionize (domain-specific): Identify logical portions (CYOA sections, genealogy chapters, textbook problems) and reference IR elements
  4. Augment (domain-specific): Enrich portions with domain data (choices/combat for CYOA, relationships for genealogy)
  5. Export (format-specific): Output to target format (FF Engine JSON, HTML, Markdown) using IR + augmentations

Stages 1-2 are universal across all document types. Stages 3-4 vary by domain (gamebooks vs genealogies vs textbooks). Stage 5 is tied to output requirements (precise layout for PDF, simplified for Markdown).

Reusability goal: keep upstream intake/OCR modules as generic as possible. Push booktype-specific heuristics and normalization (e.g., gamebook navigation-phrase canonicalization, FF conventions) downstream into booktype-aware portionize/extract/enrich/export modules or recipe-scoped adapters, so the OCR stack can be reused across book types.

The Intermediate Representation (IR) stays unchanged throughout; portionization and augmentation annotate/reference it rather than transforming it.
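
For intuition, portions and augmentations point at IR elements by id instead of copying or rewriting them. The record shapes below are a minimal sketch, not the pipeline's actual artifact schemas:

# Illustrative shapes only; the real schemas live in the module definitions.
ir_element = {"element_id": "p012_e03", "page": 12, "type": "Section-header", "text": "27"}

portion = {
    "portion_id": "section-27",
    "element_ids": ["p012_e03", "p012_e04"],  # references into the IR, not copies
}

augmentation = {
    "portion_id": "section-27",
    "choices": [{"text": "turn to 142", "target": "142"}],  # annotates; the IR is untouched
}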

What it does (today)

  • Ingest PDF or page images → structured element IR (Unstructured or OCR-based)
  • Multimodal LLM cleaning → per-page clean text + confidence
  • Sliding-window portionization (LLM, optional priors, multimodal) → portions reference IR elements
  • Consensus/dedupe/normalize, resolve overlaps, guarantee coverage
  • Assemble per-portion JSON (page spans, source images, raw_text from IR)
  • Run outputs stored under output/runs/<run_id>/ with manifests and state

Edgecase Scanner + Patch Workflow (Post-Extraction)

Use this when you want a targeted, auditable pass over extracted gameplay logic without baking book-specific hacks into core modules.

High-level flow:

  1. Extract turn_to_links early (anchor-derived) during portionization.
  2. Downstream extractors claim links (combat/luck/stat checks/choices) via turn_to_claims.
  3. Reconcile claimed vs. total links → unclaimed targets are high-confidence edge cases (a sketch of this step follows the list).
  4. Scan the gamebook for edgecase patterns and emit a structured report.
  5. AI verify only flagged sections → emit patch JSONL (empty when correct).
  6. Apply patches deterministically (opt-in via recipe) to produce a patched gamebook.
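
Step 3 is essentially a set difference over link records. A minimal sketch, assuming each record carries section and target fields (the real turn_to_links/turn_to_claims JSONL schemas may differ):

def unclaimed_links(turn_to_links, turn_to_claims):
    """Links extracted during portionization that no downstream extractor claimed."""
    claimed = {(c["section"], c["target"]) for c in turn_to_claims}
    return [
        link for link in turn_to_links
        if (link["section"], link["target"]) not in claimed
    ]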

Recommended run (reuse an existing full run; do not re-run OCR):

python driver.py \
  --recipe configs/recipes/recipe-ff-ai-ocr-gpt51-resume-edgecase-scan.yaml \
  --run-id edgecase-scan-<run_id> \
  --output-dir output/runs/edgecase-scan-<run_id>

Artifacts to inspect:

  • output/runs/<edgecase-run>/04_turn_to_link_reconciler_v1/turn_to_unclaimed.jsonl
  • output/runs/<edgecase-run>/05_edgecase_scanner_v1/edgecase_scan.jsonl
  • output/runs/<edgecase-run>/06_edgecase_ai_patch_v1/edgecase_patches.jsonl
  • output/runs/<edgecase-run>/07_apply_edgecase_patches_v1/gamebook_patched.json

Run Configuration (Simplified Workflow)

Running the pipeline via CLI flags can be error-prone. Use the simplified workflow with run configuration files.

1. Create a run configuration template

python tools/run_manager.py create-run my-new-run

This generates output/runs/my-new-run/config.yaml.

2. Edit the configuration

Customize output/runs/my-new-run/config.yaml with your recipe, input PDF, and options.

Key Concept: The recipe defines the logic (stages), while this config.yaml defines the context (input PDF, output directory, run ID).

3. Execute the run

python tools/run_manager.py execute-run my-new-run

You can still pass additional CLI overrides if needed:

python tools/run_manager.py execute-run my-new-run --dry-run

Repository layout

  • CLI modules/scripts: pages_dump.py, clean_pages.py, portionize.py, consensus.py, dedupe_portions.py, normalize_portions.py, resolve_overlaps.py, build_portion_text.py, etc.
  • docs/requirements.md: system requirements
  • snapshot.md: current status and pipeline notes
  • output/: git-ignored; run artifacts live at output/runs/<run_id>/
    • Artifact organization: Each module has its own folder {ordinal:02d}_{module_id}/ (e.g., 01_extract_ocr_ensemble_v1/) containing its artifacts
    • Final outputs: gamebook.json stays in root for easy access
    • Game-ready package: output/runs/<run_id>/output/ (contains gamebook.json, validator/, and README)
    • Pipeline metadata: pipeline_state.json, pipeline_events.jsonl, snapshots/ in root
  • settings.example.yaml: sample config
  • Driver snapshots: each run writes snapshots/ (recipe.yaml, plan.json, registry.json, optional settings/pricing/instrumentation configs) and records paths in output/run_manifest.jsonl for reproducibility.
  • Shared helpers for module entrypoints live in modules/common/ (utils, OCR helpers).

Modular driver (current)

  • Modules live under modules/<stage>/<module_id>/; recipes live in configs/recipes/.
  • Driver orchestrates stages, stamps artifacts with schema/module/run IDs, and tracks state in pipeline_state.json.
  • Swap modules by changing the recipe, e.g. OCR vs text ingest.

Fighting Fantasy Book Structure

Running Headers (Section Ranges):

  • Fighting Fantasy gamebooks use running headers in the upper corners of gameplay pages
  • Left page (L): Shows section range (e.g., "9-10", "18-21") indicating which sections are on that page
  • Right page (R): Shows single section number (e.g., "22") or range indicating sections on that page
  • These are NOT page numbers; they indicate which gameplay sections (1-400) appear on the page
  • Format: Either ranges like "X-Y" (sections X through Y) or single numbers like "Z" (section Z only)
  • Position: Upper outside corners (top-left for left pages, top-right for right pages)

Coordinate System Note:

  • OCR engines may use different coordinate systems (standard: y=0=top, inverted: y=0=bottom)
  • Running headers at top corners may have high y values (0.9+) if coordinate system is inverted
  • Pattern detection must account for this when identifying top vs bottom positions (see the sketch below)
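
A minimal sketch of both ideas, assuming normalized [0, 1] coordinates and an is_inverted flag supplied by the engine adapter (names and thresholds are illustrative):

import re

HEADER_RE = re.compile(r"^(\d{1,3})(?:-(\d{1,3}))?$")  # "22" or "18-21"

def parse_running_header(text):
    """Return the (start, end) section range for a running header, or None."""
    m = HEADER_RE.match(text.strip())
    if not m:
        return None
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else start
    if not (1 <= start <= 400 and start <= end <= 400):
        return None  # outside the FF gameplay range; not a running header
    return (start, end)

def is_top_corner(y, is_inverted, margin=0.1):
    """True if a normalized y coordinate sits in the top band of the page."""
    return y >= 1.0 - margin if is_inverted else y <= margin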

Legacy OCR Ensemble Recipe Reference (Archived)

Current canonical recipe: configs/recipes/recipe-ff-ai-ocr-gpt51.yaml (GPT-5.1 AI-first OCR, HTML-first output). The legacy OCR-ensemble recipe is archived at configs/recipes/legacy/recipe-ff-canonical.yaml; the module list below is preserved for historical reference.

Intake Stage

01. extract_ocr_ensemble_v1 (Code + AI escalation)

  • What it does: Runs multiple OCR engines (Tesseract, EasyOCR, Apple Vision, PDF text) in parallel and combines results with voting/consensus
  • Why: Different engines excel at different fonts/layouts; ensemble improves accuracy
  • Try: Code (multi-engine OCR)
  • Validate: Code (disagreement scoring)
  • Escalate: AI (GPT-4V vision transcription for high-disagreement pages)

02. easyocr_guard_v1 (Code)

  • What it does: Validates that EasyOCR produced text for sufficient pages
  • Why: EasyOCR is primary engine; missing output indicates critical failure
  • Type: Code-only validation guard

03. pick_best_engine_v1 (Code)

  • What it does: Selects the best OCR engine output per page based on quality metrics, preserves standalone numeric headers from all engines
  • Why: Reduces noise while preserving critical section headers that might only appear in one engine
  • Type: Code-only selection

04. inject_missing_headers_v1 (Code)

  • What it does: Scans raw OCR engine outputs for numeric headers (1-400) missing from the picked output and injects them (a sketch follows this entry)
  • Why: Critical for 100% section coverage; headers can be lost during engine selection
  • Type: Code-only injection
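
In spirit the injection step is a set difference over section numbers. A hedged sketch (record shapes are illustrative, not the module's actual artifact schema):

def find_missing_headers(picked_headers, engine_outputs):
    """Headers (1-400) seen by any engine but absent from the picked output."""
    expected = set(range(1, 401))
    picked = set(picked_headers)
    candidates = {}
    for engine, headers in engine_outputs.items():
        for h in headers:
            if h in expected and h not in picked:
                candidates.setdefault(h, engine)  # remember which engine saw it first
    return candidates  # {section_number: source_engine}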

05. ocr_escalate_gpt4v_v1 (AI)

  • What it does: Re-transcribes high-disagreement or low-quality pages using GPT-4V vision model
  • Why: Vision models can read corrupted/scanned text that OCR engines miss
  • Type: AI escalation (targeted, budget-capped)

06. merge_ocr_escalated_v1 (Code)

  • What it does: Merges original OCR pages with escalated GPT-4V pages into unified final OCR output
  • Why: Creates single authoritative OCR artifact for downstream stages
  • Type: Code-only merge

07. reconstruct_text_v1 (Code)

  • What it does: Merges fragmented OCR lines into coherent paragraphs while preserving section boundaries
  • Why: Cleaner text improves downstream AI accuracy and human readability
  • Type: Code-only reconstruction

08. pagelines_to_elements_v1 (Code)

  • What it does: Converts pagelines IR (OCR output) into elements_core.jsonl (structured element IR)
  • Why: Standardizes format for downstream portionization stages
  • Type: Code-only transformation

09. elements_content_type_v1 (Code + optional AI)

  • What it does: Classifies elements into DocLayNet types (Section-header, Text, Page-footer, etc.) using text-first heuristics
  • Why: Content type tags enable code-first boundary detection (filters for Section-header)
  • Try: Code (heuristic classification)
  • Escalate: Optional AI (LLM classification for low-confidence items, disabled by default)

Portionize Stage

10. coarse_segment_v1 (AI)

  • What it does: Single LLM call to classify entire book into frontmatter/gameplay/endmatter page ranges
  • Why: Establishes macro boundaries before fine-grained section detection
  • Type: AI classification (one call for entire book)

11. fine_segment_frontmatter_v1 (AI)

  • What it does: Divides frontmatter section into logical portions (title, copyright, TOC, rules, etc.)
  • Why: Structures non-gameplay content for completeness
  • Type: AI segmentation

12. classify_headers_v1 (AI)

  • What it does: Batched AI calls to classify elements as macro headers, game section headers, or neither
  • Why: Provides header candidates for global structure analysis
  • Type: AI classification (batched, forward/backward redundancy)

13. structure_globally_v1 (AI, currently stubbed)

  • What it does: Single AI call to create coherent global document structure from header candidates
  • Why: Creates ordered section structure with macro sections and game sections
  • Type: AI structuring (currently skipped via stub)

14. detect_boundaries_code_first_v1 (Code + AI escalation)

  • What it does: Code-first section boundary detection with targeted AI escalation for missing sections
  • Why: Replaces expensive batched AI with free code filter + 0-30 targeted AI calls; achieves 95%+ coverage
  • Try: Code (filters elements_core_typed for Section-header with valid numbers, applies multi-stage validation)
  • Validate: Code (coverage check vs target)
  • Escalate: AI (targeted re-scan of pages with missing sections using GPT-5)
  • Type: Code-first with AI escalation

15. portionize_ai_scan_v1 (AI, fallback)

  • What it does: Full-document AI scan for section boundaries (fallback if code-first fails)
  • Why: Backup method if code-first detection misses too many sections
  • Type: AI fallback

16. macro_locate_ff_v1 (AI)

  • What it does: Identifies frontmatter/main_content/endmatter pages from minimal OCR text
  • Why: Provides macro section hints for structure analysis
  • Type: AI location

17. merge_boundaries_pref_v1 (Code)

  • What it does: Merges primary boundary set with fallback, preferring primary and filling gaps
  • Why: Combines code-first results with AI fallback for maximum coverage
  • Type: Code-only merge

18. verify_boundaries_v1 (Code + optional AI)

  • What it does: Validates section boundaries with deterministic checks (ordering, duplicates) and optional AI spot-checks
  • Why: Catches boundary errors before expensive extraction stage
  • Try: Code (deterministic validation)
  • Escalate: Optional AI (spot-checks sampled boundaries for mid-sentence starts)
  • Type: Code validation with optional AI sampling

19. validate_boundary_coverage_v1 (Code)

  • What it does: Ensures boundary set covers expected section IDs and meets minimum count
  • Why: Fails fast if coverage is too low
  • Type: Code-only validation

20. validate_boundaries_gate_v1 (Code)

  • What it does: Final gate check before extraction (count, ordering, gaps)
  • Why: Prevents proceeding with invalid boundary set
  • Type: Code-only gate

21. portionize_ai_extract_v1 (AI)

  • What it does: Extracts section text from elements and parses gameplay data (choices, combat, luck tests, items) using AI
  • Why: AI understands context and can extract structured gameplay data from narrative text
  • Type: AI extraction (per-section calls)

22. repair_candidates_v1 (Code)

  • What it does: Detects sections needing repair (garbled text, low alpha ratio, high digit ratio) using heuristics (sketched after this entry)
  • Why: Identifies problematic sections before expensive repair stage
  • Type: Code-only detection
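
The ratios are simple character-class statistics. A minimal sketch, with thresholds invented for illustration (the module's real cutoffs are params):

def looks_garbled(text, min_alpha=0.6, max_digit=0.2):
    """Flag text whose character mix suggests OCR damage (illustrative thresholds)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True  # empty sections are repair candidates too
    alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
    digit_ratio = sum(c.isdigit() for c in chars) / len(chars)
    return alpha_ratio < min_alpha or digit_ratio > max_digit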

23. repair_portions_v1 (AI)

  • What it does: Re-reads flagged sections with multimodal LLM (GPT-5) to repair garbled text
  • Why: Vision models can transcribe corrupted text that OCR missed
  • Type: AI repair (targeted, budget-capped)

24. strip_section_numbers_v1 (Code)

  • What it does: Removes section/page number artifacts from section text while preserving paragraph structure
  • Why: Clean text for final gamebook output
  • Type: Code-only cleaning

Extract Stage

25. extract_choices_v1 (Code + optional AI)

  • What it does: Extracts choices from section text using deterministic pattern matching ("turn to X", "go to Y"); a sketch follows this entry
  • Why: Code-first approach is faster, cheaper, and more reliable than pure AI extraction
  • Try: Code (pattern matching)
  • Escalate: Optional AI (validation for ambiguous cases, disabled by default)
  • Type: Code-first with optional AI validation
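
A hedged sketch of the deterministic pass; the module's actual pattern set and output schema are richer than this:

import re

CHOICE_RE = re.compile(r"(?:turn to|go to)\s+(\d{1,3})", re.IGNORECASE)

def extract_choices(section_text):
    """Return candidate (phrase, target_section) pairs from section text."""
    return [
        (m.group(0), int(m.group(1)))
        for m in CHOICE_RE.finditer(section_text)
        if 1 <= int(m.group(1)) <= 400
    ]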

Build Stage

26. build_ff_engine_v1 (Code)

  • What it does: Assembles final gamebook.json from portions with choices, combat, items, etc.
  • Why: Creates final output format for game engine consumption
  • Type: Code-only assembly
  • Output note: Gameplay flow is encoded in ordered sequence events (replaces legacy navigation).

Combat Outcome Conventions (Sequence Events)

  • Combat requires outcomes: every combat event must include outcomes.win.
  • Outcome refs: outcomes.{win,lose,escape} are OutcomeRef objects with either targetSection or terminal.
  • Continue in-section: when combat win immediately continues within the same section (e.g., “Test your Luck”), set outcomes.win = { terminal: { kind: "continue" } } and add a player_round_win trigger to indicate the round count if stated.
  • Triggers: use triggers for mid-combat conditions (e.g., enemy_attack_strength_total, enemy_round_win, player_round_win).
  • Split-target fights: multi-part enemies (e.g., pincers/heads) are represented as multiple enemies with mode: "split-target" and structured rules; avoid single-enemy split-target output.

Example (basic win/lose):

{
  "kind": "combat",
  "mode": "single",
  "enemies": [{"enemy": "CAVE BEAST", "skill": 7, "stamina": 8}],
  "outcomes": {
    "win": {"targetSection": "163"},
    "lose": {"terminal": {"kind": "death"}}
  }
}

Example (win continues in-section with Test Your Luck):

{
  "kind": "combat",
  "mode": "single",
  "enemies": [{"enemy": "BLOODBEAST", "skill": 12, "stamina": 10}],
  "triggers": [{
    "kind": "player_round_win",
    "count": 1,
    "outcome": {"terminal": {"kind": "continue"}}
  }],
  "outcomes": {"win": {"terminal": {"kind": "continue"}}}
}

Validate Stage

27. validate_ff_engine_node_v1 (Node/AJV)

  • What it does: Canonical schema validator shared with the game engine (Node + Ajv)
  • Why: Ensures pipeline/game engine use identical validation logic
  • Type: Node validator (bundled, portable)
  • Scope: Generic across Fighting Fantasy books (not tuned to a specific title)
  • Ship: Copy gamebook.json plus modules/validate/validate_ff_engine_node_v1/validator/gamebook-validator.bundle.js into the game engine bundle, then run node gamebook-validator.bundle.js gamebook.json --json before loading.
  • Validation notes:
    • Combat events must include outcomes.win (required by schema).
    • Missing-section checks use metadata.sectionCount when present; otherwise they fall back to provenance.expected_range, then to the default 1–400 (see the sketch below).
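
The fallback order reads as a three-step chain. A minimal sketch (key paths mirror the note above but are illustrative):

def expected_sections(gamebook):
    """Resolve the expected section range for missing-section checks."""
    count = gamebook.get("metadata", {}).get("sectionCount")
    if count:
        return range(1, count + 1)
    rng = gamebook.get("provenance", {}).get("expected_range")
    if rng:
        return range(rng[0], rng[1] + 1)
    return range(1, 401)  # Fighting Fantasy default: sections 1-400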

28. forensics_gamebook / validate_ff_engine_v2 (Code)

  • What it does: Forensic validation (missing sections, duplicates, empty sections, structural issues)
  • Why: Provides detailed traces for debugging and repair; not the canonical schema validator
  • Type: Code-only validation

29. validate_choice_completeness_v1 (Code)

  • What it does: Compares "turn to X" references in section text with extracted choices to find missing choices (a sketch follows this entry)
  • Why: Critical for 100% game engine accuracy; missing choices break gameplay
  • Type: Code-only validation (pattern matching + comparison)
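
Conceptually this is another set difference: targets referenced in prose minus targets present in extracted choices. A sketch under illustrative schemas:

import re

TURN_TO_RE = re.compile(r"turn to\s+(\d{1,3})", re.IGNORECASE)

def missing_choices(section_text, extracted_choices):
    """Targets mentioned in the text but absent from the extracted choices."""
    referenced = {int(m.group(1)) for m in TURN_TO_RE.finditer(section_text)}
    extracted = {int(c["target"]) for c in extracted_choices}
    return sorted(referenced - extracted)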

Two Ways to Run the Pipeline

1. Regular Production Runs (output in output/runs/)

  • Purpose: Real pipeline runs that should be preserved and tracked
  • Location: Artifacts go to output/runs/<run_id>/ (default or from recipe)
  • When to use: Actual book processing, production runs, runs you want to keep
  • Manifest: Automatically registered in output/run_manifest.jsonl for tracking
  • Example:
    # Full canonical FF recipe run (GPT-5.1 OCR; no ARM64/MPS requirement)
    python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id deathtrap-dungeon-20251225
    
    # With instrumentation
    python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id deathtrap-dungeon-20251225 --instrument

2. Temporary Test Runs (output in /tmp or /private/tmp)

  • Purpose: Quick testing, development, debugging, AI agent experimentation
  • Location: Artifacts go to /tmp or /private/tmp (via --output-dir override)
  • When to use:
    • Testing new modules or recipe changes
    • Debugging pipeline issues
    • AI agents doing temporary test runs during development
    • Quick smoke tests on subsets
  • Not tracked: These runs are NOT registered in output/run_manifest.jsonl (they're temporary)
  • Example:
    # Temporary test run (AI agents use this for development/testing)
    python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml \
      --run-id cf-ff-ai-ocr-gpt51-test \
      --output-dir /private/tmp/cf-ff-ai-ocr-gpt51-test \
      --force
    
    # Smoke test with subset (GPT-5.1 OCR; no ARM64/MPS requirement)
    python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml \
      --settings configs/settings.ff-ai-ocr-gpt51-smoke-20.yaml \
      --run-id ff-ai-ocr-gpt51-smoke-20 \
      --output-dir /tmp/cf-ff-ai-ocr-gpt51-smoke-20 \
      --force

Key Differences:

  • Regular runs: Use default output/runs/<run_id>/ (or recipe output_dir), registered in manifest
  • Temporary runs: Use --output-dir to override to /tmp or /private/tmp, NOT registered in manifest
  • AI Agents: Should use temporary runs (--output-dir /private/tmp/...) for testing/development, and only use regular runs for actual production work

Smoke Tests (Quick Reference)

  • Canonical smoke (current pipeline): configs/recipes/recipe-ff-ai-ocr-gpt51.yaml + configs/settings.ff-ai-ocr-gpt51-smoke-20.yaml
  • Offline fixture smoke (no external calls): configs/recipes/recipe-ff-smoke.yaml (uses testdata/smoke/ff/)
  • Legacy/archived smoke: configs/recipes/legacy/recipe-ocr-coarse-fine-smoke.yaml and configs/settings.ff-canonical-smoke*.yaml (legacy OCR pipeline)

Common Driver Commands

# Dry-run legacy OCR recipe (archived)
python driver.py --recipe configs/recipes/legacy/recipe-ocr.yaml --dry-run

# Text ingest with mock LLM stages (for tests without API calls)
python driver.py --recipe configs/recipes/recipe-text.yaml --mock --skip-done

# OCR pages 1–20 real run (auto-generated run_id/output_dir by default)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --force

# Reuse a specific run_id/output_dir (opt-in)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id myrun --allow-run-id-reuse

# Resume legacy OCR run from portionize onward (reuses cached clean pages)
python driver.py --recipe configs/recipes/legacy/recipe-ocr.yaml --skip-done --start-from portionize_fine

# Swap modules: edit configs/recipes/*.yaml to choose a different module per stage
# (e.g., set stage: extract -> module: extract_text_v1 instead of extract_ocr_v1)

Runtime note: full non-mock OCR on the 113-page sample typically takes ~35–40 minutes for the portionize/LLM window stage (gpt-4.1-mini + boost gpt-5). Use --skip-done with --start-from/--end-at to resume or scope reruns without re-cleaning pages.

Each run emits a lightweight timing_summary.json in the run directory with wall seconds per stage (and pages/min for intake/extract when available).

Apple Silicon vs x86_64 (legacy OCR + hi_res notes)

  • Canonical GPT-5.1 OCR runs on any arch; no MPS requirement.
  • Prefer the ARM64 Python env on Apple Silicon for legacy Unstructured hi_res intake: ~/miniforge3/envs/codex-arm/bin/python (reports platform.machine() == "arm64"). Unstructured hi_res runs successfully here and yields far better header/section recall.
  • On x86_64 (Rosetta) the TensorFlow build expects AVX and forces legacy hi_res to fall back to strategy: fast, which markedly reduces header detection and downstream section coverage.
  • Legacy OCR ensemble recipes (archived under configs/recipes/legacy/) defaulted to strategy: hi_res and rely on EasyOCR; these notes apply only to legacy recipes.
  • EasyOCR auto-uses GPU when Metal/MPS is available (Apple Silicon) and falls back to CPU otherwise; no flags needed.
  • Use --allow-run-id-reuse only if you explicitly want to reuse an existing run directory; defaults now auto-generate a fresh run_id/output_dir per run.
  • Metal-friendly env recipe (legacy EasyOCR; pins torch 2.9.1 / torchvision 0.24.1 / Pillow<13):
    conda create -n codex-arm-mps python=3.11
    conda activate codex-arm-mps
    pip install --no-cache-dir -r requirements-legacy-easyocr.txt -c constraints/metal.txt
    python - <<'PY'
    import torch; print(torch.__version__, torch.backends.mps.is_available())
    PY
  • If mps.is_available() is false, you are on the wrong arch or missing the Metal wheel.
  • After a GPU smoke run, sanity-check that EasyOCR used MPS:
    python scripts/regression/check_easyocr_gpu.py --debug-file /tmp/cf-easyocr-mps-5/ocr_ensemble/easyocr_debug.jsonl
  • One-shot local smoke + check:
    ./scripts/smoke_easyocr_gpu.sh /tmp/cf-easyocr-mps-5
  • MPS troubleshooting: ensure platform.machine() == "arm64", Xcode CLTs are installed, and you are using the arm64 Python from the codex-arm-mps env. Reinstall with the Metal constraints if torch reports mps.is_available() == False.
  • Keep the "hi_res first, fast fallback" knob: run ARM hi_res by default, and only flip to settings.fast-intake.yaml when the environment lacks ARM/AVX. Prior runs showed a large coverage drop when forced to fast, so treat fast as a compatibility fallback, not a peer mode.
  • Recommended full run on ARM:
    ~/miniforge3/envs/codex-arm/bin/python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id <run> --output-dir <dir> --force
  • macOS-only Vision OCR: a new module extract_ocr_apple_v1 (and optional apple engine in extract_ocr_ensemble_v1) uses VNRecognizeTextRequest. It compiles a Swift helper at runtime; only available on macOS with Xcode CLTs installed.
    • Sandbox caveat: In restricted/sandboxed execution, Apple Vision can fail with errors like sysctlbyname for kern.hv_vmm_present failed (and emit empty/no apple text). If you hit this, run the OCR stage outside the sandbox / with full host permissions, or disable apple for that run.

DAG recipes (coarse+fine merge example)

# Dry-run canonical OCR (GPT-5.1)
python driver.py --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --dry-run

# Text ingest DAG with mock LLM stages (fast, no API calls)
python driver.py --recipe configs/recipes/recipe-text-dag.yaml --mock --skip-done

# Quick smoke: coarse+fine+continuation on first 10 pages (legacy, archived)
python driver.py --recipe configs/recipes/legacy/recipe-ocr-coarse-fine-smoke.yaml --force

# Continuation regression check (after a run)
python scripts/regression/check_continuation_propagation.py \
  --hypotheses output/runs/deathtrap-ocr-dag/adapter_out.jsonl \
  --locked output/runs/deathtrap-ocr-dag/portions_locked_merged.jsonl \
  --resolved output/runs/deathtrap-ocr-dag/portions_resolved.jsonl

Key points:

  • Stages have ids and needs; driver topo-sorts and validates schemas.
  • Override per-stage outputs via either a stage-level out: key (highest precedence) or the recipe-level outputs: map.
  • Removed (Story 025): image_crop_cv_v1, portionize_page_v1, portionize_numbered_v1, merge_portion_hyp_v1, consensus_spanfill_v1, enrich_struct_v1, build_appdata_v1; demo/alt recipes using them were deleted.

Parameter validation & output overrides

  • Each module can declare param_schema (JSON-Schema-lite) in its module.yaml to type-check params before the run. Supported fields per param: type (string|number|integer|boolean), enum, minimum/maximum, pattern, default. Mark a param required via a top-level required list or required: true on the property. A sketch of the check appears after the example snippet below.
  • Driver merges default_params + recipe params, applies schema defaults, and fails fast on missing/unknown/invalid params with a message that includes the stage id and module id.
  • Example failure message: Param 'min_conf' on stage 'clean_pages' (module clean_llm_v1) expected type number, got str.
  • Set custom filenames per stage with out: inside the stage config; this overrides recipe outputs: and the built-in defaults, and the resolved name is used for resume/skip-done and downstream inputs.
  • Example snippet with stage-level out:
    stages:
      - id: clean_pages
        stage: clean
        module: clean_llm_v1
        needs: [extract_text]
        out: pages_clean_custom.jsonl
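
For intuition, the check can be sketched as a small validator. This is illustrative only, not the driver's implementation, and it elides enum/minimum/maximum/pattern handling:

TYPE_MAP = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

def validate_params(stage_id, module_id, schema, params):
    """Sketch of JSON-Schema-lite param checking (defaults, required, types)."""
    props = schema.get("properties", {})
    merged = {**{k: v["default"] for k, v in props.items() if "default" in v}, **params}
    missing = [k for k in schema.get("required", []) if k not in merged]
    unknown = [k for k in merged if k not in props]
    if missing or unknown:
        raise ValueError(
            f"stage '{stage_id}' (module {module_id}): missing={missing}, unknown={unknown}"
        )
    for name, value in merged.items():
        kind = props[name]["type"]
        ok = isinstance(value, TYPE_MAP[kind])
        if isinstance(value, bool) and kind != "boolean":
            ok = False  # bool is an int subclass in Python; reject it for numeric params
        if not ok:
            raise ValueError(
                f"Param '{name}' on stage '{stage_id}' (module {module_id}) "
                f"expected type {kind}, got {type(value).__name__}"
            )
    return merged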

Artifacts appear under output/runs/<run_id>/ as listed in the recipe; use --skip-done to resume and --force to rerun stages.

Output conventions

  • output/runs/<run_id>/ contains all artifacts: images/, ocr/, pages_raw/clean, hypotheses, locked/normalized/resolved portions, final JSON, pipeline_state.json.
  • output/run_manifest.jsonl lists runs (id, path, date, notes).

Instrumentation (timing & cost)

  • Enable per-stage timing and LLM cost reporting with --instrument (off by default).
  • Optional price sheet override via --price-table configs/pricing.default.yaml or recipe instrumentation.price_table.
  • Outputs land beside artifacts: instrumentation.json (machine-readable), instrumentation.md (summary tables), and raw instrumentation_calls.jsonl when present. Manifest entries link to the reports.
  • Modules can emit call-level usage via modules.common.utils.log_llm_usage(...); the driver aggregates tokens/costs per stage and per model.

Run monitoring

  • Preferred: scripts/run_driver_monitored.sh (spawns driver, writes driver.pid, tails pipeline_events.jsonl).
    • Example: scripts/run_driver_monitored.sh --recipe configs/recipes/recipe-ff-ai-ocr-gpt51.yaml --run-id <run_id> --output-dir output/runs -- --instrument
    • Important: run_driver_monitored.sh expects --output-dir to be the parent (e.g., output/runs) and passes the full run dir to driver.py. Do not pass a run-specific path.
    • If you pass --force, the script pre-deletes the run dir, strips --force, and adds --allow-run-id-reuse so the driver accepts the created run dir without wiping the log/pidfile mid-run.
  • Attach to an existing run: scripts/monitor_run.sh output/runs/<run_id> output/runs/<run_id>/driver.pid 5
  • Foreground one-liner (useful if background terminal support interferes):
    • while true; do date; tail -n 1 output/runs/<run_id>/pipeline_events.jsonl; sleep 60; done
  • Crash visibility: prefer scripts/run_driver_monitored.sh so stderr is captured in driver.log. scripts/monitor_run.sh now tails driver.log when the PID disappears to surface hard failures (e.g., OpenMP SHM errors).
  • scripts/monitor_run.sh also appends a synthetic run_monitor failure event to pipeline_events.jsonl when the driver PID disappears, so tailing events shows the crash.
  • scripts/run_driver_monitored.sh runs scripts/postmortem_run.sh on exit to append a run_postmortem failure event when the PID is gone.

Cost/perf presets and benchmarks

  • Preset settings live in configs/presets/:
    • speed.text.yaml (text recipe, gpt-4.1-mini, ~8s/page, ~$0.00013/page)
    • cost.ocr.yaml (OCR, gpt-4.1-mini, ~13–18s/page, ~$0.0011/page)
    • balanced.ocr.yaml (OCR, gpt-4.1, ~16–34s/page, ~$0.014–0.026/page)
    • quality.ocr.yaml (OCR, gpt-5, ~70–100s/page, ~$0.015–0.020/page)
  • Use with the driver by passing --settings, e.g.:
    python driver.py --recipe configs/recipes/recipe-text.yaml --settings configs/presets/speed.text.yaml --instrument
    python driver.py --recipe configs/recipes/legacy/recipe-ocr.yaml  --settings configs/presets/cost.ocr.yaml --instrument
  • Bench sessions write metrics to output/runs/bench-*/bench_metrics.csv and metadata.json (slices, models, price table, runs). Example sessions:
    • output/runs/bench-cost-perf-ocr-20251124c/bench_metrics.csv
    • output/runs/bench-cost-perf-text-20251124e/bench_metrics.csv

Pipeline visibility dashboard

  • Serve from repo root: python -m http.server 8000 then open http://localhost:8000/docs/pipeline-visibility.html.
  • The page polls output/run_manifest.jsonl for run ids, then reads output/runs/<run_id>/pipeline_state.json and pipeline_events.jsonl for live progress, artifacts, and confidence stats.
  • A ready-to-use fixture run lives at output/runs/dashboard-fixture (listed in the manifest) so you can smoke the dashboard without running the pipeline.

Roadmap (high level)

  • Enrichment (choices, cross-refs, combat/items/endings)
  • Turn-to validator (CYOA), layout-preserving extractor, image cropper/mapper
  • Coarse+fine portionizer; continuation merge
  • AI planner to pick modules/configs based on user goals

Legacy OCR Strategy Choice (Unstructured intake)

Legacy Unstructured intake only. The canonical GPT-5.1 OCR pipeline does not use hi_res/ocr_only strategies.

Legacy recommendation: hi_res on ARM64, ocr_only on x86_64

⚠️ Before choosing a strategy: check your Python architecture (python -c "import platform; print(platform.machine())"). On Apple Silicon Macs, an ARM64 environment may exist even if your current shell is running x86_64 Python.

After comprehensive testing comparing old Tesseract-based OCR with Unstructured strategies (ocr_only vs hi_res):

  • hi_res on ARM64: ~15% faster (88s/page vs 105s/page), extracts ~35% more granular elements (better layout boundaries), same text quality as ocr_only. Use when an ARM64 environment is available (Story 033 complete).
  • ocr_only: More compatible (works on x86_64/Rosetta without JAX), similar text quality, fewer elements. Use as fallback or when maximum compatibility is needed.

Note: OCR text quality is source-limited (scanned PDF quality determines accuracy), so strategy choice primarily affects performance and element granularity, not character recognition accuracy.

Legacy Environment Setup (Unstructured/EasyOCR)

⚠️ IMPORTANT: This section applies to legacy Unstructured/EasyOCR intake only. The canonical GPT-5.1 OCR pipeline runs with requirements.txt on any arch and does not require ARM64/MPS or JAX.

Check Your Environment First

Before assuming x86_64/Rosetta, check if you have an ARM64 environment available:

# Check if ARM64 environment exists
ls -la ~/miniforge3/envs/codex-arm/bin/python 2>/dev/null && echo "ARM64 environment available"

# Check current Python architecture
python -c "import platform; print(f'Machine: {platform.machine()}')"
# ARM64 native: "Machine: arm64"
# x86_64/Rosetta: "Machine: x86_64"

# Check ARM64 environment architecture
~/miniforge3/envs/codex-arm/bin/python -c "import platform; print(f'Machine: {platform.machine()}')" 2>/dev/null
# Should show: "Machine: arm64"

On Apple Silicon (M-series) Macs: You likely have both environments. Always check for ARM64 first and use it for better performance unless you have a specific reason to use x86_64.

x86_64/Rosetta (Legacy default, recommended for quick starts)

The default setup uses x86_64 Python running under Rosetta 2 on Apple Silicon. This is the most stable and compatible option.

When to use:

  • Quick starts and one-off runs
  • When you need maximum compatibility
  • When ocr_only OCR strategy is sufficient

OCR Strategy:

  • Uses ocr_only (JAX unavailable under Rosetta, so hi_res not possible)
  • Note: If you're on Apple Silicon but using x86_64 Python, check whether an ARM64 environment exists and use that instead

Limitations:

  • Cannot use hi_res OCR strategy (requires JAX, which has AVX incompatibilities under Rosetta)
  • Slower performance (~3-5 minutes/page for OCR)
  • No GPU acceleration

ARM64 Native (Legacy recommended for heavy workloads)

For repeated processing or when you need hi_res OCR with table structure inference, use native ARM64 with JAX/Metal GPU acceleration.

Setup:

  1. Install Miniforge (ARM64):
    wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
    bash Miniforge3-MacOSX-arm64.sh -b -p ~/miniforge3
  2. Create ARM64 environment:
    ~/miniforge3/bin/conda create -n codex-arm python=3.11 -y
    ~/miniforge3/envs/codex-arm/bin/pip install -r requirements.txt
  3. Install JAX with Metal support:
    ~/miniforge3/envs/codex-arm/bin/pip install jax-metal
  4. Fix pdfminer compatibility (required for unstructured):
    ~/miniforge3/envs/codex-arm/bin/pip install "pdfminer.six==20240706"
  5. Verify JAX/Metal:
    ~/miniforge3/envs/codex-arm/bin/python -c "import jax; print(jax.devices())"
    # Should show: [METAL(id=0)]

Activation:

source ~/miniforge3/bin/activate
conda activate codex-arm

When to use:

  • Processing many PDFs regularly
  • Books with complex tables/layouts where hi_res helps
  • When you want GPU acceleration (2-5× faster than x86_64/Rosetta)
  • New machine/environment setup from scratch

OCR Strategy:

  • Recommended: hi_res (~15% faster, better element boundaries)
  • Fallback: ocr_only if needed

Performance:

  • hi_res OCR: ~88s/page (tested on M4 Pro, pages 16-18)
  • ocr_only OCR: ~105s/page (ARM64 native, no JAX)
  • Expected 2-5× speedup over x86_64/Rosetta for hi_res workloads

Known issues:

  • numpy version conflict: jax-metal requires numpy>=2.0, but unstructured requires numpy<2 (works despite warning)
  • pdfminer.six must be pinned to 20240706 for unstructured 0.16.9 compatibility

Rollback: Simply use your existing x86_64 environment. Miniforge and Miniconda can coexist.

Dev notes

  • Requires Tesseract installed/on PATH.
  • Models configurable; defaults use gpt-4.1-mini with --boost_model gpt-5.
  • Artifacts are JSON/JSONL; runs are append-only and reproducible via configs.
  • Driver unit tests run in CI via tests.yml. Run locally with:
    python -m unittest discover -s tests -p "driver_*test.py"
