A small, local PDF utility for repeatable document prep workflows (especially “scan → images → OCR-ready page images”). Built to be CLI-first, safe by default, and auditable via per-command JSON manifests.
This project is intentionally lightweight: once dependencies are installed, it runs fully offline.
Optional thin UI wrapper: pdf-toolkit-obsidian-plugin (keeps the CLI as the contract; the plugin just calls the commands).
This started as a local-first alternative to subscription PDF tooling and untrusted freeware. I wanted a small, offline CLI pipeline for preparing scanned PDFs into OCR-ready page images (rotate, render, crop, split spreads) with deterministic outputs.
The tool aims to make workflows:
- Predictable (deterministic naming + config precedence)
- Safe (explicit overwrite, dry-run, clear output locations)
- Hand-off friendly (JSON manifest records inputs/options/outputs/actions)
- Easy to integrate (CLI contract usable by scripts or a thin UI wrapper)
Features:
- Render PDF pages to PNGs (PyMuPDF)
- Split a PDF into multiple PDFs
- Rotate PDF pages or rotate PNGs (Pillow)
- Prepare “page-images”: split spread scans into single pages + crop page bounds (Pillow)
- Safe defaults with --dry-run and --overwrite
- JSON manifest written for each command (inputs/options/outputs + action log)
- Create and activate a virtual environment (optional but recommended):
python -m venv .venv
.venv\Scripts\Activate.ps1
- Install dependencies:
pip install -r requirements.txt
- Install in editable mode so python -m pdf-toolkit works:
pip install -e .
If you prefer not to install it, you can temporarily set PYTHONPATH:
$env:PYTHONPATH = "src"
See all commands:
python -m pdf-toolkit --help
Recommended pipeline (typical scan/OCR prep):
render -> page-images
Example:
python -m pdf-toolkit render --pdf "in.pdf" --out_dir "out\pages" --dpi 300 --format png --prefix "book"
python -m pdf-toolkit page-images --in_dir "out\pages" --out_dir "out\pages_single" --glob "*.png" --mode auto --debug
Render a PDF to PNGs:
python -m pdf-toolkit render --pdf "in.pdf" --out_dir "out\pages" --dpi 300 --format png --prefix "book1"
Dry-run (no files written):
python -m pdf-toolkit render --pdf "in.pdf" --out_dir "out\pages" --pages "1-10,15" --dry-run
Output naming is predictable:
book1_p0001.png, book1_p0002.png, etc.
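A script that consumes the rendered pages can predict these names ahead of time. The following is a minimal sketch, assuming the page number is zero-padded to four digits as the example names suggest; `render_name` is a hypothetical helper and not part of the toolkit.

```python
def render_name(prefix: str, page: int, ext: str = "png") -> str:
    """Return the expected file name for a rendered page (1-based page number).

    Assumption: four-digit zero padding, inferred from book1_p0001.png.
    """
    if page < 1:
        raise ValueError("page numbers are 1-based")
    return f"{prefix}_p{page:04d}.{ext}"


names = [render_name("book1", n) for n in (1, 2, 15)]
print(names)
```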
Split with explicit ranges:
python -m pdf-toolkit split --pdf "in.pdf" --out_dir "out\splits" --ranges "1-120,121-240" --prefix "book"
Split with automatic chunking:
python -m pdf-toolkit split --pdf "in.pdf" --out_dir "out\splits" --pages_per_file 120 --prefix "book"
Outputs:
book_part01.pdf, book_part02.pdf, etc.
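To see which pages would land in which part, automatic chunking can be sketched as follows. This is an illustration only: `chunk_ranges` and `part_name` are hypothetical helpers, the two-digit part index is inferred from the example names, and the toolkit's own handling of the final short chunk may differ.

```python
def chunk_ranges(total_pages: int, pages_per_file: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive inclusive (start, end) ranges."""
    if total_pages < 1 or pages_per_file < 1:
        raise ValueError("total_pages and pages_per_file must be positive")
    return [
        (start, min(start + pages_per_file - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_file)
    ]


def part_name(prefix: str, index: int) -> str:
    """Assumed naming: two-digit, 1-based part index, as in book_part01.pdf."""
    return f"{prefix}_part{index:02d}.pdf"


for i, (start, end) in enumerate(chunk_ranges(250, 120), start=1):
    print(part_name("book", i), f"pages {start}-{end}")
```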
Rotate PDF pages:
python -m pdf-toolkit rotate pdf --pdf "in.pdf" --out_pdf "in_rotated.pdf" --degrees 90 --pages "all"
In-place (overwrites input):
python -m pdf-toolkit rotate pdf --pdf "in.pdf" --out_pdf "in.pdf" --degrees 180 --pages "1-5" --inplace --overwrite
Rotate images:
python -m pdf-toolkit rotate images --in_dir "out\pages" --glob "*.png" --degrees 90 --out_dir "out\pages_rot"
In-place (overwrites files):
python -m pdf-toolkit rotate images --in_dir "out\pages" --glob "*.png" --degrees 90 --out_dir "out\pages" --inplace --overwrite
Page-images, auto mode (split if wide enough, otherwise crop-only):
python -m pdf-toolkit page-images --in_dir "out\pages" --out_dir "out\pages_single" --glob "*.png" --mode auto --debug
Always split:
python -m pdf-toolkit page-images --in_dir "out\pages" --out_dir "out\pages_single" --mode split --overwrite
Never split (crop-only):
python -m pdf-toolkit page-images --in_dir "out\pages" --out_dir "out\pages_single" --mode crop
Useful tuning flags:
- --gutter_trim_px: shave pixels from both sides of the detected gutter when splitting spreads
- --edge_inset_px: inset the final padded crop box inward to remove faint border noise
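One way to picture the two knobs is as plain box arithmetic. The sketch below is an assumption about the geometry, not the toolkit's actual code: boxes are `(left, top, right, bottom)` pixel tuples as accepted by Pillow's `Image.crop`, `gutter_x` stands for the detected gutter column, and content cropping and padding are left out. `split_boxes` and `inset_box` are hypothetical names.

```python
Box = tuple[int, int, int, int]  # (left, top, right, bottom), as used by Pillow's Image.crop


def split_boxes(width: int, height: int, gutter_x: int, gutter_trim_px: int = 0) -> tuple[Box, Box]:
    """Return left/right page boxes, trimming gutter_trim_px on each side of the gutter."""
    left = (0, 0, gutter_x - gutter_trim_px, height)
    right = (gutter_x + gutter_trim_px, 0, width, height)
    return left, right


def inset_box(box: Box, edge_inset_px: int = 0) -> Box:
    """Shrink a crop box inward by edge_inset_px on every side."""
    l, t, r, b = box
    inset = (l + edge_inset_px, t + edge_inset_px, r - edge_inset_px, b - edge_inset_px)
    if inset[0] >= inset[2] or inset[1] >= inset[3]:
        raise ValueError("edge_inset_px is too large for this box")
    return inset


# A 3000x2000 spread with the gutter detected at x=1500 (illustrative numbers).
left, right = split_boxes(width=3000, height=2000, gutter_x=1500, gutter_trim_px=20)
print(inset_box(left, 6), inset_box(right, 6))
```

Larger `--gutter_trim_px` values remove more of the binding shadow at the cost of inner-margin content, so it is worth checking the `--debug` output before raising it.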
Example with both knobs:
python -m pdf-toolkit page-images --in_dir "out\pages" --out_dir "out\pages_single" --mode split --gutter_trim_px 20 --edge_inset_px 6 --debug
Dump the default YAML config:
python -m pdf-toolkit page-images --dump-default-config
Use a config file:
python -m pdf-toolkit page-images --in_dir "out\pages" --out_dir "out\pages_single" --config "configs\page_images.default.yaml"
Precedence is deterministic:
built-in defaults < YAML config < explicitly provided CLI flags
This means optional CLI defaults do not overwrite YAML values unless the flag is explicitly passed.
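The precedence rule can be sketched as a three-layer merge. This is a minimal illustration, assuming unset CLI flags are represented as `None`; `DEFAULTS` and `resolve_options` are hypothetical names, and the values are illustrative rather than the toolkit's real defaults.

```python
from typing import Any

# Hypothetical built-in defaults (illustrative values only).
DEFAULTS: dict[str, Any] = {"mode": "auto", "split_ratio": 1.25, "pad_px": 20}


def resolve_options(defaults: dict[str, Any], yaml_cfg: dict[str, Any],
                    cli_args: dict[str, Any]) -> dict[str, Any]:
    """Merge option layers: defaults < YAML < explicitly passed CLI flags.

    cli_args uses None for flags the user did not pass, so an unset flag
    never overrides a YAML value.
    """
    merged = dict(defaults)
    merged.update(yaml_cfg)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


yaml_cfg = {"split_ratio": 1.4, "pad_px": 12}
cli_args = {"mode": None, "split_ratio": None, "pad_px": 30}  # only --pad_px was passed
options = resolve_options(DEFAULTS, yaml_cfg, cli_args)
print(options)
```

With `argparse`, declaring optional flags with `default=None` is one common way to tell "not passed" apart from "passed with the default value".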
Supported YAML shapes:
Root form:
mode: auto
split_ratio: 1.25
crop_threshold: 180
pad_px: 20
Wrapped form:
page_images:
  mode: auto
  split_ratio: 1.25
  gutter_search_frac: 0.35
  crop_threshold: 180
  min_area_frac: 0.25
Pages are 1-based for user input:
- all
- 1-10
- 1-10,15,20-25
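For scripts that need to validate a page spec before calling the CLI, the accepted forms can be expanded like this. It is a sketch under assumptions: `parse_pages` is a hypothetical helper, and the toolkit's own parser may treat whitespace, duplicates, or out-of-range pages differently.

```python
def parse_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a page spec such as "all" or "1-10,15,20-25" into 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total_pages + 1))
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        start_s, sep, end_s = part.partition("-")
        start = int(start_s)
        end = int(end_s) if sep else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"invalid page range: {part!r}")
        pages.extend(range(start, end + 1))
    return pages


pages = parse_pages("1-3,15,20-22", total_pages=30)
print(pages)                   # 1-based, as typed by the user
print([p - 1 for p in pages])  # 0-based indices, as PyMuPDF expects for page access
```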
Each command writes a JSON manifest describing:
- Inputs, outputs, options
- Actions taken (written, skipped, dry-run)
- Timestamps and logs
For page-images, each action's outputs list the written files, plus split/crop metadata (e.g., gutter_x, bboxes, spread detection notes).
Example output list:
["out/pages_single/book_p0001_L.png", "out/pages_single/book_p0001_R.png"]
By default the manifest is written to:
- Render: out_dir\manifest.json
- Split: out_dir\manifest.json
- Rotate PDF: out_pdf folder\manifest.json
- Rotate images: out_dir\manifest.json
- Page-images: out_dir\manifest.json
Note: --dry-run skips writing the manifest (it is treated like an output file).
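The manifest behavior described above, including the dry-run rule, can be sketched as follows. Every key name in this example (`command`, `created_at`, `inputs`, `options`, `actions`, `status`, `outputs`) is hypothetical, as is `write_manifest`; check a real manifest.json for the actual schema before relying on it.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_manifest(out_dir: Path, command: str, inputs: dict, options: dict,
                   actions: list[dict], dry_run: bool = False) -> dict:
    """Build a manifest dict and write out_dir/manifest.json unless dry_run is set.

    All key names here are hypothetical; the real manifest schema may differ.
    """
    manifest = {
        "command": command,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,
        "options": options,
        "actions": actions,
    }
    if not dry_run:  # the manifest is treated like any other output file
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "pages_single"
    actions = [{"status": "written",
                "outputs": ["out/pages_single/book_p0001_L.png",
                            "out/pages_single/book_p0001_R.png"]}]
    write_manifest(out_dir, "page-images", {"in_dir": "out/pages"}, {"mode": "auto"},
                   actions, dry_run=True)
    dry_run_wrote_file = (out_dir / "manifest.json").exists()
    write_manifest(out_dir, "page-images", {"in_dir": "out/pages"}, {"mode": "auto"}, actions)
    loaded = json.loads((out_dir / "manifest.json").read_text(encoding="utf-8"))

print(dry_run_wrote_file, loaded["actions"][0]["status"])
```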
Run the minimal unit tests:
python -m unittest discover -s tests -p "test_*.py"