Internet Archive → OCR (OpenAI Vision) → Markdown → TXT/EPUB → Edge TTS → M4B.
Features:
- Automates IA BookReader capture with Playwright (login/borrow flow, 2-up theater view).
- OCR via OpenAI Vision (gpt-4o) with concurrency/backoff, written to content.json.
- Exports clean Markdown and unwrapped TXT (no mid-sentence pauses in TTS).
- Synthesizes the audiobook with Microsoft Edge TTS (via aedocw/epub2tts-edge) and embeds a cover.
- Resume-safe: set SKIP_IA=1 to reuse captured pages/OCR and just regenerate text/audio.
Important: Respect the Internet Archive’s Terms of Service. Borrow books legitimately and do not bypass protections.
- Node.js >= 18 and pnpm
- Python 3.12+ with venv
- ffmpeg, pandoc, jq
- Playwright Chromium (pnpm playwright:install)
- OpenAI API key (for OCR)
macOS (Homebrew):
brew install ffmpeg pandoc jq python
Linux (apt-based):
sudo apt-get update && sudo apt-get install -y ffmpeg pandoc jq python3-venv
- Install dependencies:
pnpm install
pnpm playwright:install
- Copy .env.example to .env and set variables:
  - IA_EMAIL, IA_PASSWORD (IA account credentials)
  - OPENAI_API_KEY
  - IA_ID or IA_URL
- Run the full pipeline (capture → OCR → export → TTS):
IA_ID=<internet_archive_id> bin/ia2audio.sh
To resume from existing outputs and only rebuild TXT/EPUB/audio:
SKIP_IA=1 IA_ID=<internet_archive_id> bin/ia2audio.sh
Outputs are written under out/<IA_ID>/, including book.md, book.txt, book.epub, and audio/<IA_ID>.m4b (or .m4a).
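The output layout described above can be sketched as a small path helper. This is an illustration only: the function name `outputPaths` and the field names are hypothetical, and only the file names themselves come from the description above.

```typescript
// Hypothetical helper: maps an Internet Archive ID to the output layout
// described above. Field names are illustrative, not taken from the repo.
interface OutputPaths {
  root: string;
  markdown: string;
  text: string;
  epub: string;
  audio: string;
}

function outputPaths(iaId: string, audioExt: "m4b" | "m4a" = "m4b"): OutputPaths {
  const root = `out/${iaId}`;
  return {
    root,
    markdown: `${root}/book.md`,
    text: `${root}/book.txt`,
    epub: `${root}/book.epub`,
    // The audio file is named after the IA ID and lives in an audio/ subfolder.
    audio: `${root}/audio/${iaId}.${audioExt}`,
  };
}
```

For example, `outputPaths("example-id").audio` evaluates to `out/example-id/audio/example-id.m4b`.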
- IA: IA_ID, IA_URL, IA_EMAIL, IA_PASSWORD
- Capture: IA_START_PAGE (default 1), IA_RESUME (0/1), IA_MAX_PAGES, IA_PAGE_DELAY_MS, IA_TILE_STABLE_MS, IA_SPINNER_TIMEOUT_MS, IA_MAX_RETRIES
- OCR: OPENAI_API_KEY, OCR_CONCURRENCY, OCR_MAX_RETRIES
- TTS: VOICE (default en-US-AndrewNeural), EDGE_TTS_CONCURRENCY, EDGE_TTS_RETRIES, EDGE_TTS_BACKOFF
- Orchestration: SKIP_IA=1 to reuse captured pages/OCR
See .env.example for a starter template.
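To make the variable handling concrete, here is a minimal sketch of reading a few of these settings with their documented defaults. The variable names and the two defaults (IA_START_PAGE = 1, VOICE = en-US-AndrewNeural) come from the list above; the `PipelineConfig` shape, the helper names, the error handling, and treating an unset IA_RESUME as off are assumptions, not the project's actual code.

```typescript
// Sketch only: interface, helper names, and validation are assumptions.
type Env = Record<string, string | undefined>;

interface PipelineConfig {
  startPage: number;            // IA_START_PAGE, documented default 1
  resume: boolean;              // IA_RESUME (0/1); unset treated as off (assumption)
  maxPages: number | undefined; // IA_MAX_PAGES, no documented default
  voice: string;                // VOICE, documented default en-US-AndrewNeural
  skipIa: boolean;              // SKIP_IA=1 reuses captured pages/OCR
}

// Reads an integer variable; returns undefined when it is unset or blank.
function readInt(env: Env, key: string): number | undefined {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n)) {
    throw new Error(`${key} must be an integer, got "${raw}"`);
  }
  return n;
}

function loadConfig(env: Env): PipelineConfig {
  return {
    startPage: readInt(env, "IA_START_PAGE") ?? 1,
    resume: env["IA_RESUME"] === "1",
    maxPages: readInt(env, "IA_MAX_PAGES"),
    voice: env["VOICE"] ?? "en-US-AndrewNeural",
    skipIa: env["SKIP_IA"] === "1",
  };
}
```

In a Node.js script this would typically be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.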
Capture (src/extract-ia-book.ts):
- Logs in and tries multiple ‘Borrow’ variants.
- Forces 2-up + theater view to capture full spreads (inner element, no chrome).
- Waits for spinners to vanish, images to load, and a tile-stability window.
- Advances a single spread at a time, with verification to avoid double-turns.
- Change detection (tile signature + SHA1) prevents duplicate screenshots.
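The change-detection idea in the capture step can be sketched as follows. This is a guess at the general shape, not the project's implementation. It assumes the caller has already computed a "tile signature" (for instance, the visible tile image URLs joined together) and a hex SHA-1 of the screenshot bytes. It also assumes a spread counts as a duplicate when either value matches the last accepted capture. The names `SpreadFingerprint` and `ChangeDetector` are hypothetical.

```typescript
// Hypothetical types: the source only says "tile signature + SHA1".
interface SpreadFingerprint {
  tileSignature: string; // assumption: e.g. visible tile image URLs joined together
  sha1: string;          // hex SHA-1 of the screenshot bytes, computed by the caller
}

class ChangeDetector {
  private last: SpreadFingerprint | undefined;

  // Returns true and records the fingerprint when the spread differs from the
  // last accepted one; returns false (and records nothing) for a duplicate.
  accept(fp: SpreadFingerprint): boolean {
    const prev = this.last;
    const duplicate =
      prev !== undefined &&
      (prev.tileSignature === fp.tileSignature || prev.sha1 === fp.sha1);
    if (duplicate) return false;
    this.last = fp;
    return true;
  }
}
```

A capture loop would call `accept` after each screenshot and save the image only when it returns true. Whether the real code requires one match or both to skip a page is not stated in this document.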
OCR (src/transcribe-book-content.ts):
- Sends each page image to OpenAI Vision (gpt-4o) with concurrency/backoff.
- Creates content.json, sorted by capture index and then page.
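The "concurrency/backoff" pattern used for OCR can be illustrated with a generic sketch: a bounded-concurrency mapper combined with exponential-backoff retries. The function names, the 500 ms base delay, and the 8 s cap are assumptions chosen for illustration. The actual values and structure in src/transcribe-book-content.ts may differ.

```typescript
// Exponential backoff delay for a zero-based attempt number, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs task, retrying up to maxRetries times after a failure with growing delays.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries: number,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted: surface the error
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
}

// Maps items through worker with at most `concurrency` calls in flight.
// Results are returned in input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i] as T, i);
    }
  }
  const laneCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage sketch: `ocrPage` is a hypothetical function that calls the vision API.
// const texts = await mapWithConcurrency(pagePaths, 4, (p) =>
//   withRetry(() => ocrPage(p), 3));
```

In this arrangement, OCR_CONCURRENCY would map to the `concurrency` argument and OCR_MAX_RETRIES to `maxRetries`. That mapping is an inference from the variable names.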
Export (src/export-book-markdown.ts):
- Builds metadata.json if missing; exports clean Markdown.
- Folds single newlines and preserves paragraph breaks across spreads.
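The newline folding done at export time can be approximated with the sketch below. It is a simplified illustration under stated assumptions: paragraphs are separated by blank lines, every single newline inside a paragraph is a hard wrap, and the function name `foldSingleNewlines` is hypothetical. It does not handle Markdown headings or lists, hyphenated line breaks, or the joining of paragraphs that continue across spreads.

```typescript
// Folds hard-wrapped lines into single-line paragraphs.
// Blank lines (including whitespace-only ones) are kept as paragraph breaks.
function foldSingleNewlines(text: string): string {
  return text
    .replace(/\r\n/g, "\n") // normalize Windows line endings
    .split(/\n\s*\n/)       // split on one or more blank lines
    .map((para) =>
      para
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
        .join(" "),         // a single newline becomes a space
    )
    .filter((para) => para.length > 0)
    .join("\n\n");
}
```

For example, `foldSingleNewlines("The quick\nbrown fox.\n\nNext\nparagraph.")` returns `"The quick brown fox.\n\nNext paragraph."`.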
TTS (bin/ia2audio.sh):
- Converts Markdown → TXT with --wrap=none (prevents mid-sentence pauses).
- Runs Edge TTS via epub2tts-edge in a local virtualenv.
- Embeds a cover image derived from the first page.
The wrapper will bootstrap epub2tts-edge automatically:
scripts/install_edge_tts.sh
It clones https://github.com/aedocw/epub2tts-edge and installs it into .venv.
- Stuck spinner or missing BookReader: increase IA_SPINNER_TIMEOUT_MS; ensure the URL includes view=theater&mode=2up.
- Double page turns: the capture loop verifies a single advance and retries.
- Mid-sentence pauses: ensure the TXT is unwrapped (pandoc --wrap=none).
- Resume at wrong page: set IA_RESUME=0 and IA_START_PAGE=1 (default) to override IA resume.
Apache-2.0. See LICENSE.