into-md

A CLI tool that fetches web pages and converts them to clean markdown, optimized for providing context to LLMs.

Installation

# Install globally from npm with bun
bun add -g into-md
# or
npm install -g into-md
# or
yarn global add into-md

Usage

into-md <url>

By default, into-md auto-detects whether a page needs a headless browser. It fetches with a static HTTP request first, inspects the result for SPA signals, and falls back to Playwright if needed.

Examples

# Auto-detect (default) — static fetch, falls back to headless if needed
into-md https://example.com/article

# Force headless browser rendering (skip auto-detect probe)
into-md https://spa-site.com/page --js

# Force static HTTP fetch (never launch a browser)
into-md https://example.com/page --no-js

# Skip content extraction, convert full page
into-md https://example.com --raw

# With authentication cookies
into-md https://private-site.com/page --cookies cookies.txt

# Verbose output (includes auto-detect decisions)
into-md https://example.com/article -v

Options

| Flag | Description | Default |
| --- | --- | --- |
| `-o, --output <file>` | Write output to file instead of stdout | stdout |
| `--js` | Force headless browser rendering (skip auto-detect) | auto-detect |
| `--no-js` | Force static HTTP fetch (never launch a browser) | auto-detect |
| `--raw` | Skip content extraction, convert entire HTML | disabled |
| `--cookies <file>` | Path to cookies file for authenticated requests | none |
| `--user-agent <string>` | Custom User-Agent header | browser-like UA |
| `--encoding <encoding>` | Force character encoding (auto-detected by default) | auto |
| `--strip-links` | Remove hyperlinks, keep only anchor text | disabled |
| `--exclude <selectors>` | CSS selectors to exclude (comma-separated) | none |
| `--timeout <ms>` | Request timeout in milliseconds | 30000 |
| `--no-cache` | Bypass response cache | cache enabled |
| `-v, --verbose` | Show detailed progress information | minimal |
| `-h, --help` | Show help | - |
| `--version` | Show version | - |

--js and --no-js are mutually exclusive — passing both is an error.
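
As a rough illustration of that rule, the two rendering flags can be reduced to a single mode before any fetching starts. This is a minimal sketch, not the tool's actual code: `resolveRenderMode` and the `RenderMode` type are hypothetical names, and the real CLI parses its arguments with commander rather than scanning the raw argument list.

```typescript
// Hypothetical type: the three rendering modes implied by the Options table.
type RenderMode = "auto" | "headless" | "static";

// Hypothetical helper: maps the two rendering flags to a single mode.
function resolveRenderMode(args: string[]): RenderMode {
  const js = args.includes("--js");
  const noJs = args.includes("--no-js");
  // Passing both flags is treated as a usage error.
  if (js && noJs) {
    throw new Error("--js and --no-js are mutually exclusive");
  }
  if (js) return "headless";
  if (noJs) return "static";
  return "auto"; // no rendering flag: fall through to auto-detect
}
```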

Auto-Detect

When no rendering flag is passed, into-md runs a two-stage heuristic to decide whether the page needs a headless browser:

Stage 1: Raw HTML inspection (before Readability extraction):

- Checks for empty SPA root divs (#root, #app, #__next, #__nuxt, #__svelte)
- Checks for <noscript> tags mentioning "javascript" combined with a near-empty body (< 100 chars of body text)

Stage 2: Post-extraction content threshold (after Readability):

- Falls back to headless if extracted text is < 200 characters and contains no structural HTML tags (<article>, <p>, <pre>, <li>, <h1>–<h6>)
- Skipped when --raw is passed

Non-HTML responses (PDFs, images, etc.) skip auto-detect entirely and use static processing.
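
The two stages described in this section can be approximated in a few lines. The sketch below is an illustration under stated assumptions, not the project's implementation: it uses regular expressions instead of jsdom, the function names (`visibleText`, `stage1NeedsBrowser`, `stage2NeedsBrowser`) are hypothetical, and it assumes the structural-tag check in stage 2 applies to the extracted HTML.

```typescript
// Hypothetical sketch of the two-stage heuristic; names and regexes are assumptions.
const SPA_ROOT_IDS = ["root", "app", "__next", "__nuxt", "__svelte"];

// Strip scripts, styles, and tags, then collapse whitespace to approximate visible text.
function visibleText(html: string): string {
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Stage 1: inspect the raw HTML before any content extraction.
function stage1NeedsBrowser(rawHtml: string): boolean {
  // An SPA root div with nothing (or only whitespace) inside it.
  const emptyRoot = SPA_ROOT_IDS.some((id) =>
    new RegExp(`<div[^>]*\\bid=["']${id}["'][^>]*>\\s*</div>`, "i").test(rawHtml)
  );
  if (emptyRoot) return true;

  // A <noscript> mentioning "javascript" plus a near-empty body (< 100 chars).
  const noscript = /<noscript\b[^>]*>([\s\S]*?)<\/noscript>/i.exec(rawHtml);
  const mentionsJs = /javascript/i.test(noscript?.[1] ?? "");
  const body = /<body\b[^>]*>([\s\S]*?)<\/body>/i.exec(rawHtml);
  const bodyText = visibleText(body?.[1] ?? rawHtml);
  return mentionsJs && bodyText.length < 100;
}

// Stage 2: inspect the extracted article HTML (skipped when --raw is passed).
function stage2NeedsBrowser(extractedHtml: string, raw: boolean): boolean {
  if (raw) return false;
  const hasStructure = /<(article|p|pre|li|h[1-6])\b/i.test(extractedHtml);
  return visibleText(extractedHtml).length < 200 && !hasStructure;
}
```

In this sketch, a page would be re-fetched with the headless browser when either function returns true.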

Output Format

Strategy Line

A strategy summary is printed to stderr on every run:

Strategy: static
Strategy: headless
Strategy: auto > static      # auto-detect chose static
Strategy: auto > headless     # auto-detect fell back to headless

Frontmatter

Standard metadata is included as YAML frontmatter:

---
title: "Article Title"
description: "Meta description from the page"
author: "Author Name"
date: "2024-01-15"
strategy: "auto>headless"
source: "https://example.com/article"
---

The strategy field records how the page was fetched: static, headless, auto>static, or auto>headless.
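
For readers who want to produce or check this block programmatically, the frontmatter could be assembled roughly as follows. This is a sketch under assumptions: the `PageMeta` interface and `toFrontmatter` function are hypothetical names, and omitting fields that are missing from the page is a guess about the behavior, not something this README states.

```typescript
// Hypothetical types mirroring the frontmatter fields shown above.
type Strategy = "static" | "headless" | "auto>static" | "auto>headless";

interface PageMeta {
  title?: string;
  description?: string;
  author?: string;
  date?: string;
  strategy: Strategy;
  source: string;
}

// Build a YAML frontmatter block; fields that are undefined are left out (assumption).
function toFrontmatter(meta: PageMeta): string {
  const keys = ["title", "description", "author", "date", "strategy", "source"] as const;
  const lines = keys
    .filter((key) => meta[key] !== undefined)
    // JSON.stringify yields a double-quoted, escaped string, which is also a valid YAML scalar.
    .map((key) => `${key}: ${JSON.stringify(meta[key])}`);
  return ["---", ...lines, "---"].join("\n");
}
```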

Tables

Tables are converted to fenced JSON blocks for reliable LLM parsing:

{
  "caption": "Quarterly Revenue",
  "headers": ["Quarter", "Revenue", "Growth"],
  "rows": [
    { "Quarter": "Q1", "Revenue": "$1.2M", "Growth": "12%" },
    { "Quarter": "Q2", "Revenue": "$1.5M", "Growth": "25%" }
  ]
}
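
The shape above can be produced from a header row and a list of cell rows. The following is a minimal sketch of that mapping, assuming missing cells become empty strings; `TableJson`, `tableToJson`, and `renderTable` are hypothetical names, and the real converter works on parsed HTML rather than plain arrays.

```typescript
// Hypothetical type matching the JSON shape shown above.
interface TableJson {
  caption?: string;
  headers: string[];
  rows: Record<string, string>[];
}

// Turn a header row plus cell rows into one object per row, keyed by header.
function tableToJson(headers: string[], cells: string[][], caption?: string): TableJson {
  const rows = cells.map((row) =>
    // Missing cells become empty strings (an assumption of this sketch).
    Object.fromEntries(headers.map((header, i) => [header, row[i] ?? ""]))
  );
  return caption === undefined ? { headers, rows } : { caption, headers, rows };
}

// Wrap the JSON in a fenced block so it can be embedded in the markdown output.
function renderTable(table: TableJson): string {
  const fence = "`".repeat(3);
  return `${fence}json\n${JSON.stringify(table, null, 2)}\n${fence}`;
}
```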

Caching

Responses are cached in ~/.cache/into-md/ with a 1-hour TTL. Cache entries store the strategy used (static or headless), so auto-detect can skip re-probing on repeat visits. Use --no-cache to bypass.

When a forced flag (--js or --no-js) doesn't match the cached strategy, the cache is bypassed and the page is re-fetched.
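
The caching rules above amount to three checks: cache enabled, entry still fresh, and cached strategy compatible with any forced flag. A sketch of that decision, with hypothetical names (`CacheEntry`, `usableEntry`) and an in-memory entry standing in for whatever is stored on disk:

```typescript
type FetchStrategy = "static" | "headless";

// Hypothetical cache record; the on-disk format is not described in this README.
interface CacheEntry {
  strategy: FetchStrategy;
  fetchedAt: number; // epoch milliseconds
  body: string;
}

const TTL_MS = 60 * 60 * 1000; // 1-hour TTL, as stated above

// Returns the entry if it can be reused, or null if the page must be re-fetched.
function usableEntry(
  entry: CacheEntry | undefined,
  forced: FetchStrategy | undefined, // "headless" for --js, "static" for --no-js
  now: number,
  noCache = false
): CacheEntry | null {
  if (noCache || entry === undefined) return null; // --no-cache or nothing cached
  if (now - entry.fetchedAt >= TTL_MS) return null; // entry has expired
  if (forced !== undefined && forced !== entry.strategy) return null; // forced flag mismatch
  return entry; // reuse, which also lets auto-detect skip its probe
}
```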

Playwright & Browser Binaries

Playwright is a required dependency, but browser binaries are downloaded on first use. If the binaries are missing when a headless render is needed:

- Interactive terminal: prompts to run bunx playwright install chromium
- Non-interactive / CI: exits with an error and instructions

Development

bun install              # Install dependencies
bun run build            # Build the CLI
bun run build:watch      # Build with watch mode
bun run test             # Run tests
bun run check            # Check for lint errors
bun run fix              # Auto-fix lint errors
bun run typecheck        # Type check

Tech Stack

Runtime: Bun
Language: TypeScript
HTML Parsing: cheerio
DOM: jsdom (auto-detect heuristics)
Markdown Conversion: turndown
Content Extraction: @mozilla/readability
Headless Browser: playwright
CLI Framework: commander

License

MIT
