Pull documentation from the web and convert to clean markdown. Perfect for building AI training data, local docs, or Claude Code skills.
- **Fast**: parallel fetching with configurable workers
- **Simple**: one command to pull entire documentation sites
- **Clean**: converts HTML to markdown with YAML frontmatter
- **Smart**: skips already-fetched files on re-runs
- **Ready**: pre-built fetchers for popular documentation sites
```bash
# Install
pip install docpull

# Pull documentation
docpull --source stripe --output-dir ./docs
docpull --source nextjs --output-dir ./docs
```

| Source | Description |
|---|---|
| stripe | Stripe API & payment documentation |
| nextjs | Next.js framework documentation |
| plaid | Plaid banking API documentation |
| bun | Bun runtime documentation |
| d3 | D3.js data visualization library |
| tailwind | Tailwind CSS framework |
| react | React JavaScript library |
```bash
# Basic installation
pip install docpull

# With YAML config support
pip install docpull[yaml]
```

```bash
# Basic usage
docpull --source stripe --output-dir ./docs

# Multiple sources with config file
docpull --config config.yaml

# Custom rate limit (seconds between requests)
docpull --source nextjs --rate-limit 1.0

# Preview without downloading
docpull --source react --dry-run
```

```python
from docpull import StripeFetcher

fetcher = StripeFetcher(
    output_dir="./docs",
    rate_limit=0.5,
    skip_existing=True,
)
fetcher.fetch()
```

Create `config.yaml`:
```yaml
output_dir: ./docs
rate_limit: 0.5
skip_existing: true
log_level: INFO
sources:
  - stripe
  - nextjs
  - react
```

Run with:

```bash
docpull --config config.yaml
```

Each page is saved as markdown with YAML frontmatter:
```markdown
---
url: https://stripe.com/docs/payments
fetched: 2025-11-07
---

# Payment Intents

Your documentation content here...
```

Files are organized by URL structure:
```
docs/
├── stripe/
│   ├── api/
│   │   ├── charges.md
│   │   └── customers.md
│   └── payments/
│       └── payment-intents.md
└── nextjs/
    ├── app/
    │   └── routing.md
    └── pages/
        └── api-routes.md
```
```python
from docpull.fetchers.base import BaseFetcher

class MyDocsFetcher(BaseFetcher):
    def __init__(self, output_dir="./docs/mydocs", **kwargs):
        super().__init__(output_dir=output_dir, **kwargs)
        self.base_url = "https://docs.example.com"

    def fetch(self):
        urls = self.fetch_sitemap(f"{self.base_url}/sitemap.xml")
        for url in urls:
            output_path = self.url_to_filepath(url)
            self.process_url(url, output_path)
```

For parallel fetching, extend `ParallelBaseFetcher` instead. See the examples for more.
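The worker-pool pattern behind parallel fetching can be sketched with the standard library; this is an illustrative stand-in, not `ParallelBaseFetcher` itself, and `fetch_one` is a hypothetical per-URL callback:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch_one, max_workers=4):
    """Process URLs concurrently with a bounded thread pool.

    max_workers plays the role of docpull's configurable worker count.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(fetch_one, urls))

results = fetch_all(["u1", "u2", "u3"], lambda u: f"fetched:{u}")
```

Threads suit this workload because fetching is I/O-bound; rate limiting and skip-existing checks would wrap `fetch_one` in a real fetcher.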
We welcome contributions! See CONTRIBUTING.md for guidelines on:
- Adding new documentation sources
- Reporting bugs and requesting features
- Development setup and workflow
- Changelog - Version history
- Security Policy - Reporting vulnerabilities
- Support Guide - Getting help
- Maintenance - Automated workflows
MIT License - see LICENSE file for details
- PyPI: pypi.org/project/docpull
- GitHub: github.com/raintree-technology/docpull
- Issues: Report a bug
- Pair with: claude-starter - Claude Code template for building AI skills with docpull