docpull

Pull documentation from the web and convert it to clean markdown. Useful for building AI training data, local docs, or Claude Code skills.

Features

Fast - Parallel fetching with configurable workers
Simple - One command to pull entire documentation sites
Clean - Converts HTML to markdown with YAML frontmatter
Smart - Skips already-fetched files on re-runs
Ready - Pre-built fetchers for popular documentation sites

Quick Start

# Install
pip install docpull

# Pull documentation
docpull --source stripe --output-dir ./docs
docpull --source nextjs --output-dir ./docs

Supported Sources

Source	Description
`stripe`	Stripe API & payment documentation
`nextjs`	Next.js framework documentation
`plaid`	Plaid banking API documentation
`bun`	Bun runtime documentation
`d3`	D3.js data visualization library
`tailwind`	Tailwind CSS framework
`react`	React JavaScript library

Installation

# Basic installation
pip install docpull

# With YAML config support (quoted so shells such as zsh do not expand the brackets)
pip install "docpull[yaml]"

Usage

Command Line

# Basic usage
docpull --source stripe --output-dir ./docs

# Multiple sources with config file
docpull --config config.yaml

# Custom rate limit (seconds between requests)
docpull --source nextjs --rate-limit 1.0

# Preview without downloading
docpull --source react --dry-run
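
For readers wondering what a rate limit expressed as seconds between requests amounts to, here is a minimal sketch. It is not docpull's implementation: `MinIntervalLimiter` is a hypothetical class that sleeps until at least `interval` seconds have passed since the previous call.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between successive calls."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


limiter = MinIntervalLimiter(0.05)  # short interval to keep the demo quick
start = time.monotonic()
for page in ["a", "b", "c"]:
    limiter.wait()  # a real fetcher would issue its request after this
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```

The first call returns immediately, so three calls wait through two intervals in total.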

Python API

from docpull import StripeFetcher

fetcher = StripeFetcher(
    output_dir="./docs",
    rate_limit=0.5,
    skip_existing=True
)
fetcher.fetch()

Configuration File

Create config.yaml:

output_dir: ./docs
rate_limit: 0.5
skip_existing: true
log_level: INFO

sources:
  - stripe
  - nextjs
  - react

Run with:

docpull --config config.yaml

Output Format

Each page is saved as markdown with YAML frontmatter:

---
url: https://stripe.com/docs/payments
fetched: 2025-11-07
---

# Payment Intents

Your documentation content here...
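
To consume these files downstream, the frontmatter can be split from the body with a few lines of standard-library Python. The following is a minimal sketch, assuming the frontmatter is a flat block of `key: value` lines delimited by `---` as in the example above. `split_frontmatter` is a hypothetical helper, not part of docpull, and a real YAML parser would be needed for nested values.

```python
def split_frontmatter(text):
    """Split a markdown document into (metadata dict, body string).

    Assumes flat `key: value` frontmatter between two `---` lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter present

    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # opening delimiter was never closed


page = """---
url: https://stripe.com/docs/payments
fetched: 2025-11-07
---

# Payment Intents

Your documentation content here...
"""

meta, body = split_frontmatter(page)
print(meta["url"])
print(body.splitlines()[0])
```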

Files are organized by URL structure:

docs/
├── stripe/
│   ├── api/
│   │   ├── charges.md
│   │   └── customers.md
│   └── payments/
│       └── payment-intents.md
└── nextjs/
    ├── app/
    │   └── routing.md
    └── pages/
        └── api-routes.md
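
The mapping from URL to file path can be pictured roughly as follows. This is an illustrative sketch only: `url_to_relative_path` is a hypothetical function, not docpull's own `url_to_filepath`, and the rules it applies (drop the host, name the site root `index`, strip a trailing `.html` or `.htm`) are assumptions chosen for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url, source):
    """Map a page URL to a relative markdown path under `source`."""
    path = urlparse(url).path.strip("/")
    # An empty path (the site root) gets a placeholder file name.
    parts = PurePosixPath(path or "index")
    # Drop an HTML extension so the file ends in .md only.
    if parts.suffix in {".html", ".htm"}:
        parts = parts.with_suffix("")
    return PurePosixPath(source) / parts.parent / (parts.name + ".md")


print(url_to_relative_path("https://docs.example.com/api/charges", "example"))
print(url_to_relative_path("https://docs.example.com/guide/intro.html", "example"))
print(url_to_relative_path("https://docs.example.com/", "example"))
```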

Creating Custom Fetchers

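from docpull.fetchers.base import BaseFetcher

class MyDocsFetcher(BaseFetcher):
    """Fetcher for a documentation site that publishes a sitemap."""

    def __init__(self, output_dir="./docs/mydocs", **kwargs):
        # Remaining keyword arguments are passed through to BaseFetcher
        super().__init__(output_dir=output_dir, **kwargs)
        self.base_url = "https://docs.example.com"

    def fetch(self):
        # Collect the page URLs listed in the site's sitemap
        urls = self.fetch_sitemap(f"{self.base_url}/sitemap.xml")
        for url in urls:
            # Map each URL to a local file path, then process the page
            output_path = self.url_to_filepath(url)
            self.process_url(url, output_path)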

For parallel fetching, extend ParallelBaseFetcher instead. See examples for more.
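
The idea behind parallel fetching can be illustrated with the standard library alone. The sketch below is not docpull's `ParallelBaseFetcher`, whose interface is not shown here; `process_all` and `fake_process` are hypothetical names, and the worker count of 4 is arbitrary. It fans a list of URLs out to a thread pool and collects per-URL results, recording failures rather than aborting the run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(urls, process, max_workers=4):
    """Run `process(url)` for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while processing it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one page fails
                results[url] = exc
    return results


def fake_process(url):
    """Stand-in for a real download-and-convert step."""
    if url.endswith("/broken"):
        raise ValueError(f"could not fetch {url}")
    return len(url)


urls = [
    "https://docs.example.com/a",
    "https://docs.example.com/b",
    "https://docs.example.com/broken",
]
results = process_all(urls, fake_process)
for url in urls:
    print(url, "->", results[url])
```

Threads suit this workload because page downloads spend most of their time waiting on the network.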

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines on:

Adding new documentation sources
Reporting bugs and requesting features
Development setup and workflow

Documentation

Changelog - Version history
Security Policy - Reporting vulnerabilities
Support Guide - Getting help
Maintenance - Automated workflows

License

MIT License - see LICENSE file for details

Links

PyPI: pypi.org/project/docpull
GitHub: github.com/raintree-technology/docpull
Issues: Report a bug
Pair with: claude-starter - Claude Code template for building AI skills with docpull
