raintree-technology/docpull

docpull

Pull documentation from the web and convert to clean markdown. Perfect for building AI training data, local docs, or Claude Code skills.

Requires Python 3.9+. License: MIT.

Features

  • Fast - Parallel fetching with configurable workers
  • Simple - One command to pull entire documentation sites
  • Clean - Converts HTML to markdown with YAML frontmatter
  • Smart - Skips already-fetched files on re-runs
  • Ready - Pre-built fetchers for popular documentation sites

Quick Start

# Install
pip install docpull

# Pull documentation
docpull --source stripe --output-dir ./docs
docpull --source nextjs --output-dir ./docs

Supported Sources

Source     Description
stripe     Stripe API & payment documentation
nextjs     Next.js framework documentation
plaid      Plaid banking API documentation
bun        Bun runtime documentation
d3         D3.js data visualization library
tailwind   Tailwind CSS framework
react      React JavaScript library

Installation

# Basic installation
pip install docpull

# With YAML config support (quoted so shells such as zsh do not expand the brackets)
pip install "docpull[yaml]"

Usage

Command Line

# Basic usage
docpull --source stripe --output-dir ./docs

# Multiple sources with config file
docpull --config config.yaml

# Custom rate limit (seconds between requests)
docpull --source nextjs --rate-limit 1.0

# Preview without downloading
docpull --source react --dry-run
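
The --rate-limit value above is described as the number of seconds between requests. As a rough illustration of that idea (not docpull's actual implementation), a minimal limiter can be sketched as follows. The RateLimiter class and its clock and sleep parameters are hypothetical names introduced only for this example.

```python
import time


class RateLimiter:
    """Hypothetical helper: keep at least min_interval seconds between calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request; None before the first

    def wait(self):
        """Block until min_interval seconds have passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() before each HTTP request would space requests by at least the configured interval; the first call never sleeps.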

Python API

from docpull import StripeFetcher

fetcher = StripeFetcher(
    output_dir="./docs",
    rate_limit=0.5,
    skip_existing=True
)
fetcher.fetch()
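
The skip_existing option corresponds to the "skips already-fetched files on re-runs" feature. The check behind such an option might look like the sketch below. This is an assumption about the general approach, not docpull's code, and needs_fetch is a hypothetical helper name.

```python
from pathlib import Path


def needs_fetch(output_path, skip_existing=True):
    """Hypothetical check: return True when a page should be downloaded.

    A page is skipped only if skip_existing is set and a non-empty file
    already exists at output_path.
    """
    path = Path(output_path)
    if skip_existing and path.is_file() and path.stat().st_size > 0:
        return False
    return True
```

Treating empty files as missing is a design choice in this sketch: it lets an interrupted run be retried without deleting partial output by hand.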

Configuration File

Create config.yaml:

output_dir: ./docs
rate_limit: 0.5
skip_existing: true
log_level: INFO

sources:
  - stripe
  - nextjs
  - react

Run with:

docpull --config config.yaml

Output Format

Each page is saved as markdown with YAML frontmatter:

---
url: https://stripe.com/docs/payments
fetched: 2025-11-07
---

# Payment Intents

Your documentation content here...
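
To consume these files downstream, the frontmatter can be separated from the markdown body. The sketch below uses only the standard library and handles just the flat key: value form shown above; it is not a full YAML parser, and split_frontmatter is a hypothetical name, not part of docpull.

```python
def split_frontmatter(text):
    """Split a saved page into (metadata dict, markdown body).

    Only flat "key: value" lines between the two "---" delimiters are
    understood. Text without a complete frontmatter block is returned as-is.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        # Split on the first colon only, so URL values keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter found
```

For anything beyond flat string values, a real YAML library would be the safer choice.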

Files are organized by URL structure:

docs/
├── stripe/
│   ├── api/
│   │   ├── charges.md
│   │   └── customers.md
│   └── payments/
│       └── payment-intents.md
└── nextjs/
    ├── app/
    │   └── routing.md
    └── pages/
        └── api-routes.md
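
One way such a URL-to-path mapping could work is sketched below. The function name, the source_name argument, and the strip_prefix argument are all assumptions made for illustration; docpull's own url_to_filepath may behave differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url, source_name, strip_prefix=""):
    """Hypothetical mapping from a page URL to a relative .md path."""
    # Drop empty segments and any "." or ".." so the result stays inside the output dir.
    parts = [p for p in urlparse(url).path.split("/") if p not in ("", ".", "..")]
    prefix = [p for p in strip_prefix.split("/") if p]
    if prefix and parts[:len(prefix)] == prefix:
        parts = parts[len(prefix):]
    if not parts:
        parts = ["index"]  # the site root gets a stable file name
    last = parts[-1]
    if last.endswith(".html"):
        last = last[:-len(".html")]
    parts[-1] = last + ".md"
    return PurePosixPath(source_name, *parts)


print(url_to_relative_path(
    "https://stripe.com/docs/payments/payment-intents", "stripe", strip_prefix="/docs"
))
```

With strip_prefix="/docs", the example URL maps to stripe/payments/payment-intents.md, which matches the layout in the tree above.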

Creating Custom Fetchers

from docpull.fetchers.base import BaseFetcher

class MyDocsFetcher(BaseFetcher):
    def __init__(self, output_dir="./docs/mydocs", **kwargs):
        super().__init__(output_dir=output_dir, **kwargs)
        # Root of the documentation site to pull
        self.base_url = "https://docs.example.com"

    def fetch(self):
        # Collect page URLs from the sitemap, then process each one in turn
        urls = self.fetch_sitemap(f"{self.base_url}/sitemap.xml")
        for url in urls:
            output_path = self.url_to_filepath(url)
            self.process_url(url, output_path)

For parallel fetching, extend ParallelBaseFetcher instead. See examples for more.
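
The general idea behind parallel fetching with a configurable number of workers can be illustrated with the standard library alone. The sketch below is not ParallelBaseFetcher's implementation; fetch_all and its fetch_one callback are hypothetical names standing in for whatever per-URL work a fetcher performs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, max_workers=4):
    """Run fetch_one(url) across a thread pool; return (results, errors)."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so outcomes can be attributed.
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not stop the run
                errors[url] = exc
    return results, errors
```

Threads suit this workload because fetching is dominated by network waits. A real fetcher would also need to share its rate limiting across workers so that parallelism does not defeat the configured delay.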

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines on:

  • Adding new documentation sources
  • Reporting bugs and requesting features
  • Development setup and workflow

License

MIT License - see LICENSE file for details
