Generic Documentation Crawler and RAG System

This project provides tools for creating a RAG (Retrieval Augmented Generation) system for any documentation website. The system consists of two main components:

  1. Generic Documentation Crawler (generic_docs_crawler.py) - Scrapes documentation from any specified URL and saves it in a structured format
  2. Generic Documentation Expert (generic_docs_expert.py) - Implements a RAG system using the scraped documentation

Additionally, a Streamlit UI (generic_docs_streamlit.py) is provided for easy interaction with the documentation.

Features

  • Works with any documentation website - just provide the URL
  • Automatically chunks content for better retrieval
  • Uses AI to generate titles and summaries for each chunk
  • Implements vector search for accurate document retrieval
  • Provides both command-line and Streamlit web interfaces
  • Saves documentation in reusable JSON format
  • Performance optimized with embedding caching and batch processing
  • Progress bars and detailed logging for better visibility
  • Recursive crawling for sites without sitemaps
  • Special handling for GitHub repository documentation
  • Domain-specific embedding caching

Prerequisites

  • Python 3.11+
  • OpenAI API key (for embeddings and question answering)
  • Dependencies listed in requirements.txt

Installation

  1. Make sure you have all the dependencies installed:
pip install -r requirements.txt
  2. Install Playwright browsers (used by the crawler):
python -m playwright install
  3. Set up your OpenAI API key:
# Linux/macOS
export OPENAI_API_KEY=your-api-key-here

# Windows
set OPENAI_API_KEY=your-api-key-here

Or create a .env file with the following content:

OPENAI_API_KEY=your-api-key-here

Usage

1. Generic Documentation Crawler

To crawl a documentation website:

python generic_docs_crawler.py https://example-docs-site.com

Options:

  • --output or -o: Directory to save the output (default: "output")
  • --concurrency or -c: Maximum number of concurrent requests (default: 5)
  • --recursive or -r: Enable recursive crawling of all links (for sites without sitemaps)
  • --depth or -d: Maximum link depth for recursive crawling (default: 3)
  • --docs-only: Only follow documentation-related links (recommended for GitHub repositories)
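
For example, to crawl a site recursively to a depth of 2 with 10 concurrent requests and save the results to a custom directory (the URL below is a placeholder):

python generic_docs_crawler.py https://example-docs-site.com --recursive --depth 2 --concurrency 10 --output my_docs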

Standard Crawling Mode

By default, the crawler will:

  1. Look for a sitemap.xml file to discover documentation pages
  2. Fall back to crawling only the base URL if no sitemap is found
  3. Extract content from each page
  4. Generate AI-powered titles and summaries
  5. Save the content in the output directory
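
The sketch below illustrates what the sitemap discovery step can look like; the helper shown is hypothetical and is not the actual code in generic_docs_crawler.py:

import requests
from urllib.parse import urljoin
from xml.etree import ElementTree

def discover_urls(base_url: str) -> list[str]:
    # Hypothetical helper: try the site's sitemap first, fall back to the base URL.
    sitemap_url = urljoin(base_url, "/sitemap.xml")
    response = requests.get(sitemap_url, timeout=30)
    if response.status_code != 200:
        return [base_url]  # No sitemap: crawl only the base URL
    root = ElementTree.fromstring(response.content)
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", namespace)]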

Recursive Crawling Mode

When using the --recursive or -r flag, the crawler will:

  1. Start with the base URL
  2. Extract all links on the page that belong to the same domain
  3. Recursively follow those links up to the specified depth
  4. Process and save each page it discovers
  5. Avoid duplicate pages by tracking visited URLs
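
A simplified sketch of the same-domain filtering and visited-URL tracking described above (illustrative only; the actual crawler, which renders pages with Playwright, may differ):

from urllib.parse import urljoin, urlparse

def filter_links(page_url: str, hrefs: list[str], visited: set[str]) -> list[str]:
    # Keep only links that stay on the same domain and have not been crawled yet.
    domain = urlparse(page_url).netloc
    next_links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == domain and absolute not in visited:
            visited.add(absolute)
            next_links.append(absolute)
    return next_links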

For GitHub repositories and other sites with extensive navigation, it's recommended to use the --docs-only flag to focus only on documentation pages.

GitHub Repository Support

The crawler has special handling for GitHub documentation repositories:

python generic_docs_crawler.py https://github.com/username/repo/tree/main/docs --recursive --docs-only

When crawling GitHub repositories, the system will:

  1. Detect the GitHub repository structure automatically
  2. Use the GitHub API to get a list of documentation files in Markdown format
  3. Fetch raw content directly from raw.githubusercontent.com for clean documentation without navigation elements
  4. Properly handle repository directory structures and file relationships
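
The sketch below shows the kind of URL translation this involves; the exact logic lives in generic_docs_crawler.py and may differ:

from urllib.parse import urlparse

def github_tree_to_api_and_raw(tree_url: str) -> tuple[str, str]:
    # Example input: https://github.com/owner/repo/tree/main/docs
    # API listing:    https://api.github.com/repos/owner/repo/contents/docs?ref=main
    # Raw prefix:     https://raw.githubusercontent.com/owner/repo/main/docs
    owner, repo, _, branch, *path = urlparse(tree_url).path.strip("/").split("/")
    subpath = "/".join(path)
    api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{subpath}?ref={branch}"
    raw_prefix = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{subpath}"
    return api_url, raw_prefix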

This is particularly useful for:

  • Documentation sites without sitemaps
  • GitHub repositories with documentation
  • Sites with interconnected documentation pages
  • Single-page applications where content is dynamically loaded

2. Generic Documentation Expert (CLI)

To use the RAG system via command line:

python generic_docs_expert.py --docs_dir output

Options:

  • --docs_dir: Directory containing the documentation (default: "output")
  • --domain: Specific domain to restrict queries to (improves performance and relevance)
  • --verbose or -v: Enable more detailed logging (useful for debugging)

This will start an interactive command-line interface where you can ask questions about the documentation.
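
For example, to restrict answers to a single crawled domain with verbose logging enabled (the domain value shown is illustrative; use one of the domain directory names created by the crawler):

python generic_docs_expert.py --docs_dir output --domain raw_githubusercontent_com --verbose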

3. Streamlit UI

For a more user-friendly interface:

streamlit run generic_docs_streamlit.py

The Streamlit UI provides:

  • Chat interface for asking questions about the documentation
  • Browse functionality to explore documentation sources
  • Search feature to find content by keyword
  • Domain-based navigation of documentation pages
  • Domain selection for focused queries
  • Cache management options

Performance Features

This system includes several performance optimizations to ensure fast and responsive interactions:

Embedding Caching

  • Document embeddings are cached to avoid repeated API calls
  • Cache is domain-specific and automatically loaded/saved between sessions
  • Dramatically improves response time for subsequent queries
  • UI includes cache management for clearing specific domain caches
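
A minimal sketch of how a domain-specific cache like this can work (the class and method names are illustrative; see the project's EmbeddingCache class for the actual implementation):

import os
import pickle

class SimpleEmbeddingCache:
    # Illustrative stand-in for the project's EmbeddingCache class.
    def __init__(self, domain: str, cache_dir: str = ".cache"):
        os.makedirs(cache_dir, exist_ok=True)
        self.path = os.path.join(cache_dir, f"embeddings_{domain}.pkl")
        self.cache = {}
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.cache = pickle.load(f)  # Reuse embeddings from earlier sessions

    def get(self, text: str):
        # Return a cached embedding, or None if this text has not been embedded yet.
        return self.cache.get(text)

    def set(self, text: str, embedding: list) -> None:
        self.cache[text] = embedding

    def save(self) -> None:
        with open(self.path, "wb") as f:
            pickle.dump(self.cache, f)  # Persist the cache for the next session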

Batch Processing

  • Document embeddings are processed in batches to reduce API calls
  • More efficient than processing each document individually
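
A sketch of what batched embedding requests look like with the OpenAI client (the model name is an assumption and may differ from the one the project uses):

from openai import AsyncOpenAI

async def embed_in_batches(client: AsyncOpenAI, texts: list[str], batch_size: int = 100) -> list[list[float]]:
    # One API call per batch instead of one call per document.
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-small",  # assumed model; adjust to match the project
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings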

Progress Tracking

  • Progress bars show real-time status during lengthy operations
  • Time measurements indicate system performance

Logging

  • Detailed logs help track system behavior
  • Verbose mode provides additional information for debugging

Output Structure

The documentation is saved in the following structure:

output/
├── domain_name/
│   ├── index.json              # Index of all documentation pages
│   ├── page1_title_0.json      # First chunk of page1
│   ├── page1_title_1.json      # Second chunk of page1
│   ├── page2_title_0.json      # First chunk of page2
│   └── ...
└── ...

Each chunk file contains:

  • URL of the page
  • Title of the chunk
  • Summary of the chunk
  • Actual content
  • Metadata (chunk number, etc.)
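
Chunk files can be inspected with the standard json module; the key names below are assumptions based on the fields listed above, so check a real chunk file for the exact names:

import json

with open("output/domain_name/page1_title_0.json") as f:
    chunk = json.load(f)

# Key names are illustrative.
print(chunk.get("url"))
print(chunk.get("title"))
print(chunk.get("summary"))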

Cache Structure

Embedding caches are stored in a .cache directory:

.cache/
└── embeddings_domain_name.pkl   # Domain-specific serialized embeddings cache

These cache files significantly improve performance on subsequent runs by avoiding redundant API calls.

Examples

Example 1: Crawling Python Documentation

python generic_docs_crawler.py https://docs.python.org/3/

Example 2: Crawling GitHub Repository Documentation

python generic_docs_crawler.py https://github.com/ollama/ollama/tree/main/docs --recursive --docs-only

Example 3: Asking a Question

python generic_docs_expert.py --docs_dir output/raw_githubusercontent_com

Then, at the prompt:

Question: How do I use the Ollama API?

Example 4: Testing Performance Improvement

To see the caching benefits in action, run:

python test_generic_expert.py

This will run a few sample questions and then repeat one to show the speed improvement.

Example 5: Using the Components Programmatically

You can also use the crawler and expert programmatically in your Python code:

Using the Crawler in Python Code

import asyncio
from openai import AsyncOpenAI
from generic_docs_crawler import recursive_crawl, process_and_save_document

async def crawl_documentation():
    # Initialize OpenAI client
    openai_client = AsyncOpenAI(api_key="your-api-key-here")
  
    # Crawl a documentation site
    base_url = "https://github.com/ollama/ollama/tree/main/docs"
    output_dir = "output"
    max_depth = 3
    max_concurrent = 5
  
    await recursive_crawl(base_url, output_dir, max_depth, max_concurrent)
  
    print("Crawl complete!")

# Run the crawl
if __name__ == "__main__":
    asyncio.run(crawl_documentation())

Using the Expert in Python Code

import asyncio
from openai import AsyncOpenAI
from generic_docs_expert import GenericDocsExpert

async def ask_question():
    # Initialize OpenAI client
    openai_client = AsyncOpenAI(api_key="your-api-key-here")
  
    # Initialize the expert
    docs_dir = "output/raw_githubusercontent_com"
    expert = GenericDocsExpert(openai_client, docs_dir)
  
    # Ask a question
    question = "How do I use the Ollama API?"
    answer = await expert.answer_question(question)
  
    print(f"Question: {question}")
    print(f"Answer: {answer}")

# Run the question-answering function
if __name__ == "__main__":
    asyncio.run(ask_question())

Extending the System

The generic documentation system is designed to be easily extensible:

  • Modify generic_docs_crawler.py to add support for different website structures
  • Extend generic_docs_expert.py with additional retrieval methods
  • Customize generic_docs_streamlit.py to add new visualizations or features
  • Adjust the caching strategy in the EmbeddingCache class for different requirements

Troubleshooting

  • The first run may be slow because embeddings are created for every document; subsequent runs reuse the cache
  • Check that the .cache directory exists and is writable
  • For memory issues with large documentation sets, reduce the batch size in the retrieve_relevant_documentation method
  • When crawling GitHub repositories, make sure your IP address has not been rate-limited by the GitHub API

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This docs crawler uses Crawl4AI and is based on Cole Medin's agent.