Generic Documentation Crawler and RAG System

This project provides tools for creating a RAG (Retrieval Augmented Generation) system for any documentation website. The system consists of two main components:

  1. Generic Documentation Crawler (generic_docs_crawler.py) - Scrapes documentation from any specified URL and saves it in a structured format
  2. Generic Documentation Expert (generic_docs_expert.py) - Implements a RAG system using the scraped documentation

Additionally, a Streamlit UI (generic_docs_streamlit.py) is provided for easy interaction with the documentation.

Features

  • Works with any documentation website - just provide the URL
  • Automatically chunks content for better retrieval
  • Uses AI to generate titles and summaries for each chunk
  • Implements vector search for accurate document retrieval
  • Provides both command-line and Streamlit web interfaces
  • Saves documentation in reusable JSON format
  • Performance optimized with embedding caching and batch processing
  • Progress bars and detailed logging for better visibility
  • Recursive crawling for sites without sitemaps
  • Special handling for GitHub repository documentation
  • Domain-specific embedding caching

Prerequisites

  • Python 3.11+
  • OpenAI API key (for embeddings and question answering)
  • Dependencies listed in requirements.txt

Installation

  1. Make sure you have all the dependencies installed:
pip install -r requirements.txt
  2. Install Playwright browsers (used by the crawler):
python -m playwright install
  3. Set up your OpenAI API key:
# Linux/macOS
export OPENAI_API_KEY=your-api-key-here

# Windows
set OPENAI_API_KEY=your-api-key-here

Or create a .env file with the following content:

OPENAI_API_KEY=your-api-key-here

Usage

1. Generic Documentation Crawler

To crawl a documentation website:

python generic_docs_crawler.py https://example-docs-site.com

Options:

  • --output or -o: Directory to save the output (default: "output")
  • --concurrency or -c: Maximum number of concurrent requests (default: 5)
  • --recursive or -r: Enable recursive crawling of all links (for sites without sitemaps)
  • --depth or -d: Maximum link depth for recursive crawling (default: 3)
  • --docs-only: Only follow documentation-related links (recommended for GitHub repositories)
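
For example, to crawl a site recursively to a depth of 2 with 10 concurrent requests and save the results to a custom directory (the URL below is a placeholder):

python generic_docs_crawler.py https://example-docs-site.com --recursive --depth 2 --concurrency 10 --output my_docs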

Standard Crawling Mode

By default, the crawler will:

  1. Look for a sitemap.xml file to discover documentation pages
  2. Fall back to crawling only the base URL if no sitemap is found
  3. Extract content from each page
  4. Generate AI-powered titles and summaries
  5. Save the content in the output directory
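
The sketch below illustrates what the sitemap discovery step can look like; the helper shown is hypothetical and is not the actual code in generic_docs_crawler.py:

import requests
from urllib.parse import urljoin
from xml.etree import ElementTree

def discover_urls(base_url: str) -> list[str]:
    # Hypothetical helper: try the site's sitemap first, fall back to the base URL.
    sitemap_url = urljoin(base_url, "/sitemap.xml")
    response = requests.get(sitemap_url, timeout=30)
    if response.status_code != 200:
        return [base_url]  # No sitemap: crawl only the base URL
    root = ElementTree.fromstring(response.content)
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", namespace)]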

Recursive Crawling Mode

When using the --recursive or -r flag, the crawler will:

  1. Start with the base URL
  2. Extract all links on the page that belong to the same domain
  3. Recursively follow those links up to the specified depth
  4. Process and save each page it discovers
  5. Avoid duplicate pages by tracking visited URLs
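
A simplified sketch of the same-domain filtering and visited-URL tracking described above (illustrative only; the actual crawler, which renders pages with Playwright, may differ):

from urllib.parse import urljoin, urlparse

def filter_links(page_url: str, hrefs: list[str], visited: set[str]) -> list[str]:
    # Keep only links that stay on the same domain and have not been crawled yet.
    domain = urlparse(page_url).netloc
    next_links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == domain and absolute not in visited:
            visited.add(absolute)
            next_links.append(absolute)
    return next_links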

For GitHub repositories and other sites with extensive navigation, it's recommended to use the --docs-only flag to focus only on documentation pages.

GitHub Repository Support

The crawler has special handling for GitHub documentation repositories:

python generic_docs_crawler.py https://github.com/username/repo/tree/main/docs --recursive --docs-only

When crawling GitHub repositories, the system will:

  1. Detect the GitHub repository structure automatically
  2. Use the GitHub API to get a list of documentation files in Markdown format
  3. Fetch raw content directly from raw.githubusercontent.com for clean documentation without navigation elements
  4. Properly handle repository directory structures and file relationships
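
The sketch below shows the kind of URL translation this involves; the exact logic lives in generic_docs_crawler.py and may differ:

from urllib.parse import urlparse

def github_tree_to_api_and_raw(tree_url: str) -> tuple[str, str]:
    # Example input: https://github.com/owner/repo/tree/main/docs
    # API listing:    https://api.github.com/repos/owner/repo/contents/docs?ref=main
    # Raw prefix:     https://raw.githubusercontent.com/owner/repo/main/docs
    owner, repo, _, branch, *path = urlparse(tree_url).path.strip("/").split("/")
    subpath = "/".join(path)
    api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{subpath}?ref={branch}"
    raw_prefix = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{subpath}"
    return api_url, raw_prefix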

This is particularly useful for:

  • Documentation sites without sitemaps
  • GitHub repositories with documentation
  • Sites with interconnected documentation pages
  • Single-page applications where content is dynamically loaded

2. Generic Documentation Expert (CLI)

To use the RAG system via command line:

python generic_docs_expert.py --docs_dir output

Options:

  • --docs_dir: Directory containing the documentation (default: "output")
  • --domain: Specific domain to restrict queries to (improves performance and relevance)
  • --verbose or -v: Enable more detailed logging (useful for debugging)

This will start an interactive command-line interface where you can ask questions about the documentation.
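
For example, to restrict answers to a single crawled domain with verbose logging enabled (the domain value shown is illustrative; use one of the domain directory names created by the crawler):

python generic_docs_expert.py --docs_dir output --domain raw_githubusercontent_com --verbose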

3. Streamlit UI

For a more user-friendly interface:

streamlit run generic_docs_streamlit.py

The Streamlit UI provides:

  • Chat interface for asking questions about the documentation
  • Browse functionality to explore documentation sources
  • Search feature to find content by keyword
  • Domain-based navigation of documentation pages
  • Domain selection for focused queries
  • Cache management options

Performance Features

This system includes several performance optimizations to ensure fast and responsive interactions:

Embedding Caching

  • Document embeddings are cached to avoid repeated API calls
  • Cache is domain-specific and automatically loaded/saved between sessions
  • Dramatically improves response time for subsequent queries
  • UI includes cache management for clearing specific domain caches
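
A minimal sketch of how a domain-specific cache like this can work (the class and method names are illustrative; see the project's EmbeddingCache class for the actual implementation):

import os
import pickle

class SimpleEmbeddingCache:
    # Illustrative stand-in for the project's EmbeddingCache class.
    def __init__(self, domain: str, cache_dir: str = ".cache"):
        os.makedirs(cache_dir, exist_ok=True)
        self.path = os.path.join(cache_dir, f"embeddings_{domain}.pkl")
        self.cache = {}
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.cache = pickle.load(f)  # Reuse embeddings from earlier sessions

    def get(self, text: str):
        # Return a cached embedding, or None if this text has not been embedded yet.
        return self.cache.get(text)

    def set(self, text: str, embedding: list) -> None:
        self.cache[text] = embedding

    def save(self) -> None:
        with open(self.path, "wb") as f:
            pickle.dump(self.cache, f)  # Persist the cache for the next session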

Batch Processing

  • Document embeddings are processed in batches to reduce API calls
  • More efficient than processing each document individually
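
A sketch of what batched embedding requests look like with the OpenAI client (the model name is an assumption and may differ from the one the project uses):

from openai import AsyncOpenAI

async def embed_in_batches(client: AsyncOpenAI, texts: list[str], batch_size: int = 100) -> list[list[float]]:
    # One API call per batch instead of one call per document.
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-small",  # assumed model; adjust to match the project
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings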

Progress Tracking

  • Progress bars show real-time status during lengthy operations
  • Time measurements indicate system performance

Logging

  • Detailed logs help track system behavior
  • Verbose mode provides additional information for debugging

Output Structure

The documentation is saved in the following structure:

output/
├── domain_name/
│   ├── index.json              # Index of all documentation pages
│   ├── page1_title_0.json      # First chunk of page1
│   ├── page1_title_1.json      # Second chunk of page1
│   ├── page2_title_0.json      # First chunk of page2
│   └── ...
└── ...

Each chunk file contains:

  • URL of the page
  • Title of the chunk
  • Summary of the chunk
  • Actual content
  • Metadata (chunk number, etc.)
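
Chunk files can be inspected with the standard json module; the key names below are assumptions based on the fields listed above, so check a real chunk file for the exact names:

import json

with open("output/domain_name/page1_title_0.json") as f:
    chunk = json.load(f)

# Key names are illustrative.
print(chunk.get("url"))
print(chunk.get("title"))
print(chunk.get("summary"))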

Cache Structure

Embedding caches are stored in a .cache directory:

.cache/
└── embeddings_domain_name.pkl   # Domain-specific serialized embeddings cache

These cache files significantly improve performance on subsequent runs by avoiding redundant API calls.

Examples

Example 1: Crawling Python Documentation

python generic_docs_crawler.py https://docs.python.org/3/

Example 2: Crawling GitHub Repository Documentation

python generic_docs_crawler.py https://github.com/ollama/ollama/tree/main/docs --recursive --docs-only

Example 3: Asking a Question

python generic_docs_expert.py --docs_dir output/raw_githubusercontent_com

Then, at the prompt:

Question: How do I use the Ollama API?

Example 4: Testing Performance Improvement

To see the caching benefits in action, run:

python test_generic_expert.py

This will run a few sample questions and then repeat one to show the speed improvement.

Example 5: Using the Components Programmatically

You can also use the crawler and expert programmatically in your Python code:

Using the Crawler in Python Code

import asyncio
from openai import AsyncOpenAI
from generic_docs_crawler import recursive_crawl, process_and_save_document

async def crawl_documentation():
    # Initialize OpenAI client
    openai_client = AsyncOpenAI(api_key="your-api-key-here")
  
    # Crawl a documentation site
    base_url = "https://github.com/ollama/ollama/tree/main/docs"
    output_dir = "output"
    max_depth = 3
    max_concurrent = 5
  
    await recursive_crawl(base_url, output_dir, max_depth, max_concurrent)
  
    print("Crawl complete!")

# Run the crawl
if __name__ == "__main__":
    asyncio.run(crawl_documentation())

Using the Expert in Python Code

import asyncio
from openai import AsyncOpenAI
from generic_docs_expert import GenericDocsExpert

async def ask_question():
    # Initialize OpenAI client
    openai_client = AsyncOpenAI(api_key="your-api-key-here")
  
    # Initialize the expert
    docs_dir = "output/raw_githubusercontent_com"
    expert = GenericDocsExpert(openai_client, docs_dir)
  
    # Ask a question
    question = "How do I use the Ollama API?"
    answer = await expert.answer_question(question)
  
    print(f"Question: {question}")
    print(f"Answer: {answer}")

# Run the question-answering function
if __name__ == "__main__":
    asyncio.run(ask_question())

Extending the System

The generic documentation system is designed to be easily extensible:

  • Modify generic_docs_crawler.py to add support for different website structures
  • Extend generic_docs_expert.py with additional retrieval methods
  • Customize generic_docs_streamlit.py to add new visualizations or features
  • Adjust the caching strategy in the EmbeddingCache class for different requirements

Troubleshooting

  • The first run may be slow because embeddings are created for every document; subsequent runs reuse the cache
  • Check that the .cache directory exists and is writable
  • For memory issues with large documentation sets, reduce the batch size in the retrieve_relevant_documentation method
  • When crawling GitHub repositories, make sure your IP address has not been rate-limited by the GitHub API

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This docs crawler uses Crawl4AI and is based on Cole Medin's agent.