This project provides tools for creating a RAG (Retrieval Augmented Generation) system for any documentation website. The system consists of two main components:
- Generic Documentation Crawler (`generic_docs_crawler.py`) - Scrapes documentation from any specified URL and saves it in a structured format
- Generic Documentation Expert (`generic_docs_expert.py`) - Implements a RAG system using the scraped documentation

Additionally, a Streamlit UI (`generic_docs_streamlit.py`) is provided for easy interaction with the documentation.
- Works with any documentation website - just provide the URL
- Automatically chunks content for better retrieval
- Uses AI to generate titles and summaries for each chunk
- Implements vector search for accurate document retrieval
- Provides both command-line and Streamlit web interfaces
- Saves documentation in reusable JSON format
- Performance optimized with embedding caching and batch processing
- Progress bars and detailed logging for better visibility
- Recursive crawling for sites without sitemaps
- Special handling for GitHub repository documentation
- Domain-specific embedding caching
- Python 3.11+
- OpenAI API key (for embeddings and question answering)
- Dependencies listed in `requirements.txt`
- Make sure you have all the dependencies installed:

```bash
pip install -r requirements.txt
```

- Install Playwright browsers (used by the crawler):

```bash
python -m playwright install
```

- Set up your OpenAI API key:

```bash
# Linux/macOS
export OPENAI_API_KEY=your-api-key-here

# Windows
set OPENAI_API_KEY=your-api-key-here
```

Or create a `.env` file with the following content:

```
OPENAI_API_KEY=your-api-key-here
```
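If you go the `.env` route, the key can be picked up in Python with a few lines like the following (a minimal sketch assuming the `python-dotenv` package; the project itself may load the file differently):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Read OPENAI_API_KEY from a .env file in the working directory (if present),
# falling back to any value already set in the environment.
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; export it or add it to .env")
```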
To crawl a documentation website:
```bash
python generic_docs_crawler.py https://example-docs-site.com
```

Options:

- `--output` or `-o`: Directory to save the output (default: "output")
- `--concurrency` or `-c`: Maximum number of concurrent requests (default: 5)
- `--recursive` or `-r`: Enable recursive crawling of all links (for sites without sitemaps)
- `--depth` or `-d`: Maximum link depth for recursive crawling (default: 3)
- `--docs-only`: Only follow documentation-related links (recommended for GitHub repositories)
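For reference, the documented flags map onto a command-line parser roughly like this (a sketch using `argparse`; the crawler's actual argument handling may differ):

```python
import argparse

# Sketch of an argument parser matching the documented crawler options.
parser = argparse.ArgumentParser(description="Crawl a documentation website")
parser.add_argument("url", help="Base URL of the documentation site")
parser.add_argument("--output", "-o", default="output",
                    help="Directory to save the output")
parser.add_argument("--concurrency", "-c", type=int, default=5,
                    help="Maximum number of concurrent requests")
parser.add_argument("--recursive", "-r", action="store_true",
                    help="Enable recursive crawling of all links")
parser.add_argument("--depth", "-d", type=int, default=3,
                    help="Maximum link depth for recursive crawling")
parser.add_argument("--docs-only", action="store_true",
                    help="Only follow documentation-related links")
args = parser.parse_args()
```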
By default, the crawler will:
- Look for a `sitemap.xml` file to discover documentation pages (see the sketch after this list)
- If no sitemap is found, it will just crawl the base URL
- Extract content from each page
- Generate AI-powered titles and summaries
- Save the content in the output directory
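The sitemap step can be pictured with this minimal sketch, which fetches `sitemap.xml` and extracts the listed page URLs (it uses `requests` and the standard-library XML parser purely for illustration; the crawler's real implementation may differ):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

import requests  # assumed available; the crawler itself may use another HTTP client


def discover_sitemap_urls(base_url: str) -> list[str]:
    """Fetch <base_url>/sitemap.xml and return the page URLs it lists."""
    sitemap_url = urljoin(base_url.rstrip("/") + "/", "sitemap.xml")
    response = requests.get(sitemap_url, timeout=30)
    if response.status_code != 200:
        return []  # no sitemap found; fall back to crawling the base URL only

    root = ET.fromstring(response.content)
    # Sitemap entries live in namespaced <url><loc>...</loc></url> elements.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]


if __name__ == "__main__":
    print(discover_sitemap_urls("https://example-docs-site.com"))
```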
When using the `--recursive` or `-r` flag, the crawler will (a simplified sketch follows the list):
- Start with the base URL
- Extract all links on the page that belong to the same domain
- Recursively follow those links up to the specified depth
- Process and save each page it discovers
- Avoid duplicate pages by tracking visited URLs
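Here is a stripped-down sketch of that same-domain, depth-limited crawl with duplicate tracking (using `requests` and `BeautifulSoup` for brevity; the real crawler is asynchronous and browser-based, so treat this only as an illustration of the logic):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup  # assumed available for this illustration


def recursive_crawl_sketch(base_url: str, max_depth: int = 3) -> set[str]:
    """Visit pages on the same domain as base_url, up to max_depth link hops."""
    domain = urlparse(base_url).netloc
    visited: set[str] = set()
    frontier = [(base_url, 0)]

    while frontier:
        url, depth = frontier.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)  # track visited URLs to avoid duplicate pages

        response = requests.get(url, timeout=30)
        if response.status_code != 200:
            continue
        # In the real crawler, this is where the page is processed and saved.
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"]).split("#")[0]
            if urlparse(target).netloc == domain and target not in visited:
                frontier.append((target, depth + 1))

    return visited
```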
For GitHub repositories and other sites with extensive navigation, it's recommended to use the `--docs-only` flag to focus only on documentation pages.
The crawler has special handling for GitHub documentation repositories:
```bash
python generic_docs_crawler.py https://github.com/username/repo/tree/main/docs --recursive --docs-only
```

When crawling GitHub repositories, the system will:
- Detect the GitHub repository structure automatically
- Use the GitHub API to get a list of documentation files in Markdown format
- Fetch raw content directly from raw.githubusercontent.com for clean documentation without navigation elements (see the sketch after this list)
- Properly handle repository directory structures and file relationships
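A minimal sketch of listing Markdown files through the GitHub contents API and fetching their raw text from raw.githubusercontent.com (unauthenticated and simplified; the crawler's actual GitHub handling may differ):

```python
import requests  # assumed available for this illustration


def list_markdown_files(owner: str, repo: str,
                        path: str = "docs", branch: str = "main") -> list[dict]:
    """Return Markdown entries under `path` using the GitHub contents API."""
    api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={branch}"
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # unauthenticated requests are rate-limited by GitHub
    return [item for item in response.json()
            if item["type"] == "file" and item["name"].endswith(".md")]


def fetch_raw_markdown(owner: str, repo: str, branch: str, file_path: str) -> str:
    """Fetch the raw file content, free of GitHub's navigation chrome."""
    raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{file_path}"
    return requests.get(raw_url, timeout=30).text


if __name__ == "__main__":
    for entry in list_markdown_files("ollama", "ollama", "docs", "main"):
        text = fetch_raw_markdown("ollama", "ollama", "main", entry["path"])
        print(entry["path"], len(text), "characters")
```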
This is particularly useful for:
- Documentation sites without sitemaps
- GitHub repositories with documentation
- Sites with interconnected documentation pages
- Single-page applications where content is dynamically loaded
To use the RAG system via command line:
```bash
python generic_docs_expert.py --docs_dir output
```

Options:

- `--docs_dir`: Directory containing the documentation (default: "output")
- `--domain`: Specific domain to restrict queries to (improves performance and relevance)
- `--verbose` or `-v`: Enable more detailed logging (useful for debugging)
This will start an interactive command-line interface where you can ask questions about the documentation.
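Under the hood, retrieval is embedding-based vector search: embed the question, score each chunk by cosine similarity, and keep the top matches. A minimal sketch of that idea (the function names and the embedding model shown here are illustrative, not the expert's actual internals):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with OpenAI's embeddings endpoint."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def top_k_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the question by cosine similarity."""
    doc_vectors = embed(chunks)
    query_vector = embed([question])[0]
    scores = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```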
For a more user-friendly interface:
```bash
streamlit run generic_docs_streamlit.py
```

The Streamlit UI provides:
- Chat interface for asking questions about the documentation
- Browse functionality to explore documentation sources
- Search feature to find content by keyword
- Domain-based navigation of documentation pages
- Domain selection for focused queries
- Cache management options
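For orientation, the chat portion of such a UI can be reduced to a few lines like the following (a bare-bones sketch only; `generic_docs_streamlit.py` is more feature-complete and its internals may differ):

```python
import asyncio

import streamlit as st
from openai import AsyncOpenAI
from generic_docs_expert import GenericDocsExpert

st.title("Documentation Chat")


# Cache the expert so embeddings are not rebuilt on every Streamlit rerun.
@st.cache_resource
def get_expert() -> GenericDocsExpert:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    return GenericDocsExpert(client, "output")


question = st.chat_input("Ask a question about the documentation")
if question:
    with st.chat_message("user"):
        st.write(question)
    answer = asyncio.run(get_expert().answer_question(question))
    with st.chat_message("assistant"):
        st.write(answer)
```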
This system includes several performance optimizations to ensure fast and responsive interactions (a simplified sketch of the caching and batching approach follows the list):
- Document embeddings are cached to avoid repeated API calls
- Cache is domain-specific and automatically loaded/saved between sessions
- Dramatically improves response time for subsequent queries
- UI includes cache management for clearing specific domain caches
- Document embeddings are processed in batches to reduce API calls
- More efficient than processing each document individually
- Progress bars show real-time status during lengthy operations
- Time measurements indicate system performance
- Detailed logs help track system behavior
- Verbose mode provides additional information for debugging
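A minimal sketch of a domain-specific, pickle-backed embedding cache combined with batched embedding calls (the class layout, file naming, and embedding model here are illustrative; the project's `EmbeddingCache` class may be organized differently):

```python
import pickle
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CACHE_DIR = Path(".cache")


class EmbeddingCacheSketch:
    """Illustrative domain-specific cache persisted to .cache/embeddings_<domain>.pkl."""

    def __init__(self, domain: str):
        CACHE_DIR.mkdir(exist_ok=True)
        self.path = CACHE_DIR / f"embeddings_{domain}.pkl"
        self.cache: dict[str, list[float]] = (
            pickle.loads(self.path.read_bytes()) if self.path.exists() else {}
        )

    def embed_batch(self, texts: list[str], batch_size: int = 100) -> list[list[float]]:
        """Embed only uncached texts, in batches, then persist the cache."""
        missing = [t for t in texts if t not in self.cache]
        for start in range(0, len(missing), batch_size):
            batch = missing[start:start + batch_size]
            response = client.embeddings.create(
                model="text-embedding-3-small", input=batch
            )
            for text, item in zip(batch, response.data):
                self.cache[text] = item.embedding
        self.path.write_bytes(pickle.dumps(self.cache))
        return [self.cache[t] for t in texts]
```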
The documentation is saved in the following structure:
```
output/
├── domain_name/
│   ├── index.json          # Index of all documentation pages
│   ├── page1_title_0.json  # First chunk of page1
│   ├── page1_title_1.json  # Second chunk of page1
│   ├── page2_title_0.json  # First chunk of page2
│   └── ...
└── ...
```
Each chunk file contains (see the loading sketch after this list):
- URL of the page
- Title of the chunk
- Summary of the chunk
- Actual content
- Metadata (chunk number, etc.)
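A small sketch of reading the saved output back from disk (the directory name is a placeholder and the exact JSON keys may differ slightly from what is shown):

```python
import json
from pathlib import Path

# Placeholder path; substitute the domain directory the crawler created for you.
domain_dir = Path("output") / "domain_name"

# The index lists the documentation pages saved for this domain.
index = json.loads((domain_dir / "index.json").read_text())
print("index entries:", len(index))

# Each chunk file holds the URL, title, summary, content and chunk metadata.
for chunk_file in sorted(domain_dir.glob("*_0.json")):  # first chunk of each page
    chunk = json.loads(chunk_file.read_text())
    print(chunk.get("url"), "-", chunk.get("title"))
```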
Embedding caches are stored in a `.cache` directory:

```
.cache/
└── embeddings_domain_name.pkl  # Domain-specific serialized embeddings cache
```
These cache files significantly improve performance on subsequent runs by avoiding redundant API calls.
```bash
python generic_docs_crawler.py https://docs.python.org/3/
```

```bash
python generic_docs_crawler.py https://github.com/ollama/ollama/tree/main/docs --recursive --docs-only
```

```bash
python generic_docs_expert.py --docs_dir output/raw_githubusercontent_com
```

Then, at the prompt:

```
Question: How do I use the Ollama API?
```
To see the caching benefits in action, run:
```bash
python test_generic_expert.py
```

This will run a few sample questions and then repeat one to show the speed improvement.
You can also use the crawler and expert programmatically in your Python code:
```python
import asyncio
from openai import AsyncOpenAI
from generic_docs_crawler import recursive_crawl, process_and_save_document

async def crawl_documentation():
    # Initialize OpenAI client
    openai_client = AsyncOpenAI(api_key="your-api-key-here")

    # Crawl a documentation site
    base_url = "https://github.com/ollama/ollama/tree/main/docs"
    output_dir = "output"
    max_depth = 3
    max_concurrent = 5

    await recursive_crawl(base_url, output_dir, max_depth, max_concurrent)
    print("Crawl complete!")

# Run the crawl
if __name__ == "__main__":
    asyncio.run(crawl_documentation())
```

```python
import asyncio
from openai import AsyncOpenAI
from generic_docs_expert import GenericDocsExpert

async def ask_question():
    # Initialize OpenAI client
    openai_client = AsyncOpenAI(api_key="your-api-key-here")

    # Initialize the expert
    docs_dir = "output/raw_githubusercontent_com"
    expert = GenericDocsExpert(openai_client, docs_dir)

    # Ask a question
    question = "How do I use the Ollama API?"
    answer = await expert.answer_question(question)
    print(f"Question: {question}")
    print(f"Answer: {answer}")

# Run the question-answering function
if __name__ == "__main__":
    asyncio.run(ask_question())
```

The generic documentation system is designed to be easily extensible:
- Modify `generic_docs_crawler.py` to add support for different website structures
- Extend `generic_docs_expert.py` with additional retrieval methods (see the sketch below)
- Customize `generic_docs_streamlit.py` to add new visualizations or features
- Adjust the caching strategy in the `EmbeddingCache` class for different requirements
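As one example of that extensibility, a hypothetical subclass could wrap the expert's answering step with extra behaviour (the subclass and its logging are illustrative, not part of the project; only the constructor and `answer_question` shown in the programmatic example above are assumed to exist):

```python
from openai import AsyncOpenAI

from generic_docs_expert import GenericDocsExpert


class LoggingDocsExpert(GenericDocsExpert):
    """Hypothetical extension that records every question asked."""

    async def answer_question(self, question: str) -> str:
        # Keep a simple audit trail before delegating to the base implementation.
        with open("questions.log", "a", encoding="utf-8") as log:
            log.write(question + "\n")
        return await super().answer_question(question)


# Usage mirrors the programmatic example above:
# expert = LoggingDocsExpert(AsyncOpenAI(), "output/raw_githubusercontent_com")
# answer = await expert.answer_question("How do I use the Ollama API?")
```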
- If the system seems slow on first run, this is normal as it creates embeddings for all documents
- Check that the `.cache` directory exists and has proper write permissions
- For memory issues with large documentation sets, adjust the batch size in the `retrieve_relevant_documentation` method
- When crawling GitHub repositories, ensure your IP address hasn't been rate-limited by GitHub's API
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on https://github.com/coleam00/ottomator-agents/tree/main/crawl4AI-agent
- Built on the Crawl4AI library for efficient web crawling
- Utilizes OpenAI's embedding and LLM capabilities for high-quality information retrieval