A hallucination-free AI system that automates extraction, validation, and natural language querying of DeepSeek's API documentation. By combining parallel web crawling (crawl4ai), vector storage (PostgreSQL with pgvector), and Pydantic AI's agentic capabilities, this system enables developers to interact with technical documentation conversationally while enforcing strict schema compliance to eliminate inaccuracies.
This project demonstrates the power of Agentic RAG using Pydantic AI, creating a reliable documentation assistant that:
- Automatically crawls and validates DeepSeek's API documentation using parallel processing
- Enforces strict schema compliance during information extraction and storage
- Provides a hallucination-free chat interface powered by GPT-4o-mini
- Uses autonomous reasoning with built-in accuracy constraints for API endpoint descriptions
Agentic RAG extends traditional RAG systems by incorporating autonomous decision-making capabilities and schema validation:
- Traditional RAG simply retrieves relevant documents and generates responses, risking hallucinations
- Agentic RAG using Pydantic AI can:
- Enforce strict schema compliance for API documentation
- Validate information against known patterns and structures
- Autonomously decide what information to retrieve
- Chain multiple retrievals for complex technical queries
- Reason about the relevance and accuracy of retrieved information
- Dynamically adjust search strategies while maintaining accuracy
- Maintain conversation context across multiple turns
The system consists of several key components:
- Crawls DeepSeek's documentation using `crawl4ai`
- Processes documentation into chunks
- Generates embeddings using OpenAI's embedding model
- Stores processed chunks in Supabase
- PostgreSQL with pgvector extension
- Stores documentation chunks with:
- Text content
- Embeddings
- Metadata
- URLs and titles
- Provides similarity search functionality
- Implements Agentic RAG logic using the Pydantic AI framework
- Defines agent tools and behaviors for:
- Intelligent documentation retrieval
- Context-aware page listing
- Dynamic content fetching
- Autonomous reasoning about user queries
- Uses OpenAI's models for embeddings and responses
- Maintains conversation state and context
- Streamlit-based chat interface
- Real-time streaming responses
- Message history management
- Clean and intuitive UI
- Python 3.8+
- PostgreSQL with pgvector extension
- Supabase account
- OpenAI API key
Create a `.env` file with:

```
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
LLM_MODEL=gpt-4o-mini  # or your preferred OpenAI model
```
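One way the scripts might read these settings at startup (a sketch, not the project's actual loader; the fail-fast check and the `gpt-4o-mini` default are assumptions that mirror the `.env` example above):

```python
import os

def load_settings() -> dict:
    """Read required settings from the environment, failing fast if any are missing."""
    required = ["OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_KEY"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        **{name: os.environ[name] for name in required},
        # LLM_MODEL is optional; the default mirrors the example .env above
        "LLM_MODEL": os.getenv("LLM_MODEL", "gpt-4o-mini"),
    }
```

Failing fast here keeps a missing key from surfacing later as an opaque API error mid-crawl.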
- Clone the repository:

```bash
gh repo clone aravpatel19/deepseek-agentic-rag
cd deepseek-agentic-rag
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up the database:
- Create a Supabase project
  - Run the SQL commands from `deepseek_pages.sql`
- Crawl and process documentation:

```bash
python crawl_deepseek_docs.py
```

- Start the Streamlit interface:

```bash
streamlit run streamlit_deepseek.py
```

- Access the web interface at `http://localhost:8501`
- Documentation Processing:
- The crawler fetches documentation from DeepSeek's sitemap
- Content is split into manageable chunks
- Each chunk gets a title, summary, and embedding vector
- Chunks are stored in Supabase with metadata
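The record built for each chunk might look like the following sketch (field names mirror the storage description above; the table name, example values, and the exact Supabase insert call are assumptions):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DocChunk:
    """One processed documentation chunk, as stored in the database."""
    url: str
    chunk_number: int
    title: str
    summary: str
    content: str
    embedding: list[float]          # 1536-dim vector from text-embedding-3-small
    metadata: dict = field(default_factory=dict)

chunk = DocChunk(
    url="https://api-docs.deepseek.com/",
    chunk_number=0,
    title="DeepSeek API Overview",
    summary="Introduction to the DeepSeek API.",
    content="...",
    embedding=[0.0] * 1536,
    metadata={"source": "deepseek_docs"},
)
row = asdict(chunk)  # dict ready for e.g. supabase.table("deepseek_pages").insert(row)
```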
- Agentic Query Processing:
- User questions are analyzed by the Pydantic AI agent
- The agent autonomously decides on the retrieval strategy
- Questions are converted to embeddings
- Similar documentation chunks are retrieved
- The agent reasons about the relevance of retrieved information
- The LLM generates accurate answers based on the agent's analysis
- Responses are streamed in real-time
- Vector Search:
- Uses cosine similarity to find relevant documentation
- Supports filtering by metadata
- Returns top matches for each query
- Agent can dynamically adjust search parameters based on context
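For illustration, cosine similarity reduces to a few lines of plain Python; the deployed system delegates this computation to pgvector inside the database:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # 0.0 guards against zero vectors

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # → 0.0 (orthogonal)
```

The zero-norm guard matters here because the pipeline can store zero vectors as an embedding-failure fallback.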
```
.
├── README.md
├── crawl_deepseek_docs.py   # Documentation crawler
├── deepseek_agent.py        # RAG agent implementation
├── deepseek_pages.sql       # Database schema
├── streamlit_deepseek.py    # Web interface
└── .env                     # Environment variables
```
Key libraries and frameworks used:
- `pydantic-ai`: Core framework for implementing the agentic RAG system
- `crawl4ai`: Parallel web crawling with semantic filtering capabilities
- `openai`: API access for embeddings (text-embedding-3-small) and LLM (GPT-4o-mini)
- `supabase`: Vector database with pgvector for similarity search
- `streamlit`: Web interface with real-time streaming
- `logfire`: Optional logging configuration
- `asyncio`: Asynchronous operations for improved performance
- `httpx`: Modern HTTP client for async operations
- Uses OpenAI's `text-embedding-3-small` model
- 1536-dimensional embedding vectors
- Fallback to zero vector on embedding errors
- Cosine similarity for vector search
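The zero-vector fallback can be sketched as a thin wrapper around the embedding call (`embed_fn` and `broken_embedder` are illustrative stand-ins, not names from the codebase):

```python
ZERO_VECTOR = [0.0] * 1536  # dimensionality of text-embedding-3-small

def safe_embedding(text: str, embed_fn) -> list[float]:
    """Return an embedding, degrading to a zero vector rather than crashing the pipeline."""
    try:
        return embed_fn(text)
    except Exception as exc:  # broad catch: any API error triggers the fallback
        print(f"Embedding failed ({exc}); storing zero vector")
        return ZERO_VECTOR

def broken_embedder(text: str) -> list[float]:
    raise RuntimeError("rate limited")  # simulate an API failure

vec = safe_embedding("hello", broken_embedder)
```

A zero vector keeps the row insertable and simply scores near the bottom of any cosine-similarity ranking.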
- Intelligent text chunking with respect to:
- Code block boundaries (triple backticks)
- Paragraph breaks (\n\n)
- Sentence boundaries (. )
- Default chunk size: 5000 characters
- Minimum chunk threshold for quality control
- Preserves code block integrity
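A boundary-aware splitter along these lines (a simplified sketch of the strategy described above, not the project's exact implementation; the `min_size` default is an assumption):

```python
def chunk_text(text: str, chunk_size: int = 5000, min_size: int = 100) -> list[str]:
    """Split text near chunk_size, preferring code-fence, paragraph, then
    sentence boundaries so code blocks stay intact."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Try the latest natural boundary inside the window, best first
            for sep in ("```", "\n\n", ". "):
                cut = window.rfind(sep)
                if cut > min_size:  # avoid degenerate tiny chunks
                    # Break before a code fence; break after other separators
                    end = start + cut + (0 if sep == "```" else len(sep))
                    break
        piece = text[start:end].strip()
        if len(piece) >= min_size:  # minimum-chunk quality threshold
            chunks.append(piece)
        start = end
    return chunks
```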
- PostgreSQL with pgvector extension
- Optimized indexes for vector similarity search
- JSON metadata for flexible filtering
- Unique constraints on URL and chunk number
- Row-level security enabled for Supabase integration
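A schema along these lines would satisfy the points above (the authoritative definitions live in `deepseek_pages.sql`; the table and column names here are assumptions):

```sql
create extension if not exists vector;

create table deepseek_pages (
    id           bigserial primary key,
    url          text not null,
    chunk_number integer not null,
    title        text,
    summary      text,
    content      text not null,
    metadata     jsonb default '{}'::jsonb,
    embedding    vector(1536),          -- text-embedding-3-small dimensionality
    unique (url, chunk_number)          -- one row per chunk of a page
);

-- IVFFlat index for approximate cosine-similarity search
create index on deepseek_pages using ivfflat (embedding vector_cosine_ops);
-- GIN index for filtering on JSON metadata
create index on deepseek_pages using gin (metadata);

alter table deepseek_pages enable row level security;
```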
- Graceful degradation for embedding failures
- Comprehensive exception handling in crawler
- Retry mechanism for agent operations (2 retries)
- Detailed error logging and user feedback
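The two-retry behavior can be sketched as a small decorator (an illustration of the mechanism, not the project's code; Pydantic AI agents configure retries declaratively rather than through a wrapper like this):

```python
import functools
import time

def with_retries(retries: int = 2, delay: float = 0.0):
    """Retry a flaky operation up to `retries` extra times before giving up."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise          # out of retries: surface the error
                    time.sleep(delay)  # back off before trying again
        return wrapper
    return decorator

calls = {"n": 0}

@with_retries(retries=2)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = flaky()  # fails twice, succeeds on the third attempt
```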
- Environment Variables
  - All sensitive credentials stored in `.env`
  - API keys never exposed in the frontend
  - Supabase RLS (Row Level Security) enabled
- Database Access
- Read-only public access to documentation
- Protected write operations
- Metadata filtering for security boundaries
- Parallel Processing
- Concurrent document crawling
- Parallel chunk processing
- Asynchronous database operations
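The concurrency pattern can be sketched with `asyncio.gather` plus a semaphore cap (the URLs, the concurrency limit, and the `asyncio.sleep(0)` stand-in for a real HTTP request are all illustrative):

```python
import asyncio

async def crawl_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    """Fetch many pages concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def fetch(url: str) -> str:
        async with sem:                # at most max_concurrency requests in flight
            await asyncio.sleep(0)     # stand-in for a real async HTTP request
            return f"<html from {url}>"

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(crawl_all([f"https://example.com/{i}" for i in range(8)]))
```

The semaphore keeps the crawler polite under rate limits while still overlapping network waits.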
- Database Optimization
- IVFFlat index for vector search
- GIN index for metadata queries
- Optimized chunk size for retrieval
- Create a new Supabase project
- Enable the pgvector extension
- Run the schema from `deepseek_pages.sql`
- Set up row-level security policies
- Set up environment variables
- Install dependencies
- Initialize the database
- Run the crawler
- Start the Streamlit server
- Logging
- Crawler progress and errors
- Embedding generation status
- Database operation results
- Agent interaction logs
- Regular Tasks
- Update documentation chunks
- Monitor embedding quality
- Check for API rate limits
- Verify database indexes
This project is licensed under the MIT License - see the LICENSE file for details.
- Pydantic AI: Apache 2.0
- Streamlit: Apache 2.0
- OpenAI API: Proprietary
- Supabase: Apache 2.0
- crawl4ai: MIT
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
- Follow PEP 8 style guide
- Add docstrings for new functions
- Include type hints
- Write unit tests for new features
- Update documentation as needed
For support, please:
- Check the existing documentation
- Search for similar issues
- Create a new issue with:
- Clear description
- Steps to reproduce
- Expected vs actual behavior
- Built with Pydantic AI
- Uses Streamlit for the web interface
- Powered by OpenAI models
- Database hosted on Supabase
- Crawling powered by crawl4ai
If you use this project in your research or work, please cite:
```bibtex
@software{deepseek_agentic_rag,
  title       = {DeepSeek Agentic RAG},
  author      = {Arav Patel},
  year        = {2024},
  description = {A hallucination-free AI system for DeepSeek API documentation},
  url         = {https://github.com/aravpatel19/deepseek-agentic-rag}
}
```