A hallucination-free AI system that automates extraction, validation, and natural language querying of DeepSeek's API documentation. By combining parallel web crawling (crawl4ai), vector storage (PostgreSQL with pgvector), and Pydantic AI's agentic capabilities, this system enables developers to interact with technical documentation conversationally while enforcing strict schema compliance to eliminate inaccuracies.
This project demonstrates the power of Agentic RAG using Pydantic AI, creating a reliable documentation assistant that:
- Automatically crawls and validates DeepSeek's API documentation using parallel processing
- Enforces strict schema compliance during information extraction and storage
- Provides a hallucination-free chat interface powered by GPT-4o-mini
- Uses autonomous reasoning with built-in accuracy constraints for API endpoint descriptions
Agentic RAG extends traditional RAG systems by incorporating autonomous decision-making capabilities and schema validation:
- Traditional RAG simply retrieves relevant documents and generates responses, risking hallucinations
- Agentic RAG using Pydantic AI can:
- Enforce strict schema compliance for API documentation
- Validate information against known patterns and structures
- Autonomously decide what information to retrieve
- Chain multiple retrievals for complex technical queries
- Reason about the relevance and accuracy of retrieved information
- Dynamically adjust search strategies while maintaining accuracy
- Maintain conversation context across multiple turns
The system consists of several key components:
- Crawls DeepSeek's documentation using `crawl4ai`
- Processes documentation into chunks
- Generates embeddings using OpenAI's embedding model
- Stores processed chunks in Supabase
- PostgreSQL with pgvector extension
- Stores documentation chunks with:
- Text content
- Embeddings
- Metadata
- URLs and titles
- Provides similarity search functionality
- Implements Agentic RAG logic using the Pydantic AI framework
- Defines agent tools and behaviors for:
- Intelligent documentation retrieval
- Context-aware page listing
- Dynamic content fetching
- Autonomous reasoning about user queries
- Uses OpenAI's models for embeddings and responses
- Maintains conversation state and context
- Streamlit-based chat interface
- Real-time streaming responses
- Message history management
- Clean and intuitive UI
- Python 3.8+
- PostgreSQL with pgvector extension
- Supabase account
- OpenAI API key
Create a `.env` file with:

```
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
LLM_MODEL=gpt-4o-mini  # or your preferred OpenAI model
```
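One way the scripts might read these settings at startup (a sketch, not the project's actual loader; the fail-fast check and the `gpt-4o-mini` default are assumptions that mirror the `.env` example above):

```python
import os

def load_settings() -> dict:
    """Read required settings from the environment, failing fast if any are missing."""
    required = ["OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_KEY"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        **{name: os.environ[name] for name in required},
        # LLM_MODEL is optional; the default mirrors the example .env above
        "LLM_MODEL": os.getenv("LLM_MODEL", "gpt-4o-mini"),
    }
```

Failing fast here keeps a missing key from surfacing later as an opaque API error mid-crawl.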
- Clone the repository:

```bash
gh repo clone aravpatel19/deepseek-agentic-rag
cd deepseek-agentic-rag
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up the database:
- Create a Supabase project
  - Run the SQL commands from `deepseek_pages.sql`
- Crawl and process documentation:

```bash
python crawl_deepseek_docs.py
```

- Start the Streamlit interface:

```bash
streamlit run streamlit_deepseek.py
```

- Access the web interface at `http://localhost:8501`
- Documentation Processing:
- The crawler fetches documentation from DeepSeek's sitemap
- Content is split into manageable chunks
- Each chunk gets a title, summary, and embedding vector
- Chunks are stored in Supabase with metadata
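The record built for each chunk might look like the following sketch (field names mirror the storage description above; the table name, example values, and the exact Supabase insert call are assumptions):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DocChunk:
    """One processed documentation chunk, as stored in the database."""
    url: str
    chunk_number: int
    title: str
    summary: str
    content: str
    embedding: list[float]          # 1536-dim vector from text-embedding-3-small
    metadata: dict = field(default_factory=dict)

chunk = DocChunk(
    url="https://api-docs.deepseek.com/",
    chunk_number=0,
    title="DeepSeek API Overview",
    summary="Introduction to the DeepSeek API.",
    content="...",
    embedding=[0.0] * 1536,
    metadata={"source": "deepseek_docs"},
)
row = asdict(chunk)  # dict ready for e.g. supabase.table("deepseek_pages").insert(row)
```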
- Agentic Query Processing:
- User questions are analyzed by the Pydantic AI agent
- The agent autonomously decides on the retrieval strategy
- Questions are converted to embeddings
- Similar documentation chunks are retrieved
- The agent reasons about the relevance of retrieved information
- The LLM generates accurate answers based on the agent's analysis
- Responses are streamed in real-time
- Vector Search:
- Uses cosine similarity to find relevant documentation
- Supports filtering by metadata
- Returns top matches for each query
- Agent can dynamically adjust search parameters based on context
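For illustration, cosine similarity reduces to a few lines of plain Python; the deployed system delegates this computation to pgvector inside the database:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # 0.0 guards against zero vectors

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # → 0.0 (orthogonal)
```

The zero-norm guard matters here because the pipeline can store zero vectors as an embedding-failure fallback.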
```
.
├── README.md
├── crawl_deepseek_docs.py   # Documentation crawler
├── deepseek_agent.py        # RAG agent implementation
├── deepseek_pages.sql       # Database schema
├── streamlit_deepseek.py    # Web interface
└── .env                     # Environment variables
```
Key libraries and frameworks used:
- `pydantic-ai`: Core framework for implementing the agentic RAG system
- `crawl4ai`: Parallel web crawling with semantic filtering capabilities
- `openai`: API access for embeddings (text-embedding-3-small) and LLM (GPT-4o-mini)
- `supabase`: Vector database with pgvector for similarity search
- `streamlit`: Web interface with real-time streaming
- `logfire`: Optional logging configuration
- `asyncio`: Asynchronous operations for improved performance
- `httpx`: Modern HTTP client for async operations
- Uses OpenAI's `text-embedding-3-small` model
- 1536-dimensional embedding vectors
- Fallback to zero vector on embedding errors
- Cosine similarity for vector search
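The zero-vector fallback can be sketched as a thin wrapper around the embedding call (`embed_fn` and `broken_embedder` are illustrative stand-ins, not names from the codebase):

```python
ZERO_VECTOR = [0.0] * 1536  # dimensionality of text-embedding-3-small

def safe_embedding(text: str, embed_fn) -> list[float]:
    """Return an embedding, degrading to a zero vector rather than crashing the pipeline."""
    try:
        return embed_fn(text)
    except Exception as exc:  # broad catch: any API error triggers the fallback
        print(f"Embedding failed ({exc}); storing zero vector")
        return ZERO_VECTOR

def broken_embedder(text: str) -> list[float]:
    raise RuntimeError("rate limited")  # simulate an API failure

vec = safe_embedding("hello", broken_embedder)
```

A zero vector keeps the row insertable and simply scores near the bottom of any cosine-similarity ranking.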
- Intelligent text chunking with respect to:
- Code block boundaries (triple backticks)
- Paragraph breaks (\n\n)
- Sentence boundaries (. )
- Default chunk size: 5000 characters
- Minimum chunk threshold for quality control
- Preserves code block integrity
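A boundary-aware splitter along these lines (a simplified sketch of the strategy described above, not the project's exact implementation; the `min_size` default is an assumption):

```python
def chunk_text(text: str, chunk_size: int = 5000, min_size: int = 100) -> list[str]:
    """Split text near chunk_size, preferring code-fence, paragraph, then
    sentence boundaries so code blocks stay intact."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Try the latest natural boundary inside the window, best first
            for sep in ("```", "\n\n", ". "):
                cut = window.rfind(sep)
                if cut > min_size:  # avoid degenerate tiny chunks
                    # Break before a code fence; break after other separators
                    end = start + cut + (0 if sep == "```" else len(sep))
                    break
        piece = text[start:end].strip()
        if len(piece) >= min_size:  # minimum-chunk quality threshold
            chunks.append(piece)
        start = end
    return chunks
```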
- PostgreSQL with pgvector extension
- Optimized indexes for vector similarity search
- JSON metadata for flexible filtering
- Unique constraints on URL and chunk number
- Row-level security enabled for Supabase integration
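A schema along these lines would satisfy the points above (the authoritative definitions live in `deepseek_pages.sql`; the table and column names here are assumptions):

```sql
create extension if not exists vector;

create table deepseek_pages (
    id           bigserial primary key,
    url          text not null,
    chunk_number integer not null,
    title        text,
    summary      text,
    content      text not null,
    metadata     jsonb default '{}'::jsonb,
    embedding    vector(1536),          -- text-embedding-3-small dimensionality
    unique (url, chunk_number)          -- one row per chunk of a page
);

-- IVFFlat index for approximate cosine-similarity search
create index on deepseek_pages using ivfflat (embedding vector_cosine_ops);
-- GIN index for filtering on JSON metadata
create index on deepseek_pages using gin (metadata);

alter table deepseek_pages enable row level security;
```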
- Graceful degradation for embedding failures
- Comprehensive exception handling in crawler
- Retry mechanism for agent operations (2 retries)
- Detailed error logging and user feedback
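The two-retry behavior can be sketched as a small decorator (an illustration of the mechanism, not the project's code; Pydantic AI agents configure retries declaratively rather than through a wrapper like this):

```python
import functools
import time

def with_retries(retries: int = 2, delay: float = 0.0):
    """Retry a flaky operation up to `retries` extra times before giving up."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise          # out of retries: surface the error
                    time.sleep(delay)  # back off before trying again
        return wrapper
    return decorator

calls = {"n": 0}

@with_retries(retries=2)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = flaky()  # fails twice, succeeds on the third attempt
```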
- Environment Variables
  - All sensitive credentials stored in `.env`
  - API keys never exposed in the frontend
  - Supabase RLS (Row Level Security) enabled
- Database Access
- Read-only public access to documentation
- Protected write operations
- Metadata filtering for security boundaries
- Parallel Processing
- Concurrent document crawling
- Parallel chunk processing
- Asynchronous database operations
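The concurrency pattern can be sketched with `asyncio.gather` plus a semaphore cap (the URLs, the concurrency limit, and the `asyncio.sleep(0)` stand-in for a real HTTP request are all illustrative):

```python
import asyncio

async def crawl_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    """Fetch many pages concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def fetch(url: str) -> str:
        async with sem:                # at most max_concurrency requests in flight
            await asyncio.sleep(0)     # stand-in for a real async HTTP request
            return f"<html from {url}>"

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(crawl_all([f"https://example.com/{i}" for i in range(8)]))
```

The semaphore keeps the crawler polite under rate limits while still overlapping network waits.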
- Database Optimization
- IVFFlat index for vector search
- GIN index for metadata queries
- Optimized chunk size for retrieval
- Create a new Supabase project
- Enable the pgvector extension
- Run the schema from `deepseek_pages.sql`
- Set up row-level security policies
- Set up environment variables
- Install dependencies
- Initialize the database
- Run the crawler
- Start the Streamlit server
- Logging
- Crawler progress and errors
- Embedding generation status
- Database operation results
- Agent interaction logs
- Regular Tasks
- Update documentation chunks
- Monitor embedding quality
- Check for API rate limits
- Verify database indexes
This project is licensed under the MIT License - see the LICENSE file for details.
- Pydantic AI: Apache 2.0
- Streamlit: Apache 2.0
- OpenAI API: Proprietary
- Supabase: Apache 2.0
- crawl4ai: MIT
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
- Follow PEP 8 style guide
- Add docstrings for new functions
- Include type hints
- Write unit tests for new features
- Update documentation as needed
For support, please:
- Check the existing documentation
- Search for similar issues
- Create a new issue with:
- Clear description
- Steps to reproduce
- Expected vs actual behavior
- Built with Pydantic AI
- Uses Streamlit for the web interface
- Powered by OpenAI models
- Database hosted on Supabase
- Crawling powered by crawl4ai
If you use this project in your research or work, please cite:
```bibtex
@software{deepseek_agentic_rag,
  title       = {DeepSeek Agentic RAG},
  author      = {Arav Patel},
  year        = {2024},
  description = {A hallucination-free AI system for DeepSeek API documentation},
  url         = {https://github.com/aravpatel19/deepseek-agentic-rag}
}
```