A comprehensive Retrieval-Augmented Generation (RAG) system that leverages large language models and vector databases to provide contextually relevant answers to user queries based on your documents. This system implements a parent-child chunking strategy for improved information retrieval and utilizes CUDA acceleration for optimal performance.
## Table of Contents

- Features
- Architecture
- Installation
- Usage
- Project Structure
- Models Used
- Requirements
- Contributing
- License
## Features

- Advanced Document Ingestion: Supports multiple document formats (PDF, TXT, MD)
- Hierarchical Chunking: Parent-child chunking strategy for better context retrieval
- Vector Database: FAISS-powered similarity search for efficient retrieval
- GPU Acceleration: CUDA support for embedding generation and LLM inference
- Interactive Query Interface: Real-time querying with context-aware responses
- Persistent Storage: Vector index and document storage for fast re-use
- Memory Optimized: Automatic GPU memory management and model unloading
## Architecture

The system follows a three-phase approach:
- Ingestion Phase: Documents are processed, chunked hierarchically, embedded, and stored in a FAISS vector database
- Retrieval Phase: Queries are embedded and matched against stored vectors to find relevant context
- Generation Phase: Retrieved context is provided to the LLM to generate accurate responses
### Chunking Strategy

- Parent Chunks: larger text segments (default: 2000 characters) that provide comprehensive context
- Child Chunks: smaller segments (default: 250 characters) for precise information retrieval
- Overlap maintains context continuity across chunk boundaries (250 characters for parent chunks, 75 for child chunks); see the sketch below
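The exact implementation lives in the ingestion scripts; the following is only a minimal sketch of the parent-child splitting idea, using the default sizes listed above and illustrative function names that are not taken from the project code.

```python
# Minimal sketch of parent-child chunking (illustrative only; the real logic is in rag_ingest.py).
PARENT_CHUNK_SIZE, PARENT_OVERLAP = 2000, 250
CHILD_CHUNK_SIZE, CHILD_OVERLAP = 250, 75

def split(text, size, overlap):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_chunks(document):
    """Return (parent_chunks, child_records); each child record remembers its parent's index."""
    parents = split(document, PARENT_CHUNK_SIZE, PARENT_OVERLAP)
    children = [
        {"text": child, "parent_id": pid}
        for pid, parent in enumerate(parents)
        for child in split(parent, CHILD_CHUNK_SIZE, CHILD_OVERLAP)
    ]
    return parents, children
```

At query time, the small child chunks are what the similarity search matches against, while their parent chunks supply the broader context handed to the LLM.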
## Installation

### Prerequisites

- Python 3.8+
- CUDA-enabled GPU (for GPU acceleration)
- Visual Studio Build Tools (for Windows compilation)
- CMake (version 3.12 or higher)
### Setup

1. Create a virtual environment:

   ```bash
   python -m venv rag
   ```

2. Activate the virtual environment:

   ```bash
   rag/scripts/activate
   ```

3. Set the CUDA environment variables (Windows PowerShell):

   ```powershell
   $env:CMAKE_ARGS="-DGGML_CUDA=on"
   $env:CUDACXX="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcc.exe"
   $env:FORCE_CMAKE="1"
   ```

4. Install PyTorch with CUDA support (a quick way to verify this install is shown after these steps):

   ```bash
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
   ```

5. Install llama-cpp-python with server support:

   ```bash
   pip install llama-cpp-python[server] --upgrade --no-cache-dir
   ```

6. Install the project dependencies:

   ```bash
   pip install -r requirements.txt
   ```
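After these steps, it can be worth confirming that the CUDA build of PyTorch is actually active. This quick check is not part of the project scripts, just a sanity test:

```python
# Optional sanity check (not part of the project): confirm the CUDA build of PyTorch is active.
import torch

print(torch.version.cuda)             # expected: a CUDA version string such as "12.8"
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.get_device_name(0))  # expected: the name of your NVIDIA GPU
```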
### Model Setup

- Create a `models` directory in the project root
- Download the required models:
  - Embedding model: `Qwen3-Embedding-4B-Q4_K_M.gguf` in `models/embedding/`
  - LLM model: `granite4.0_350M_16bit.gguf` in `models/`
- Ensure the models are placed in the correct subdirectories as specified in the configuration
## Usage

The RAG system is divided into three main components:
### Document Ingestion (`rag_ingest.py`)

Processes documents and creates the vector database:

```bash
python rag_ingest.py
```

This script will:

- Load documents from the `docs/` directory
- Process and chunk the documents using the parent-child strategy
- Generate embeddings using the embedding model
- Create and save the FAISS vector index
- Store document chunks for later retrieval
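The actual logic lives in `rag_ingest.py`; the outline below is only an illustrative sketch of that flow. It reuses `build_chunks` from the chunking sketch above, uses the paths from the project structure, and assumes the embedding model returns one pooled vector per child chunk. Names and details are not taken from the script itself.

```python
# Simplified, illustrative sketch of the ingestion flow (not the project's actual code).
import pickle
import faiss
import numpy as np
from llama_cpp import Llama
from pypdf import PdfReader

EMBEDDING_MODEL_PATH = "models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf"

# Load the document text from the docs/ folder.
reader = PdfReader("docs/The Ultimate Guide to Fine-Tuning LLMs.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Parent-child split (build_chunks is defined in the chunking sketch above).
parents, children = build_chunks(text)

# Embed every child chunk (assumes the model yields one pooled vector per input).
embedder = Llama(model_path=EMBEDDING_MODEL_PATH, embedding=True, verbose=False)
vectors = np.array([embedder.embed(c["text"]) for c in children], dtype="float32")

# Store the vectors in a FAISS index and persist everything for the inference step.
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)
faiss.write_index(index, "data/vector_index.faiss")
with open("data/doc_store.pkl", "wb") as f:
    pickle.dump(parents, f)
with open("data/child_nodes.pkl", "wb") as f:
    pickle.dump(children, f)
```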
### Complete Pipeline (`rag_complete.py`)

An alternative script that combines loading and ingestion in one step:

```bash
python rag_complete.py
```

This script handles both document loading and the complete ingestion pipeline in a single execution.
### Inference (`rag_inference.py`)

Query the system interactively:

```bash
python rag_inference.py
```

This starts an interactive session where you can:

- Enter queries to search the document database
- Receive contextually relevant responses from the LLM
- Type `exit` or `quit` to end the session
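For orientation, an interactive retrieval-and-generation loop of the kind `rag_inference.py` implements can be sketched as follows. The prompt, parameter values, and variable names are illustrative assumptions, not code from the script.

```python
# Illustrative sketch of the interactive query loop; details differ from rag_inference.py.
import pickle
import faiss
import numpy as np
from llama_cpp import Llama

TOP_K = 4  # number of child chunks to retrieve (documented default)

# Load the persisted index and chunk stores produced by ingestion.
index = faiss.read_index("data/vector_index.faiss")
with open("data/doc_store.pkl", "rb") as f:
    parents = pickle.load(f)
with open("data/child_nodes.pkl", "rb") as f:
    children = pickle.load(f)

embedder = Llama(model_path="models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf",
                 embedding=True, verbose=False)
llm = Llama(model_path="models/granite4.0_350M_16bit.gguf",
            n_ctx=4096, n_gpu_layers=-1, verbose=False)

while True:
    query = input("Query (exit/quit to stop): ").strip()
    if query.lower() in {"exit", "quit"}:
        break
    # Embed the query and find the most similar child chunks.
    q = np.array([embedder.embed(query)], dtype="float32")
    _, ids = index.search(q, TOP_K)
    # Map child hits back to their parent chunks for fuller context.
    context = "\n\n".join(parents[children[i]["parent_id"]] for i in ids[0])
    prompt = (f"Answer the question using only the context.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    out = llm(prompt, max_tokens=512, stop=["Question:"])
    print(out["choices"][0]["text"].strip())
```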
## Project Structure

```
RAGv2/
├── README.md                          # This file
├── Installation.md                    # Installation instructions
├── rag_complete.py                    # Complete RAG pipeline (ingestion + inference)
├── rag_inference.py                   # Inference-only script for querying
├── rag_ingest.py                      # Document ingestion and vector database creation
├── requirements.txt                   # Python dependencies
├── data/                              # Generated data files
│   ├── vector_index.faiss             # FAISS vector database
│   ├── doc_store.pkl                  # Document store (parent chunks)
│   └── child_nodes.pkl                # Child chunks metadata
├── docs/                              # Input documents to be processed
│   └── The Ultimate Guide to Fine-Tuning LLMs.pdf
├── models/                            # Model files
│   ├── embedding/                     # Embedding models
│   │   └── Qwen3-Embedding-4B-Q4_K_M.gguf
│   └── granite4.0_350M_16bit.gguf     # Main LLM model
└── rag/                               # Additional RAG components
```
## Models Used

- `Qwen3-Embedding-4B-Q4_K_M.gguf`: used for generating text embeddings for similarity search
  - Quantized for efficient processing while maintaining quality
- `granite4.0_350M_16bit.gguf`: main LLM for generating responses based on the retrieved context
  - Optimized for contextual question answering
## Configuration

The system can be configured by modifying the constants in each script:

- `EMBEDDING_MODEL_PATH`: path to the embedding model
- `LLM_MODEL_PATH`: path to the language model
- `PARENT_CHUNK_SIZE`: size of parent text chunks (default: 2000)
- `CHILD_CHUNK_SIZE`: size of child text chunks (default: 250)
- `TOP_K`: number of relevant chunks to retrieve (default: 4)
- `DATA_DIR`: directory for storing the vector database files
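For illustration, such a block of constants could look like the following; the values shown are the defaults and paths documented in this README, and the exact assignments in the scripts may differ.

```python
# Example values, matching the defaults and paths documented in this README.
EMBEDDING_MODEL_PATH = "models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf"
LLM_MODEL_PATH = "models/granite4.0_350M_16bit.gguf"
PARENT_CHUNK_SIZE = 2000   # characters per parent chunk
CHILD_CHUNK_SIZE = 250     # characters per child chunk
TOP_K = 4                  # child chunks retrieved per query
DATA_DIR = "data"          # where the FAISS index and chunk stores are written
```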
## Requirements

The project requires the following Python packages:

```
llama-cpp-python==0.3.16
faiss-cpu
torch
numpy
fastapi
uvicorn
python-dotenv
pydantic
pydantic-settings
diskcache
pypdf
```
### Hardware Requirements

For optimal performance, especially with GPU acceleration:
- CUDA-compatible GPU with at least 8GB VRAM
- CUDA Toolkit (version 12.8 recommended)
- Appropriate GPU drivers
## How It Works

- Document Processing: documents in the `docs/` folder are loaded and split into hierarchical chunks
- Embedding Generation: each chunk is converted to a numerical vector representation
- Index Creation: vectors are stored in a FAISS index for fast similarity search
- Query Processing: user queries are embedded and compared against the vector database
- Context Retrieval: the most relevant document chunks are retrieved based on semantic similarity
- Response Generation: the LLM generates a response using the retrieved context as reference
## Troubleshooting

- CUDA Memory Errors: reduce batch sizes or set the `low_vram=True` flag in the model configuration; releasing finished models also helps (see the sketch below)
- Model Loading Failures: verify model paths and ensure the models are properly downloaded
- Embedding Generation Issues: check the CUDA environment variables and GPU availability
- File Processing Errors: ensure document files are not corrupted and have the proper permissions
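If CUDA out-of-memory errors appear when ingestion and inference run back to back, explicitly dropping a finished model and clearing the cache usually helps. This is a generic Python/PyTorch pattern, not code from the project:

```python
import gc
import torch

# Generic cleanup pattern (not project-specific): after you are done with a model,
# drop your own reference to it first (e.g. `del llm` or `llm = None`), then call this.
def release_gpu_memory():
    """Force garbage collection and clear PyTorch's cached CUDA memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```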
### Performance Tips

- Use GPU acceleration for embedding generation and inference
- Adjust chunk sizes based on document complexity and desired context length
- Monitor GPU memory usage and adjust configurations accordingly
## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## Acknowledgments

- Built with llama-cpp-python
- Vector databases powered by FAISS
- Inspired by the RAG architecture for enhanced language model capabilities