RAG v2: Advanced Retrieval-Augmented Generation System

A comprehensive Retrieval-Augmented Generation (RAG) system that leverages large language models and vector databases to provide contextually relevant answers to user queries based on your documents. This system implements a parent-child chunking strategy for improved information retrieval and utilizes CUDA acceleration for optimal performance.

Table of Contents

  • Features
  • Architecture
  • Installation
  • Usage
  • Project Structure
  • Models Used
  • Configuration
  • Requirements
  • How It Works
  • Troubleshooting
  • Contributing
  • Acknowledgments

Features

  • Advanced Document Ingestion: Supports multiple document formats (PDF, TXT, MD)
  • Hierarchical Chunking: Parent-child chunking strategy for better context retrieval
  • Vector Database: FAISS-powered similarity search for efficient retrieval
  • GPU Acceleration: CUDA support for embedding generation and LLM inference
  • Interactive Query Interface: Real-time querying with context-aware responses
  • Persistent Storage: Vector index and document storage for fast re-use
  • Memory Optimized: Automatic GPU memory management and model unloading

Architecture

The system follows a three-phase approach:

  1. Ingestion Phase: Documents are processed, chunked hierarchically, embedded, and stored in a FAISS vector database
  2. Retrieval Phase: Queries are embedded and matched against stored vectors to find relevant context
  3. Generation Phase: Retrieved context is provided to the LLM to generate accurate responses

Parent-Child Chunking Strategy

  • Parent Chunks: Larger text segments (default: 2000 chars) for providing comprehensive context
  • Child Chunks: Smaller segments (default: 250 chars) for precise information retrieval
  • Chunk Overlap: Overlapping windows preserve context continuity across chunk boundaries (250 chars for parent chunks, 75 chars for child chunks)
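A minimal sketch of the splitting logic, assuming plain character-based windows with the default sizes above (the function names are illustrative, not the exact code in rag_ingest.py):

    # Hypothetical sketch of parent-child chunking; names and details are illustrative.
    PARENT_CHUNK_SIZE, PARENT_OVERLAP = 2000, 250
    CHILD_CHUNK_SIZE, CHILD_OVERLAP = 250, 75

    def split_text(text, size, overlap):
        """Slide a fixed-size window over the text, stepping by size - overlap."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def build_hierarchy(document):
        """Return (parent_chunks, child_records); each child remembers its parent."""
        parents = split_text(document, PARENT_CHUNK_SIZE, PARENT_OVERLAP)
        children = []
        for parent_id, parent in enumerate(parents):
            for piece in split_text(parent, CHILD_CHUNK_SIZE, CHILD_OVERLAP):
                children.append({"parent_id": parent_id, "text": piece})
        return parents, children

At query time, matches are found against the small child chunks, but the corresponding parent chunk is what gets handed to the LLM as context.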

Installation

Prerequisites

  • Python 3.8+
  • CUDA-enabled GPU (for GPU acceleration)
  • Visual Studio Build Tools (for Windows compilation)
  • CMake (version 3.12 or higher)

Step-by-Step Setup

  1. Create a virtual environment:

    python -m venv rag
  2. Activate the virtual environment (PowerShell shown; on Linux/macOS use source rag/bin/activate):

    .\rag\Scripts\activate
  3. Set CUDA environment variables (Windows):

    $env:CMAKE_ARGS="-DGGML_CUDA=on"
    $env:CUDACXX="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcc.exe"
    $env:FORCE_CMAKE="1"
  4. Install PyTorch with CUDA support:

    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
  5. Install llama-cpp-python with server support:

    pip install "llama-cpp-python[server]" --upgrade --no-cache-dir
  6. Install project dependencies:

    pip install -r requirements.txt
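After completing these steps, you can optionally confirm that the CUDA build of PyTorch is active:

    python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

This should print True along with the CUDA version; if it prints False, revisit the CUDA prerequisites and environment variables above.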

Model Setup

  1. Create a models directory in the project root
  2. Download the required models:
    • Embedding model: Qwen3-Embedding-4B-Q4_K_M.gguf in models/embedding/
    • LLM model: granite4.0_350M_16bit.gguf in models/
  3. Ensure models are placed in the correct subdirectories as specified in the configuration

Usage

The RAG system is divided into three main components:

1. Document Ingestion (rag_ingest.py)

Processes documents and creates the vector database:

python rag_ingest.py

This script will:

  • Load documents from the docs/ directory
  • Process and chunk documents using parent-child strategy
  • Generate embeddings using the embedding model
  • Create and save the FAISS vector index
  • Store document chunks for later retrieval
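In outline, the ingestion step looks roughly like the sketch below, assuming the llama-cpp-python embedding API and a flat FAISS index; the actual script may differ in structure and parameters:

    # Illustrative outline of ingestion; not the exact code in rag_ingest.py.
    import pickle
    import numpy as np
    import faiss
    from llama_cpp import Llama

    # `parents` and `children` come from the parent-child chunking step;
    # tiny stand-ins are shown here so the sketch is self-contained.
    parents = ["...a 2000-character parent chunk..."]
    children = [{"parent_id": 0, "text": "...a 250-character child chunk..."}]

    embedder = Llama(model_path="models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf",
                     embedding=True, verbose=False)

    def embed(text):
        return embedder.create_embedding(text)["data"][0]["embedding"]

    vectors = np.array([embed(c["text"]) for c in children], dtype="float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)

    faiss.write_index(index, "data/vector_index.faiss")
    with open("data/doc_store.pkl", "wb") as f:
        pickle.dump(parents, f)
    with open("data/child_nodes.pkl", "wb") as f:
        pickle.dump(children, f)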

2. Complete RAG Pipeline (rag_complete.py)

Alternative ingestion script that combines loading and ingestion in one step:

python rag_complete.py

This script handles both document loading and the complete ingestion pipeline in a single execution.

3. Interactive Inference (rag_inference.py)

Query the system interactively:

python rag_inference.py

This starts an interactive session where you can:

  • Enter queries to search the document database
  • Receive contextually relevant responses from the LLM
  • Type exit or quit to end the session
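Under the hood, each query goes through roughly the cycle sketched below (simplified; the prompt format and parameters used in rag_inference.py may differ):

    # Simplified sketch of one query/answer cycle; not the exact rag_inference.py code.
    import pickle
    import numpy as np
    import faiss
    from llama_cpp import Llama

    index = faiss.read_index("data/vector_index.faiss")
    with open("data/doc_store.pkl", "rb") as f:
        parents = pickle.load(f)
    with open("data/child_nodes.pkl", "rb") as f:
        children = pickle.load(f)

    embedder = Llama(model_path="models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf",
                     embedding=True, verbose=False)
    llm = Llama(model_path="models/granite4.0_350M_16bit.gguf", n_ctx=4096, verbose=False)

    query = "What is fine-tuning?"
    q_vec = np.array([embedder.create_embedding(query)["data"][0]["embedding"]],
                     dtype="float32")

    # Search the child chunks, then hand the LLM the larger parent chunks they belong to.
    _, hits = index.search(q_vec, 4)                      # TOP_K = 4
    context = "\n\n".join(parents[children[i]["parent_id"]] for i in hits[0])

    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"
    print(llm(prompt, max_tokens=256)["choices"][0]["text"])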

Project Structure

RAGv2/
├── README.md                 # This file
├── Installation.md          # Installation instructions
├── rag_complete.py          # Complete RAG pipeline (ingestion + inference)
├── rag_inference.py         # Inference-only script for querying
├── rag_ingest.py            # Document ingestion and vector database creation
├── requirements.txt         # Python dependencies
├── data/                    # Generated data files
│   ├── vector_index.faiss   # FAISS vector database
│   ├── doc_store.pkl        # Document store (parent chunks)
│   └── child_nodes.pkl      # Child chunks metadata
├── docs/                    # Input documents to be processed
│   └── The Ultimate Guide to Fine-Tuning LLMs.pdf
├── models/                  # Model files
│   ├── embedding/           # Embedding models
│   │   └── Qwen3-Embedding-4B-Q4_K_M.gguf
│   └── granite4.0_350M_16bit.gguf  # Main LLM model
└── rag/                     # Python virtual environment (created during setup)

Models Used

Embedding Model

  • Qwen3-Embedding-4B-Q4_K_M.gguf: Used for generating text embeddings for similarity search
  • Quantized for efficient processing while maintaining quality

Language Model

  • granite4.0_350M_16bit.gguf: Main LLM for generating responses based on retrieved context
  • Optimized for contextual question answering

Configuration

The system can be configured by modifying the constants in each script:

  • EMBEDDING_MODEL_PATH: Path to the embedding model
  • LLM_MODEL_PATH: Path to the language model
  • PARENT_CHUNK_SIZE: Size of parent text chunks (default: 2000)
  • CHILD_CHUNK_SIZE: Size of child text chunks (default: 250)
  • TOP_K: Number of relevant chunks to retrieve (default: 4)
  • DATA_DIR: Directory for storing vector database files
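For example, the relevant block near the top of a script might look like this, using the defaults documented above (the exact variable layout may differ between scripts):

    # Default values as documented in this README; adjust per script as needed.
    EMBEDDING_MODEL_PATH = "models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf"
    LLM_MODEL_PATH = "models/granite4.0_350M_16bit.gguf"
    PARENT_CHUNK_SIZE = 2000   # characters per parent chunk
    CHILD_CHUNK_SIZE = 250     # characters per child chunk
    TOP_K = 4                  # child chunks retrieved per query
    DATA_DIR = "data"          # where vector_index.faiss and the pickled stores live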

Requirements

The project requires the following Python packages:

llama-cpp-python==0.3.16
faiss-cpu
torch
numpy
fastapi
uvicorn
python-dotenv
pydantic
pydantic-settings
diskcache
pypdf

For optimal performance, especially with GPU acceleration:

  • CUDA-compatible GPU with at least 8GB VRAM
  • CUDA Toolkit (version 12.8 recommended)
  • Appropriate GPU drivers

How It Works

  1. Document Processing: Documents in the docs/ folder are loaded and split into hierarchical chunks
  2. Embedding Generation: Each chunk is converted to a numerical vector representation
  3. Index Creation: Vectors are stored in a FAISS index for fast similarity search
  4. Query Processing: User queries are embedded and compared against the vector database
  5. Context Retrieval: Most relevant document chunks are retrieved based on semantic similarity
  6. Response Generation: The LLM generates a response using the retrieved context as reference

Troubleshooting

Common Issues:

  1. CUDA Memory Errors: Reduce batch sizes or pass the low_vram=True flag in the model configuration
  2. Model Loading Failures: Verify model paths and ensure models are properly downloaded
  3. Embedding Generation Issues: Check CUDA environment variables and GPU availability
  4. File Processing Errors: Ensure document files are not corrupted and have proper permissions

Performance Tips:

  • Use GPU acceleration for embedding generation and inference
  • Adjust chunk sizes based on document complexity and desired context length
  • Monitor GPU memory usage and adjust configurations accordingly
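With llama-cpp-python, the main memory levers are how many layers are offloaded to the GPU, the context window, and the batch size; a hedged example (exact values depend on your GPU and may differ from the settings in the scripts):

    # Example of trading GPU memory for speed; tune values for your hardware.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/granite4.0_350M_16bit.gguf",
        n_gpu_layers=20,   # offload only part of the model; -1 offloads all layers
        n_ctx=2048,        # smaller context window lowers VRAM use
        n_batch=256,       # smaller batches reduce peak memory during prompt processing
        verbose=False,
    )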

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments

  • Built with llama-cpp-python
  • Vector databases powered by FAISS
  • Inspired by the RAG architecture for enhanced language model capabilities
