A comprehensive Retrieval-Augmented Generation (RAG) system that leverages large language models and vector databases to provide contextually relevant answers to user queries based on your documents. This system implements a parent-child chunking strategy for improved information retrieval and utilizes CUDA acceleration for optimal performance.
## Table of Contents

- Features
- Architecture
- Installation
- Usage
- Project Structure
- Models Used
- Requirements
- Contributing
- License
## Features

- Advanced Document Ingestion: Supports multiple document formats (PDF, TXT, MD)
- Hierarchical Chunking: Parent-child chunking strategy for better context retrieval
- Vector Database: FAISS-powered similarity search for efficient retrieval
- GPU Acceleration: CUDA support for embedding generation and LLM inference
- Interactive Query Interface: Real-time querying with context-aware responses
- Persistent Storage: Vector index and document storage for fast re-use
- Memory Optimized: Automatic GPU memory management and model unloading
## Architecture

The system follows a three-phase approach:
- Ingestion Phase: Documents are processed, chunked hierarchically, embedded, and stored in a FAISS vector database
- Retrieval Phase: Queries are embedded and matched against stored vectors to find relevant context
- Generation Phase: Retrieved context is provided to the LLM to generate accurate responses
### Chunking Strategy

- Parent Chunks: larger text segments (default: 2000 characters) that provide comprehensive context
- Child Chunks: smaller segments (default: 250 characters) for precise information retrieval
- Overlap maintains context continuity across chunk boundaries (250 characters for parent chunks, 75 for child chunks); see the sketch below
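The exact implementation lives in the ingestion scripts; the following is only a minimal sketch of the parent-child splitting idea, using the default sizes listed above and illustrative function names that are not taken from the project code.

```python
# Minimal sketch of parent-child chunking (illustrative only; the real logic is in rag_ingest.py).
PARENT_CHUNK_SIZE, PARENT_OVERLAP = 2000, 250
CHILD_CHUNK_SIZE, CHILD_OVERLAP = 250, 75

def split(text, size, overlap):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_chunks(document):
    """Return (parent_chunks, child_records); each child record remembers its parent's index."""
    parents = split(document, PARENT_CHUNK_SIZE, PARENT_OVERLAP)
    children = [
        {"text": child, "parent_id": pid}
        for pid, parent in enumerate(parents)
        for child in split(parent, CHILD_CHUNK_SIZE, CHILD_OVERLAP)
    ]
    return parents, children
```

At query time, the small child chunks are what the similarity search matches against, while their parent chunks supply the broader context handed to the LLM.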
## Installation

### Prerequisites

- Python 3.8+
- CUDA-enabled GPU (for GPU acceleration)
- Visual Studio Build Tools (for Windows compilation)
- CMake (version 3.12 or higher)
### Setup

1. Create a virtual environment:

   ```bash
   python -m venv rag
   ```

2. Activate the virtual environment:

   ```bash
   rag/scripts/activate
   ```

3. Set the CUDA environment variables (Windows PowerShell):

   ```powershell
   $env:CMAKE_ARGS="-DGGML_CUDA=on"
   $env:CUDACXX="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcc.exe"
   $env:FORCE_CMAKE="1"
   ```

4. Install PyTorch with CUDA support (a quick way to verify this install is shown after these steps):

   ```bash
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
   ```

5. Install llama-cpp-python with server support:

   ```bash
   pip install llama-cpp-python[server] --upgrade --no-cache-dir
   ```

6. Install the project dependencies:

   ```bash
   pip install -r requirements.txt
   ```
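After these steps, it can be worth confirming that the CUDA build of PyTorch is actually active. This quick check is not part of the project scripts, just a sanity test:

```python
# Optional sanity check (not part of the project): confirm the CUDA build of PyTorch is active.
import torch

print(torch.version.cuda)             # expected: a CUDA version string such as "12.8"
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.get_device_name(0))  # expected: the name of your NVIDIA GPU
```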
### Model Setup

- Create a `models` directory in the project root
- Download the required models:
  - Embedding model: `Qwen3-Embedding-4B-Q4_K_M.gguf` in `models/embedding/`
  - LLM model: `granite4.0_350M_16bit.gguf` in `models/`
- Ensure the models are placed in the correct subdirectories as specified in the configuration
## Usage

The RAG system is divided into three main components:
### Document Ingestion (`rag_ingest.py`)

Processes documents and creates the vector database:

```bash
python rag_ingest.py
```

This script will:

- Load documents from the `docs/` directory
- Process and chunk the documents using the parent-child strategy
- Generate embeddings using the embedding model
- Create and save the FAISS vector index
- Store document chunks for later retrieval
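The actual logic lives in `rag_ingest.py`; the outline below is only an illustrative sketch of that flow. It reuses `build_chunks` from the chunking sketch above, uses the paths from the project structure, and assumes the embedding model returns one pooled vector per child chunk. Names and details are not taken from the script itself.

```python
# Simplified, illustrative sketch of the ingestion flow (not the project's actual code).
import pickle
import faiss
import numpy as np
from llama_cpp import Llama
from pypdf import PdfReader

EMBEDDING_MODEL_PATH = "models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf"

# Load the document text from the docs/ folder.
reader = PdfReader("docs/The Ultimate Guide to Fine-Tuning LLMs.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Parent-child split (build_chunks is defined in the chunking sketch above).
parents, children = build_chunks(text)

# Embed every child chunk (assumes the model yields one pooled vector per input).
embedder = Llama(model_path=EMBEDDING_MODEL_PATH, embedding=True, verbose=False)
vectors = np.array([embedder.embed(c["text"]) for c in children], dtype="float32")

# Store the vectors in a FAISS index and persist everything for the inference step.
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)
faiss.write_index(index, "data/vector_index.faiss")
with open("data/doc_store.pkl", "wb") as f:
    pickle.dump(parents, f)
with open("data/child_nodes.pkl", "wb") as f:
    pickle.dump(children, f)
```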
### Complete Pipeline (`rag_complete.py`)

An alternative script that combines loading and ingestion in one step:

```bash
python rag_complete.py
```

This script handles both document loading and the complete ingestion pipeline in a single execution.
### Inference (`rag_inference.py`)

Query the system interactively:

```bash
python rag_inference.py
```

This starts an interactive session where you can:

- Enter queries to search the document database
- Receive contextually relevant responses from the LLM
- Type `exit` or `quit` to end the session
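For orientation, an interactive retrieval-and-generation loop of the kind `rag_inference.py` implements can be sketched as follows. The prompt, parameter values, and variable names are illustrative assumptions, not code from the script.

```python
# Illustrative sketch of the interactive query loop; details differ from rag_inference.py.
import pickle
import faiss
import numpy as np
from llama_cpp import Llama

TOP_K = 4  # number of child chunks to retrieve (documented default)

# Load the persisted index and chunk stores produced by ingestion.
index = faiss.read_index("data/vector_index.faiss")
with open("data/doc_store.pkl", "rb") as f:
    parents = pickle.load(f)
with open("data/child_nodes.pkl", "rb") as f:
    children = pickle.load(f)

embedder = Llama(model_path="models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf",
                 embedding=True, verbose=False)
llm = Llama(model_path="models/granite4.0_350M_16bit.gguf",
            n_ctx=4096, n_gpu_layers=-1, verbose=False)

while True:
    query = input("Query (exit/quit to stop): ").strip()
    if query.lower() in {"exit", "quit"}:
        break
    # Embed the query and find the most similar child chunks.
    q = np.array([embedder.embed(query)], dtype="float32")
    _, ids = index.search(q, TOP_K)
    # Map child hits back to their parent chunks for fuller context.
    context = "\n\n".join(parents[children[i]["parent_id"]] for i in ids[0])
    prompt = (f"Answer the question using only the context.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    out = llm(prompt, max_tokens=512, stop=["Question:"])
    print(out["choices"][0]["text"].strip())
```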
## Project Structure

```
RAGv2/
├── README.md                          # This file
├── Installation.md                    # Installation instructions
├── rag_complete.py                    # Complete RAG pipeline (ingestion + inference)
├── rag_inference.py                   # Inference-only script for querying
├── rag_ingest.py                      # Document ingestion and vector database creation
├── requirements.txt                   # Python dependencies
├── data/                              # Generated data files
│   ├── vector_index.faiss             # FAISS vector database
│   ├── doc_store.pkl                  # Document store (parent chunks)
│   └── child_nodes.pkl                # Child chunks metadata
├── docs/                              # Input documents to be processed
│   └── The Ultimate Guide to Fine-Tuning LLMs.pdf
├── models/                            # Model files
│   ├── embedding/                     # Embedding models
│   │   └── Qwen3-Embedding-4B-Q4_K_M.gguf
│   └── granite4.0_350M_16bit.gguf     # Main LLM model
└── rag/                               # Additional RAG components
```
## Models Used

- `Qwen3-Embedding-4B-Q4_K_M.gguf`: used for generating text embeddings for similarity search
  - Quantized for efficient processing while maintaining quality
- `granite4.0_350M_16bit.gguf`: main LLM for generating responses based on the retrieved context
  - Optimized for contextual question answering
## Configuration

The system can be configured by modifying the constants in each script:

- `EMBEDDING_MODEL_PATH`: path to the embedding model
- `LLM_MODEL_PATH`: path to the language model
- `PARENT_CHUNK_SIZE`: size of parent text chunks (default: 2000)
- `CHILD_CHUNK_SIZE`: size of child text chunks (default: 250)
- `TOP_K`: number of relevant chunks to retrieve (default: 4)
- `DATA_DIR`: directory for storing the vector database files
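For illustration, such a block of constants could look like the following; the values shown are the defaults and paths documented in this README, and the exact assignments in the scripts may differ.

```python
# Example values, matching the defaults and paths documented in this README.
EMBEDDING_MODEL_PATH = "models/embedding/Qwen3-Embedding-4B-Q4_K_M.gguf"
LLM_MODEL_PATH = "models/granite4.0_350M_16bit.gguf"
PARENT_CHUNK_SIZE = 2000   # characters per parent chunk
CHILD_CHUNK_SIZE = 250     # characters per child chunk
TOP_K = 4                  # child chunks retrieved per query
DATA_DIR = "data"          # where the FAISS index and chunk stores are written
```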
## Requirements

The project requires the following Python packages:

```
llama-cpp-python==0.3.16
faiss-cpu
torch
numpy
fastapi
uvicorn
python-dotenv
pydantic
pydantic-settings
diskcache
pypdf
```
### Hardware Requirements

For optimal performance, especially with GPU acceleration:
- CUDA-compatible GPU with at least 8GB VRAM
- CUDA Toolkit (version 12.8 recommended)
- Appropriate GPU drivers
## How It Works

- Document Processing: documents in the `docs/` folder are loaded and split into hierarchical chunks
- Embedding Generation: each chunk is converted to a numerical vector representation
- Index Creation: vectors are stored in a FAISS index for fast similarity search
- Query Processing: user queries are embedded and compared against the vector database
- Context Retrieval: the most relevant document chunks are retrieved based on semantic similarity
- Response Generation: the LLM generates a response using the retrieved context as reference
## Troubleshooting

- CUDA Memory Errors: reduce batch sizes or set the `low_vram=True` flag in the model configuration; releasing finished models also helps (see the sketch below)
- Model Loading Failures: verify model paths and ensure the models are properly downloaded
- Embedding Generation Issues: check the CUDA environment variables and GPU availability
- File Processing Errors: ensure document files are not corrupted and have the proper permissions
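If CUDA out-of-memory errors appear when ingestion and inference run back to back, explicitly dropping a finished model and clearing the cache usually helps. This is a generic Python/PyTorch pattern, not code from the project:

```python
import gc
import torch

# Generic cleanup pattern (not project-specific): after you are done with a model,
# drop your own reference to it first (e.g. `del llm` or `llm = None`), then call this.
def release_gpu_memory():
    """Force garbage collection and clear PyTorch's cached CUDA memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```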
### Performance Tips

- Use GPU acceleration for embedding generation and inference
- Adjust chunk sizes based on document complexity and desired context length
- Monitor GPU memory usage and adjust configurations accordingly
## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## Acknowledgments

- Built with llama-cpp-python
- Vector databases powered by FAISS
- Inspired by the RAG architecture for enhanced language model capabilities