A Retrieval-Augmented Generation (RAG) system that leverages GPU acceleration for efficient document processing, embedding generation, and semantic search, with answers generated by local LLMs served through Ollama.
## Features

This system enables:

- 🔍 Processing and embedding of large PDF documents
- 🧠 Semantic search for relevant content using GPU-accelerated embeddings
- 💬 Human-like response generation via Ollama (e.g., `dolphin-phi`, `gemma3:1b`)
- ⚡ Efficient runtime by leveraging an NVIDIA GPU (VRAM-optimized)
## Project Structure

```
rag-system/
├── pdf/              # Place your PDF files here
├── database.py       # Script to build the vector database from PDFs
├── query.py          # Script to query the database and generate responses
├── embeddings.json   # Vector embeddings (generated automatically)
├── sentences.json    # Extracted sentences (generated automatically)
└── README.md         # Project overview and usage guide
```
## Requirements

- Python 3.8+
- NVIDIA GPU with CUDA support
- CUDA Toolkit 11.x or newer (the install command below uses a CUDA 12.6 PyTorch build)
## Installation

Install the required packages:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install sentence-transformers requests PyMuPDF numpy tqdm
```
- Download and install Ollama from ollama.ai
- Pull a local LLM model (e.g., Gemma):

  ```bash
  ollama pull gemma3:1b
  ```

- Start the Ollama server:

  ```bash
  ollama serve
  ```
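Once the server is up, `query.py` talks to it over Ollama's local HTTP API. A minimal sketch of that call — the endpoint and payload follow Ollama's documented `/api/generate` interface, but the helper names here are illustrative, not the script's actual API:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_payload(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the full answer."""
    resp = requests.post(OLLAMA_URL, json=build_payload(model, prompt), timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]
```

With the server running, `generate("gemma3:1b", prompt)` returns the model's reply as a single string.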
## Usage

- Place your PDF files in the `pdf/` directory
- Run the database script:

  ```bash
  python database.py
  ```

  This will:

  - Extract text from PDFs
  - Generate sentence embeddings using the GPU
  - Save `embeddings.json` and `sentences.json`
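The build step can be sketched roughly as follows. This is an illustrative outline, not the actual `database.py`: the model name (`all-MiniLM-L6-v2`), the sentence splitter, and the function names are assumptions.

```python
from __future__ import annotations

import json
import re
from pathlib import Path

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; tune the pattern for your documents."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if len(p.strip()) > 3]

def build_database(pdf_dir: str = "pdf", batch_size: int = 64) -> None:
    """Extract text from every PDF, embed it on the GPU, save both JSON files."""
    # Heavy dependencies imported lazily so the helper above stays importable.
    import fitz  # PyMuPDF
    import torch
    from sentence_transformers import SentenceTransformer

    sentences: list[str] = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        with fitz.open(pdf_path) as doc:
            text = " ".join(page.get_text() for page in doc)
        sentences.extend(split_sentences(text))

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
    vectors = model.encode(sentences, batch_size=batch_size, show_progress_bar=True)

    Path("sentences.json").write_text(json.dumps(sentences))
    Path("embeddings.json").write_text(json.dumps(vectors.tolist()))
```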
Run the query script:

```bash
python query.py
```

You can then:

- Type your query
- View the most relevant content retrieved
- Get a detailed, human-like answer generated by Ollama
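The retrieval half of the query flow can be sketched as below — an assumption-laden outline, not the real `query.py`. The embedding model must match whatever built the database (`all-MiniLM-L6-v2` is assumed here), and the function names are hypothetical.

```python
import json
from pathlib import Path

import torch

def top_k_indices(query_emb: torch.Tensor, embeddings: torch.Tensor, k: int = 5) -> list:
    """Rank stored sentences by cosine similarity to the query embedding."""
    sims = torch.nn.functional.cosine_similarity(embeddings, query_emb.unsqueeze(0), dim=1)
    return torch.topk(sims, min(k, sims.numel())).indices.tolist()

def search(query: str, k: int = 5) -> list:
    """Embed the query and return the k most similar stored sentences."""
    # Imported lazily so top_k_indices is usable without the model download.
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    sentences = json.loads(Path("sentences.json").read_text())
    embeddings = torch.tensor(json.loads(Path("embeddings.json").read_text()), device=device)
    model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
    query_emb = torch.tensor(model.encode(query), device=device)
    return [sentences[i] for i in top_k_indices(query_emb, embeddings, k)]
```

The retrieved sentences are then packed into the prompt sent to Ollama.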
## Performance

- Embedding generation and similarity search are GPU-accelerated
- Embeddings are stored in GPU VRAM to maximize speed
- Batch processing is used for better GPU utilization
- Memory-friendly design for large document sets
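The VRAM point above suggests a simple design: move the embedding matrix to the GPU once and L2-normalize it up front, so each query reduces to a single matrix–vector product. A sketch under that assumption (helper names are illustrative, and the code falls back to CPU when no GPU is present):

```python
import torch

def as_vram_matrix(embeddings: list) -> torch.Tensor:
    """Load embeddings onto the GPU once (CPU fallback) and L2-normalize
    rows so cosine similarity becomes a plain dot product."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    mat = torch.tensor(embeddings, dtype=torch.float32, device=device)
    return torch.nn.functional.normalize(mat, dim=1)

def cosine_scores(mat: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """One matrix-vector product per query; the corpus is never re-normalized."""
    q = torch.nn.functional.normalize(query.to(mat.device), dim=0)
    return mat @ q
```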
## Files

- `database.py`: Builds the vector database with GPU-accelerated embeddings
- `query.py`: Accepts user queries, performs semantic search, and generates answers using Ollama
- `embeddings.json`: Precomputed sentence embeddings
- `sentences.json`: Original text extracted from the PDF documents
## Troubleshooting

- Check GPU status: `nvidia-smi`
- Validate PyTorch CUDA support:

  ```bash
  python -c "import torch; print(torch.cuda.is_available())"
  ```

- Ensure the Ollama server is running
- Default API port: `11434`
- Check available models: `ollama list`
- For large PDFs, reduce the batch size in `database.py`
- Consider chunking content manually if you face memory limits
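Manual chunking can be as simple as greedily packing consecutive sentences into bounded-size chunks before embedding. A hypothetical helper (not part of the scripts above):

```python
def chunk_text(sentences: list, max_chars: int = 1000) -> list:
    """Greedily pack consecutive sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Embedding chunks instead of single sentences reduces the number of vectors held in VRAM, at the cost of coarser retrieval granularity.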
Built with 💡 curiosity and 🔥 passion for local AI!