GPU-Accelerated RAG System

A Retrieval-Augmented Generation (RAG) system that leverages GPU acceleration for efficient document processing, embedding generation, and semantic search, with answers generated by local LLMs served through Ollama.

🚀 Overview

This system enables:

  • 🔍 Processing and embedding large PDF documents

  • 🧠 Semantic search for relevant content using GPU-accelerated embeddings

  • 💬 Human-like response generation via Ollama (e.g., dolphin-phi, gemma3:1b)

  • ⚡ Efficient runtime by leveraging an NVIDIA GPU (VRAM-optimized)

📁 Directory Structure

rag-system/
├── pdf/                 # Place your PDF files here
├── database.py          # Script to build the vector database from PDFs
├── query.py             # Script to query the database and generate responses
├── embeddings.json      # Vector embeddings (generated automatically)
├── sentences.json       # Extracted sentences (generated automatically)
└── README.md            # Project overview and usage guide

📦 Installation

✅ Prerequisites

  • Python 3.8+

  • NVIDIA GPU with CUDA support

  • CUDA Toolkit 11.x or newer (the PyTorch install command below targets CUDA 12.6 wheels)

🧪 Dependencies

Install the required packages:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install sentence-transformers requests PyMuPDF numpy tqdm

🛠 Setting up Ollama

  1. Download and install Ollama from ollama.ai

  2. Pull a local LLM model (e.g., Gemma):

ollama pull gemma3:1b

  3. Start the Ollama server:

ollama serve
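
Before building the database, you can sanity-check that the server and model respond. A minimal sketch against Ollama's REST API on its default port 11434 (the model name is whichever one you pulled):

import requests

# Ask the local Ollama server for a single, non-streamed completion
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": "Say hello in one sentence.", "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])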

▶️ Usage

Step 1: Build the Vector Database

  1. Place your PDF files in the pdf/ directory

  2. Run the database script:

python database.py

This will:

  • Extract text from PDFs

  • Generate sentence embeddings using GPU

  • Save embeddings.json and sentences.json
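
In outline, that build step is: extract text with PyMuPDF, encode it on the GPU with sentence-transformers, and dump the results to JSON. A minimal sketch, assuming a MiniLM embedding model and naive sentence splitting (the real database.py may choose its model, chunking, and batching differently):

import json
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer

# Embedding model name is an assumption, not taken from database.py
model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

# Pull raw text out of one PDF; the real script would loop over pdf/
sentences = []
doc = fitz.open("pdf/example.pdf")
for page in doc:
    sentences.extend(s.strip() for s in page.get_text().split(".") if s.strip())

# Encode all sentences on the GPU in batches
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)

# Persist both artifacts for query.py
with open("sentences.json", "w") as f:
    json.dump(sentences, f)
with open("embeddings.json", "w") as f:
    json.dump(embeddings.tolist(), f)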

Step 2: Query the Database

Run the query script:

python query.py

You can then:

  • Type your query

  • View the most relevant content retrieved

  • Get a detailed, human-like answer generated by Ollama
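
Under the hood, the query step is: load the stored embeddings onto the GPU, encode the query, rank sentences by cosine similarity, and pass the top matches to Ollama as context. A rough sketch, assuming the same embedding model used at build time (prompt wording and top-k are illustrative):

import json
import requests
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")  # assumed model

# Load the artifacts produced by database.py straight into GPU VRAM
with open("sentences.json") as f:
    sentences = json.load(f)
with open("embeddings.json") as f:
    embeddings = torch.tensor(json.load(f), device="cuda")

query = input("Query: ")
query_emb = model.encode(query, convert_to_tensor=True, device="cuda")

# Cosine similarity on the GPU, keep the 5 best matches as context
scores = util.cos_sim(query_emb, embeddings)[0]
top = torch.topk(scores, k=min(5, len(sentences)))
context = "\n".join(sentences[i] for i in top.indices.tolist())

prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": prompt, "stream": False},
)
print(resp.json()["response"])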

⚙️ Performance Optimization

  • Embedding generation and similarity search are GPU-accelerated

  • Embeddings are stored in GPU VRAM to maximize speed

  • Batch processing is used for better GPU utilization

  • Memory-friendly design for large document sets
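
Concretely, "stored in GPU VRAM" means the whole embedding matrix lives as one CUDA tensor, so each query is scored with a single matrix multiply. A toy illustration with random data (real code would use the embeddings loaded from embeddings.json):

import torch

# Stand-in for the real data: 10,000 sentence embeddings of dimension 384
embeddings = torch.randn(10_000, 384, device="cuda")
query_emb = torch.randn(384, device="cuda")

# Normalize once so a dot product equals cosine similarity
embeddings = torch.nn.functional.normalize(embeddings, dim=1)
query_emb = torch.nn.functional.normalize(query_emb, dim=0)

# One matmul on the GPU scores every stored sentence at once
scores = embeddings @ query_emb             # shape (10000,)
top_k = torch.topk(scores, k=5).indices     # indices of the best matches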

📄 Files Description

  • database.py: Builds the vector database with GPU-accelerated embeddings

  • query.py: Accepts user queries, performs semantic search, and generates answers using Ollama

  • embeddings.json: Precomputed sentence embeddings

  • sentences.json: Original text from PDF documents

🛠 Troubleshooting

CUDA/GPU Issues

  • Check GPU status: nvidia-smi

  • Validate PyTorch CUDA support:

python -c "import torch; print(torch.cuda.is_available())"

Ollama Connection Issues

  • Ensure the Ollama server is running

  • Default API port: 11434

  • Check available models:

ollama list

Memory Issues

  • For large PDFs, reduce the batch size in database.py

  • Consider chunking content manually if you face memory limits
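
If manual chunking is needed, the usual pattern is fixed-size windows with a small overlap, encoded with a reduced batch_size so peak VRAM stays low. A rough sketch (chunk size, overlap, and the input file are illustrative, not taken from database.py):

from sentence_transformers import SentenceTransformer

def chunk(text, size=500, overlap=50):
    # Overlapping character windows keep each item small without losing context at boundaries
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")  # assumed model
text = open("extracted.txt").read()          # hypothetical pre-extracted text
# Smaller batch_size trades throughput for lower peak GPU memory
embeddings = model.encode(chunk(text), batch_size=8)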

Built with 💡 curiosity and 🔥 passion for local AI!
