A Retrieval-Augmented Generation (RAG) system that leverages GPU acceleration for efficient document processing, embedding generation, and semantic search, with answers generated by local LLMs served through Ollama.
## Features

This system enables:

- 🔍 Processing and embedding of large PDF documents
- 🧠 Semantic search for relevant content using GPU-accelerated embeddings
- 💬 Human-like response generation via Ollama (e.g., `dolphin-phi`, `gemma3:1b`)
- ⚡ Efficient runtime by leveraging an NVIDIA GPU (VRAM-optimized)
## Project Structure

```
rag-system/
├── pdf/              # Place your PDF files here
├── database.py       # Script to build the vector database from PDFs
├── query.py          # Script to query the database and generate responses
├── embeddings.json   # Vector embeddings (generated automatically)
├── sentences.json    # Extracted sentences (generated automatically)
└── README.md         # Project overview and usage guide
```
## Requirements

- Python 3.8+
- NVIDIA GPU with CUDA support
- CUDA Toolkit 11.x or newer (the install command below uses a CUDA 12.6 PyTorch build)
## Installation

Install the required packages:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install sentence-transformers requests PyMuPDF numpy tqdm
```
- Download and install Ollama from ollama.ai
- Pull a local LLM model (e.g., Gemma):

  ```bash
  ollama pull gemma3:1b
  ```

- Start the Ollama server:

  ```bash
  ollama serve
  ```
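Once the server is up, `query.py` talks to it over Ollama's local HTTP API. A minimal sketch of that call — the endpoint and payload follow Ollama's documented `/api/generate` interface, but the helper names here are illustrative, not the script's actual API:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_payload(model: str, prompt: str) -> dict:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the full answer."""
    resp = requests.post(OLLAMA_URL, json=build_payload(model, prompt), timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]
```

With the server running, `generate("gemma3:1b", prompt)` returns the model's reply as a single string.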
## Usage

- Place your PDF files in the `pdf/` directory
- Run the database script:

  ```bash
  python database.py
  ```

  This will:

  - Extract text from PDFs
  - Generate sentence embeddings using the GPU
  - Save `embeddings.json` and `sentences.json`
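The build step can be sketched roughly as follows. This is an illustrative outline, not the actual `database.py`: the model name (`all-MiniLM-L6-v2`), the sentence splitter, and the function names are assumptions.

```python
from __future__ import annotations

import json
import re
from pathlib import Path

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; tune the pattern for your documents."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if len(p.strip()) > 3]

def build_database(pdf_dir: str = "pdf", batch_size: int = 64) -> None:
    """Extract text from every PDF, embed it on the GPU, save both JSON files."""
    # Heavy dependencies imported lazily so the helper above stays importable.
    import fitz  # PyMuPDF
    import torch
    from sentence_transformers import SentenceTransformer

    sentences: list[str] = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        with fitz.open(pdf_path) as doc:
            text = " ".join(page.get_text() for page in doc)
        sentences.extend(split_sentences(text))

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
    vectors = model.encode(sentences, batch_size=batch_size, show_progress_bar=True)

    Path("sentences.json").write_text(json.dumps(sentences))
    Path("embeddings.json").write_text(json.dumps(vectors.tolist()))
```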
Run the query script:

```bash
python query.py
```

You can then:

- Type your query
- View the most relevant content retrieved
- Get a detailed, human-like answer generated by Ollama
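The retrieval half of the query flow can be sketched as below — an assumption-laden outline, not the real `query.py`. The embedding model must match whatever built the database (`all-MiniLM-L6-v2` is assumed here), and the function names are hypothetical.

```python
import json
from pathlib import Path

import torch

def top_k_indices(query_emb: torch.Tensor, embeddings: torch.Tensor, k: int = 5) -> list:
    """Rank stored sentences by cosine similarity to the query embedding."""
    sims = torch.nn.functional.cosine_similarity(embeddings, query_emb.unsqueeze(0), dim=1)
    return torch.topk(sims, min(k, sims.numel())).indices.tolist()

def search(query: str, k: int = 5) -> list:
    """Embed the query and return the k most similar stored sentences."""
    # Imported lazily so top_k_indices is usable without the model download.
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    sentences = json.loads(Path("sentences.json").read_text())
    embeddings = torch.tensor(json.loads(Path("embeddings.json").read_text()), device=device)
    model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
    query_emb = torch.tensor(model.encode(query), device=device)
    return [sentences[i] for i in top_k_indices(query_emb, embeddings, k)]
```

The retrieved sentences are then packed into the prompt sent to Ollama.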
## Performance

- Embedding generation and similarity search are GPU-accelerated
- Embeddings are stored in GPU VRAM to maximize speed
- Batch processing is used for better GPU utilization
- Memory-friendly design for large document sets
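The VRAM point above suggests a simple design: move the embedding matrix to the GPU once and L2-normalize it up front, so each query reduces to a single matrix–vector product. A sketch under that assumption (helper names are illustrative, and the code falls back to CPU when no GPU is present):

```python
import torch

def as_vram_matrix(embeddings: list) -> torch.Tensor:
    """Load embeddings onto the GPU once (CPU fallback) and L2-normalize
    rows so cosine similarity becomes a plain dot product."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    mat = torch.tensor(embeddings, dtype=torch.float32, device=device)
    return torch.nn.functional.normalize(mat, dim=1)

def cosine_scores(mat: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """One matrix-vector product per query; the corpus is never re-normalized."""
    q = torch.nn.functional.normalize(query.to(mat.device), dim=0)
    return mat @ q
```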
## Files

- `database.py`: Builds the vector database with GPU-accelerated embeddings
- `query.py`: Accepts user queries, performs semantic search, and generates answers using Ollama
- `embeddings.json`: Precomputed sentence embeddings
- `sentences.json`: Original text extracted from the PDF documents
## Troubleshooting

- Check GPU status: `nvidia-smi`
- Validate PyTorch CUDA support:

  ```bash
  python -c "import torch; print(torch.cuda.is_available())"
  ```

- Ensure the Ollama server is running
- Default API port: `11434`
- Check available models: `ollama list`
- For large PDFs, reduce the batch size in `database.py`
- Consider chunking content manually if you face memory limits
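Manual chunking can be as simple as greedily packing consecutive sentences into bounded-size chunks before embedding. A hypothetical helper (not part of the scripts above):

```python
def chunk_text(sentences: list, max_chars: int = 1000) -> list:
    """Greedily pack consecutive sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Embedding chunks instead of single sentences reduces the number of vectors held in VRAM, at the cost of coarser retrieval granularity.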
Built with 💡 curiosity and 🔥 passion for local AI!