
Contextual Retrieval RAG - Local & Zero Cost

A fully local, completely free RAG system with contextual retrieval. No API costs, no cloud dependencies, complete privacy.


Features

  • Zero Cost - No API fees, completely free forever
  • 100% Private - Everything runs locally, your data never leaves your machine
  • Works Offline - No internet required after initial setup
  • 49% Fewer Retrieval Failures - Contextual retrieval cut retrieval failure rates by up to 49% in Anthropic's published benchmarks
  • Fast Setup - Get running in 15 minutes
  • Modest Hardware - Runs on 4GB RAM minimum

What Is This?

This is a Retrieval-Augmented Generation (RAG) system that implements Contextual Retrieval using completely local, open-source tools. Upload documents, ask questions, and get AI-powered answers - all without sending your data to any cloud service or paying for API calls.

How It Works

  1. Upload - Add your PDF or text documents
  2. Process - Documents are chunked and enriched with context by a local LLM (see the sketch after this list)
  3. Store - Chunks are embedded and stored in a local vector database
  4. Query - Ask questions and get AI-generated answers from your documents
  5. Privacy - Everything stays on your computer
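
The context-enrichment step (step 2) is what makes this "contextual" retrieval. Here is a rough sketch of how it could look against Ollama's REST API; the prompt wording and the enrich_chunk helper are illustrative assumptions, not the repo's exact code (see src/context_generator.py for the real implementation):

import requests

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default endpoint

def enrich_chunk(document: str, chunk: str, model: str = "mistral") -> str:
    """Ask the local LLM for a short context situating `chunk` within
    `document`, then prepend it (hypothetical helper for illustration)."""
    prompt = (
        "Here is a document:\n" + document[:4000] +  # truncate very long docs
        "\n\nHere is a chunk from that document:\n" + chunk +
        "\n\nWrite one or two sentences situating this chunk within the document."
    )
    resp = requests.post(
        f"{OLLAMA_HOST}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    context = resp.json()["response"].strip()
    return context + "\n\n" + chunk  # the enriched text is what gets embedded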

Quick Start

Prerequisites

  • Python 3.10 or higher
  • 4GB RAM minimum (8GB recommended)
  • 5GB free disk space

1. Install Ollama

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from https://ollama.ai/download

2. Download LLM Model

# Start Ollama service (keep running)
ollama serve

# In another terminal, download Mistral 7B (~4GB)
ollama pull mistral

3. Set Up Project

# Clone repository
git clone https://github.com/Sergio-CVM00/contextual-rag-local.git
cd contextual-rag-local

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

4. Run the Application

# Make sure Ollama is running in another terminal
streamlit run app.py

Open your browser to http://localhost:8501 and start uploading documents!

Tech Stack

| Component  | Technology            | Why                        |
|------------|-----------------------|----------------------------|
| LLM        | Ollama + Mistral 7B   | Fast, capable, runs on CPU |
| Embeddings | Sentence-Transformers | Local, no API needed       |
| Vector DB  | Chroma DB             | Open-source, embedded      |
| Search     | BM25 + semantic       | Hybrid retrieval           |
| UI         | Streamlit             | Simple, powerful           |
| Cost       | $0                    | Everything is free         |
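
To show how the embedding and vector-DB rows fit together, here is a minimal sketch of the store-and-query path using Sentence-Transformers with Chroma's persistent client. The collection name and example strings are assumptions for illustration; see src/vector_db.py for what the project actually does.

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # the local model from config.py
client = chromadb.PersistentClient(path="data/chroma_db")
collection = client.get_or_create_collection("documents")  # name assumed

# Store enriched chunks together with their embeddings
chunks = ["First enriched chunk...", "Second enriched chunk..."]
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# Query: embed the question and fetch the nearest chunks
results = collection.query(
    query_embeddings=embedder.encode(["What is this document about?"]).tolist(),
    n_results=5,
)
print(results["documents"][0])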

Project Structure

contextual-rag-local/
├── app.py                      # Streamlit UI
├── config.py                   # Configuration
├── requirements.txt            # Dependencies
│
├── src/
│   ├── document_processor.py   # Document chunking
│   ├── context_generator.py    # Context generation
│   ├── vector_db.py            # Vector database
│   ├── retrieval.py            # Hybrid retrieval
│   └── llm_interface.py        # Ollama integration
│
├── data/
│   ├── sample_documents/       # Your documents
│   └── chroma_db/              # Vector storage
│
└── README.md

Configuration

Edit config.py to customize:

# LLM Settings
OLLAMA_HOST = "http://localhost:11434"
CONTEXT_MODEL = "mistral"       # Model for context generation
RESPONSE_MODEL = "mistral"      # Model for answering

# Chunking
CHUNK_SIZE = 800                # Characters per chunk
CHUNK_OVERLAP = 0.2             # 20% overlap

# Retrieval
TOP_K_SEMANTIC = 20             # Semantic search results
TOP_K_BM25 = 20                 # Keyword search results
TOP_K_FINAL = 10                # Final results to use

# Embeddings
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Local embedding model
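
The three TOP_K_* values drive the hybrid search: up to 20 semantic hits and 20 BM25 hits are merged down to a final 10. The project's exact merging logic lives in src/retrieval.py; the sketch below uses reciprocal rank fusion (RRF) as one plausible way to combine the two rankings, so treat the fusion choice as an assumption.

import numpy as np
from rank_bm25 import BM25Okapi                 # pip install rank-bm25
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def hybrid_search(query, chunks, top_k_semantic=20, top_k_bm25=20, top_k_final=10):
    """Merge BM25 and semantic rankings with reciprocal rank fusion (RRF).
    RRF is an assumption here; the repo may fuse results differently."""
    # Keyword ranking
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    bm25_rank = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:top_k_bm25]

    # Semantic ranking via cosine similarity on normalized embeddings
    emb = model.encode(chunks, normalize_embeddings=True)
    q = model.encode([query], normalize_embeddings=True)[0]
    sem_rank = np.argsort(emb @ q)[::-1][:top_k_semantic]

    # RRF: score = sum over rankings of 1 / (60 + rank); 60 is the usual constant
    scores = {}
    for ranking in (bm25_rank, sem_rank):
        for rank, idx in enumerate(ranking):
            scores[int(idx)] = scores.get(int(idx), 0.0) + 1.0 / (60 + rank)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k_final]
    return [chunks[i] for i in best]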

Hardware Optimization

Limited RAM (4GB)

CHUNK_SIZE = 400
TOP_K_FINAL = 5
CONTEXT_MODEL = "phi"  # Smaller 2.7B-parameter model

Good CPU (8+ cores)

CHUNK_SIZE = 1200
TOP_K_FINAL = 15
CONTEXT_MODEL = "mistral"

GPU Available

# Use larger, more capable model
ollama pull dolphin-mixtral

# Update config
CONTEXT_MODEL = "dolphin-mixtral"
RESPONSE_MODEL = "dolphin-mixtral"

Performance

| Task                           | CPU Time | GPU Time |
|--------------------------------|----------|----------|
| Context generation (per chunk) | 2-5 s    | 0.5-1 s  |
| Embedding generation           | <1 s     | <1 s     |
| Semantic search                | ~100 ms  | ~100 ms  |
| Response generation            | 5-10 s   | 1-2 s    |

Troubleshooting

Connection Refused Error

# Ollama isn't running - start it:
ollama serve
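
A quick way to verify from Python that Ollama is reachable (its root endpoint answers "Ollama is running" on the default port):

import requests

try:
    r = requests.get("http://localhost:11434", timeout=3)
    print(r.text)  # expected: "Ollama is running"
except requests.exceptions.ConnectionError:
    print("Ollama is not running - start it with: ollama serve")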

Model Not Found

# Download the model:
ollama pull mistral

# List installed models:
ollama list

Out of Memory

# In config.py, reduce:
CHUNK_SIZE = 400
CONTEXT_MODEL = "phi"  # Smaller model

Slow Performance

  • Use a GPU for a roughly 10x speedup
  • Use a smaller model (phi, 2.7B parameters)
  • Reduce the chunk size

Usage Tips

  1. Document Quality - Clean, well-formatted documents work best
  2. Chunk Size - Use smaller chunks (400-600 characters) for specific facts and larger ones (800-1200) for broader context; see the sketch after this list
  3. Model Selection - Mistral for balance, Phi for speed, Dolphin-Mixtral for quality (GPU)
  4. Query Formulation - Be specific in your questions for better results
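
To make the chunk-size trade-off in tip 2 concrete, here is a minimal character-based chunker matching the CHUNK_SIZE and CHUNK_OVERLAP settings above. It is a simplified sketch; src/document_processor.py may split on sentence or paragraph boundaries instead.

def chunk_text(text: str, chunk_size: int = 800, overlap: float = 0.2) -> list[str]:
    """Fixed-size character chunks with fractional overlap (sketch only)."""
    step = int(chunk_size * (1 - overlap))  # 800 chars at 20% overlap -> 640-char stride
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With the defaults, a 2,000-character document yields 4 overlapping chunks
print(len(chunk_text("x" * 2000)))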

Cost Comparison

| Solution       | Setup | Monthly    | Annual        |
|----------------|-------|------------|---------------|
| This Project   | Free  | $0         | $0            |
| OpenAI API     | Free  | $100-500   | $1,200-6,000  |
| Enterprise RAG | Free  | $500-5,000 | $6,000-60,000 |

Save $1,200-60,000/year by going local!

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Star History

If you find this project useful, please consider giving it a star!


Made with care for privacy-conscious developers

Zero cost. Zero tracking. Zero compromises.
