A fully local, completely free RAG system with contextual retrieval. No API costs, no cloud dependencies, complete privacy.
- Zero Cost - No API fees, completely free forever
- 100% Private - Everything runs locally, your data never leaves your machine
- Works Offline - No internet required after initial setup
- 49% Better Retrieval - Contextual retrieval has been reported to cut retrieval failure rates by up to 49% compared to standard RAG
- Fast Setup - Get running in 15 minutes
- Modest Hardware - Runs on 4GB RAM minimum
This is a Retrieval-Augmented Generation (RAG) system that implements Contextual Retrieval using completely local, open-source tools. Upload documents, ask questions, and get AI-powered answers - all without sending your data to any cloud service or paying for API calls.
- Upload - Add your PDF or text documents
- Process - Documents are chunked and enriched with context using a local LLM (see the sketch after this list)
- Store - Chunks are embedded and stored in local vector database
- Query - Ask questions and get AI-generated answers from your documents
- Privacy - Everything stays on your computer
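A minimal sketch of that flow, assuming the Ollama Python client, Sentence-Transformers, and Chroma with the default values from config.py. The real pipeline lives in src/ and may use different module and function names:

```python
# Illustrative ingest-and-query flow; the actual implementation is in src/.
import chromadb
import ollama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="data/chroma_db")
collection = client.get_or_create_collection("documents")

def chunk_text(text: str, size: int = 800, overlap: float = 0.2) -> list[str]:
    """Split text into overlapping character chunks (CHUNK_SIZE / CHUNK_OVERLAP)."""
    step = int(size * (1 - overlap))
    return [text[i:i + size] for i in range(0, len(text), step)]

def add_document(doc_id: str, text: str) -> None:
    for i, chunk in enumerate(chunk_text(text)):
        # Contextual retrieval: ask the local LLM to situate the chunk in the document
        # (only a truncated excerpt of the document is sent here, for brevity).
        context = ollama.generate(
            model="mistral",
            prompt=f"Document:\n{text[:2000]}\n\nChunk:\n{chunk}\n\n"
                   "Write one short sentence situating this chunk within the document.",
        )["response"]
        enriched = f"{context}\n{chunk}"
        collection.add(
            ids=[f"{doc_id}-{i}"],
            documents=[enriched],
            embeddings=[embedder.encode(enriched).tolist()],
        )

def ask(question: str) -> str:
    # Retrieve the most relevant enriched chunks, then answer from them.
    hits = collection.query(
        query_embeddings=[embedder.encode(question).tolist()], n_results=10
    )
    context = "\n\n".join(hits["documents"][0])
    return ollama.generate(
        model="mistral",
        prompt=f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    )["response"]
```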
- Python 3.10 or higher
- 4GB RAM minimum (8GB recommended)
- 5GB free disk space
macOS:
```bash
brew install ollama
```
Linux:
```bash
curl https://ollama.ai/install.sh | sh
```
Windows: Download from https://ollama.ai/download
```bash
# Start Ollama service (keep running)
ollama serve

# In another terminal, download Mistral 7B (~4GB)
ollama pull mistral
```

```bash
# Clone repository
git clone <your-repo-url>
cd contextual-rag-local

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Make sure Ollama is running in another terminal
streamlit run app.py
```

Open your browser to http://localhost:8501 and start uploading documents!
| Component | Technology | Why |
|---|---|---|
| LLM | Ollama + Mistral 7B | Fast, capable, runs on CPU |
| Embeddings | Sentence-Transformers | Local, no API needed |
| Vector DB | Chroma DB | Open-source, embedded |
| Search | BM25 + Semantic | Hybrid retrieval (see sketch below) |
| UI | Streamlit | Simple, powerful |
| Cost | $0 | Everything is free |
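Retrieval merges the two search methods in the table above. Here is a rough sketch of one common fusion strategy (reciprocal rank fusion), assuming the rank_bm25 package; the project's retrieval.py may weight or combine results differently:

```python
# Illustrative hybrid retrieval via reciprocal rank fusion (RRF).
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def hybrid_search(query: str, chunks: list[str], top_k: int = 10) -> list[str]:
    # Keyword ranking: BM25 over whitespace-tokenized chunks
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_rank = sorted(range(len(chunks)), key=lambda i: bm25_scores[i], reverse=True)

    # Semantic ranking: cosine similarity of MiniLM embeddings
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sims = util.cos_sim(model.encode(query), model.encode(chunks))[0]
    sem_rank = sorted(range(len(chunks)), key=lambda i: float(sims[i]), reverse=True)

    # RRF: chunks ranked highly by either method float to the top
    scores = {i: 0.0 for i in range(len(chunks))}
    for rank_list in (bm25_rank, sem_rank):
        for rank, i in enumerate(rank_list):
            scores[i] += 1.0 / (60 + rank)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [chunks[i] for i in best]
```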
```
contextual-rag-local/
├── app.py                     # Streamlit UI
├── config.py                  # Configuration
├── requirements.txt           # Dependencies
│
├── src/
│   ├── document_processor.py  # Document chunking
│   ├── context_generator.py   # Context generation
│   ├── vector_db.py           # Vector database
│   ├── retrieval.py           # Hybrid retrieval
│   └── llm_interface.py       # Ollama integration
│
├── data/
│   ├── sample_documents/      # Your documents
│   └── chroma_db/             # Vector storage
│
└── README.md
```
Edit config.py to customize:
```python
# LLM Settings
OLLAMA_HOST = "http://localhost:11434"
CONTEXT_MODEL = "mistral"    # Model for context generation
RESPONSE_MODEL = "mistral"   # Model for answering

# Chunking
CHUNK_SIZE = 800       # Characters per chunk
CHUNK_OVERLAP = 0.2    # 20% overlap

# Retrieval
TOP_K_SEMANTIC = 20    # Semantic search results
TOP_K_BM25 = 20        # Keyword search results
TOP_K_FINAL = 10       # Final results to use

# Embeddings
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Local embedding model
```

For faster processing on modest hardware:
```python
CHUNK_SIZE = 400
TOP_K_FINAL = 5
CONTEXT_MODEL = "phi"  # Smaller 2.7GB model
```

For richer answers when speed matters less:
```python
CHUNK_SIZE = 1200
TOP_K_FINAL = 15
CONTEXT_MODEL = "mistral"
```

For the best quality (GPU recommended), pull a larger model and point the config at it:
```bash
# Use larger, more capable model
ollama pull dolphin-mixtral
```
```python
# Update config
CONTEXT_MODEL = "dolphin-mixtral"
RESPONSE_MODEL = "dolphin-mixtral"
```

| Task | CPU Time | GPU Time |
|---|---|---|
| Context generation/chunk | 2-5s | 0.5-1s |
| Embedding generation | <1s | <1s |
| Semantic search | 100ms | 100ms |
| Response generation | 5-10s | 1-2s |
```bash
# Ollama isn't running - start it:
ollama serve
```
```bash
# Download the model:
ollama pull mistral

# List installed models:
ollama list
```
```python
# In config.py, reduce:
CHUNK_SIZE = 400
CONTEXT_MODEL = "phi"  # Smaller model
```
- Use GPU for 10x speedup
- Use smaller model (phi: 2.7GB)
- Reduce chunk size
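To confirm Ollama is reachable and the model is installed before launching the app, a quick check against Ollama's local REST API can help; the host and model name below mirror the defaults in config.py:

```python
# Quick health check: is Ollama running, and is the configured model pulled?
import requests

OLLAMA_HOST = "http://localhost:11434"
MODEL = "mistral"

try:
    tags = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5).json()
except requests.ConnectionError:
    raise SystemExit("Ollama is not reachable - run `ollama serve` first.")

installed = [m["name"] for m in tags.get("models", [])]
if not any(name.split(":")[0] == MODEL for name in installed):
    raise SystemExit(f"Model '{MODEL}' not found - run `ollama pull {MODEL}`.")

print("Ollama is running with models:", ", ".join(installed))
```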
- Document Quality - Clean, well-formatted documents work best
- Chunk Size - Smaller chunks (400-600) for specific info, larger (800-1200) for context
- Model Selection - Mistral for balance, Phi for speed, Dolphin-Mixtral for quality (GPU)
- Query Formulation - Be specific in your questions for better results
| Solution | Setup | Monthly | Annual |
|---|---|---|---|
| This Project | Free | $0 | $0 |
| OpenAI API | Free | $100-500 | $1,200-6,000 |
| Enterprise RAG | Free | $500-5,000 | $6,000-60,000 |
Save $1,200-60,000/year by going local!
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Ollama - Local LLM runtime
- Chroma DB - Vector database
- Sentence-Transformers - Embeddings
- Streamlit - UI framework
If you find this project useful, please consider giving it a star!
Made with care for privacy-conscious developers
Zero cost. Zero tracking. Zero compromises.