A comprehensive, fully local RAG (Retrieval-Augmented Generation) pipeline for experimenting with different architectural parameters and understanding how each component affects performance. Built with Turkish legal text (KVKK, Turkey's Personal Data Protection Law) as the knowledge base.
This project is designed as a learning platform to:
- Understand RAG architecture deeply through hands-on experimentation
- Compare different embedding models (multilingual vs Turkish-specific)
- Experiment with quantization levels (FP32, FP16, INT8) and measure speed vs quality tradeoffs
- Test various chunking strategies and their impact on retrieval
- Compare retrieval techniques (basic, multi-query, compression, reranking)
- Evaluate local LLM performance with different quantization levels
- Compare RAG vs non-RAG approaches
```
KVKK RAG Pipeline
─────────────────

1. Document Loading (PDF → Pages)
   └─ PyPDFLoader with metadata preservation

2. Chunking (Pages → Chunks)
   ├─ Character Splitting (fixed size)
   ├─ Recursive Splitting (semantic-aware)
   └─ Semantic Chunking (embedding-based)

3. Embeddings (Chunks → Vectors)
   ├─ Multilingual: E5-base, Paraphrase-multilingual
   ├─ Turkish: Turkish-BERT, Turkish-NLI
   └─ Quantization: FP32 / FP16 / INT8

4. Vector Store (Indexing)
   ├─ FAISS (fast, in-memory)
   └─ Chroma (persistent, hybrid search)

5. Retrieval (Query → Relevant Chunks)
   ├─ Basic similarity search
   ├─ Multi-query (multiple query variations)
   ├─ Contextual compression
   ├─ Reranking
   └─ Hybrid search (keyword + semantic)

6. Generation (Chunks + Query → Answer)
   └─ Local LLM via Ollama:
      ├─ Llama 3.1 8B
      ├─ Mistral 7B
      └─ Qwen 2.5 7B

7. Evaluation
   ├─ Retrieval metrics (precision, latency)
   ├─ Generation quality (keyword matching)
   └─ Baseline comparison (RAG vs full context)
```
- No API costs - everything runs on your machine
- Complete privacy - no data leaves your computer
- Works offline after initial model downloads
- Easy parameter switching via YAML configs
- Mix and match components
- Systematic experimentation
- Retrieval speed and quality
- Generation performance
- Memory usage tracking
- Quantization impact measurement
- Clear code with extensive documentation
- Architectural decision explanations
- Comparative analysis tools
- Python 3.10+
- Ollama installed and running
- Clone or navigate to the project directory:

  ```bash
  cd kvkk-rag-pipeline
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Ollama and pull a model:

  ```bash
  # Install from https://ollama.ai
  # Then pull a model (if you don't have llama3 already):
  ollama pull llama3:8b
  # Or llama3.1 if you prefer: ollama pull llama3.1:8b
  ```

- Add your KVKK PDF files to the `data/` directory
```bash
python main.py
```

This will:
- Load PDFs from the `data/` directory
- Chunk documents using the recursive strategy (512 chars, 50 overlap)
- Create embeddings with multilingual-e5-base (FP32)
- Build FAISS vector store
- Initialize Llama 3.1 8B via Ollama
- Run evaluation on test questions
```bash
python main.py --interactive
```

Ask questions about KVKK interactively.
```bash
python main.py --compare-embeddings
```

Benchmarks different quantization levels for embedding models.
```bash
python main.py --check-ollama
```

All experiments are configured via YAML files in `experiments/`. Here's what you can configure:
```yaml
document:
  data_dir: data
  file_pattern: "*.pdf"

chunking:
  strategy: recursive  # Options: character, recursive, semantic
  chunk_size: 512      # 256, 512, 1024, etc.
  chunk_overlap: 50    # 0, 50, 100, 200, etc.
```

Why this matters: chunk size affects retrieval precision. Smaller chunks are more precise but carry less context; larger chunks carry more context but retrieve less precisely.
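To make the strategies concrete, here is a minimal pure-Python sketch of what recursive splitting does (the pipeline presumably uses LangChain's `RecursiveCharacterTextSplitter` for the real thing): try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. The function name and separator list are illustrative.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse with finer
    separators only for pieces still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(part) > chunk_size:
                    # A single piece is still too big: recurse with finer separators
                    chunks.extend(recursive_split(part, chunk_size, separators))
                    current = ""
                else:
                    current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

paragraphs = "Madde 1 - Amaç.\n\n" + "Bu Kanunun amacı kişisel verilerin işlenmesidir. " * 20
chunks = recursive_split(paragraphs, chunk_size=120)
# The short first paragraph stays whole; the long run is split on spaces
```

This is why recursive splitting is "semantic-aware": paragraph and line boundaries are preserved whenever the budget allows, unlike plain character splitting.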
```yaml
embedding:
  model_name: intfloat/multilingual-e5-base
  # Options:
  # - intfloat/multilingual-e5-base (multilingual)
  # - sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  # - dbmdz/bert-base-turkish-cased (Turkish-specific)
  # - emrecan/bert-base-turkish-cased-mean-nli-stsb-tr
  quantization: fp32  # Options: fp32, fp16, int8
  batch_size: 32
```

Why this matters:
- Multilingual models work well across languages but may miss Turkish nuances
- Turkish-specific models may perform better on Turkish legal text
- Quantization trades quality for speed: FP16 is ~2x faster, INT8 is ~4x faster
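To see why INT8 buys roughly a 4x storage win with little quality loss, here is a numpy illustration of symmetric int8 quantization applied to embedding vectors. This is a sketch of the storage-side idea only; the pipeline's `embedding_manager` applies quantization at the model level, and the random vectors below stand in for real embeddings.

```python
import numpy as np

def quantize_int8(vectors):
    """Symmetric per-vector int8 quantization: scale each row into [-127, 127]."""
    scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    q = np.round(vectors / scales).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 384)).astype(np.float32)  # stand-in for real embeddings

q, scales = quantize_int8(emb)
approx = dequantize(q, scales)

# int8 storage is 4x smaller than fp32, and cosine similarities barely move
print(f"fp32: {cos(emb[0], emb[1]):.4f}  int8: {cos(approx[0], approx[1]):.4f}")
```

The per-element rounding error is at most half a quantization step, which is tiny relative to a 384-dimensional vector's norm; that is why similarity rankings survive quantization so well.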
```yaml
vector_store:
  store_type: faiss  # Options: faiss, chroma
  persist_dir: vector_stores
```

Why this matters:
- FAISS: Faster, in-memory, best for experimentation
- Chroma: Persistent, supports hybrid search, better for production
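Whichever store you pick, the core operation is the same: find the chunk embeddings closest to the query embedding. A brute-force numpy version of the search that FAISS accelerates (illustrative names and random data, not the pipeline's API):

```python
import numpy as np

def top_k_search(query, index, k=5):
    """Return (row indices, scores) of the k most cosine-similar rows of `index`."""
    q = query / np.linalg.norm(query)
    docs = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = docs @ q                 # cosine similarity against every chunk
    top = np.argsort(-scores)[:k]     # exhaustive scan; FAISS speeds this up with optimized indexes
    return top, scores[top]

rng = np.random.default_rng(1)
chunk_vectors = rng.normal(size=(1000, 384)).astype(np.float32)  # pretend chunk embeddings
query = chunk_vectors[42] + 0.01 * rng.normal(size=384).astype(np.float32)

idx, scores = top_k_search(query, chunk_vectors, k=5)
print(idx[0])  # chunk 42, the vector we perturbed, ranks first
```

At KVKK scale (hundreds of chunks), even this brute-force scan is fast; the store choice matters more for persistence and hybrid-search features than raw speed.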
```yaml
retrieval:
  strategy: basic  # Options: basic, multi_query, compression, rerank, hybrid
  top_k: 5         # Number of chunks to retrieve
```

Why this matters:
- Basic: Fast, simple similarity search
- Multi-query: Generates query variations, better recall
- Compression: Retrieves more, compresses to relevant, better precision
- Rerank: Retrieve many, rerank, keep best
- Hybrid: Combines keyword + semantic search (Chroma only)
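The multi-query idea in miniature: retrieve once per phrasing of the question, then merge the ranked lists while deduplicating. In the real pipeline the variations come from the LLM and `retrieve` is the vector-store search; both are stand-ins here.

```python
def multi_query_retrieve(variations, retrieve, top_k=5):
    """Retrieve once per query variation and merge the ranked lists
    round-robin, deduplicating by document id."""
    results = [retrieve(q) for q in variations]
    seen, merged = set(), []
    rank = 0
    while len(merged) < top_k and any(rank < len(r) for r in results):
        for r in results:
            if rank < len(r) and r[rank] not in seen:
                seen.add(r[rank])
                merged.append(r[rank])
        rank += 1
    return merged[:top_k]

# Toy ranked results for two phrasings of the same question
fake_results = {
    "KVKK nedir?": ["c1", "c2", "c3"],
    "Kişisel verilerin korunması kanunu nedir?": ["c2", "c4", "c5"],
}
docs = multi_query_retrieve(list(fake_results), fake_results.get, top_k=4)
print(docs)  # ['c1', 'c2', 'c4', 'c3']
```

The recall benefit comes from `c4`: only the second phrasing surfaces it, so a single-query retriever would have missed it.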
```yaml
llm:
  model: llama3:8b  # Options: llama3:8b, llama3.1:8b, mistral:7b, qwen2.5:7b
  temperature: 0.0  # 0 = deterministic, higher = more creative
  max_tokens: 1024
```

Why this matters:
- Llama 3.1: Balanced, good general performance
- Mistral: Fast, efficient
- Qwen 2.5: Excellent for non-English languages
- Temperature: 0 for factual answers, 0.7-1.0 for creative generation
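Under the hood, generation reduces to a single call to Ollama's local HTTP API. A sketch of the underlying `/api/generate` payload so you can see where these config values land (`num_predict` is Ollama's max-tokens option; the prompt template here is illustrative, not the pipeline's actual one):

```python
import json
import urllib.request

def build_ollama_request(question, context, model="llama3:8b",
                         temperature=0.0, max_tokens=1024):
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }

payload = build_ollama_request("KVKK nedir?", "<retrieved chunks go here>")
body = json.dumps(payload).encode()

# Sending it requires Ollama running on its default port:
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with Ollama running
```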
```yaml
baseline:
  enabled: true
  full_context: true  # Use full document instead of RAG
```

Why this matters: comparing RAG against passing the full document shows whether chunking and retrieval actually help.
Create configs with different chunk sizes.

`experiments/chunk_256.yaml`:

```yaml
experiment_name: chunk_size_256
chunking:
  chunk_size: 256
  chunk_overlap: 25
```

`experiments/chunk_1024.yaml`:

```yaml
experiment_name: chunk_size_1024
chunking:
  chunk_size: 1024
  chunk_overlap: 100
```

Run both:

```bash
python main.py --config experiments/chunk_256.yaml
python main.py --config experiments/chunk_1024.yaml
```

Compare the results in `experiments/results/`.
Questions to explore:
- How does chunk size affect retrieval precision?
- Does overlap improve context preservation?
- What's the optimal size for legal text?
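On the overlap question, a minimal fixed-size splitter shows the mechanism: each chunk re-includes the tail of the previous one, so text cut at a chunk boundary still appears whole in at least one chunk. This is illustrative code, not the pipeline's splitter.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Fixed-size character chunks where each chunk re-includes the
    last `overlap` characters of the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 10  # 100 characters
chunks = chunk_with_overlap(text, chunk_size=40, overlap=10)
# chunks[1] starts with the last 10 characters of chunks[0]
```

The cost is storage and duplicated retrievals: with overlap 10 and step 30, every character is embedded in up to two chunks.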
Test multilingual vs Turkish-specific:

```yaml
# Config 1: Multilingual
embedding:
  model_name: intfloat/multilingual-e5-base
  quantization: fp32
```

```yaml
# Config 2: Turkish
embedding:
  model_name: dbmdz/bert-base-turkish-cased
  quantization: fp32
```

Questions to explore:
- Does a Turkish-specific model improve retrieval for KVKK?
- How much better (if at all)?
- Is the improvement worth the reduced flexibility?
Test speed vs quality tradeoffs:

```bash
python main.py --compare-embeddings
```

Or create configs:

```yaml
# FP32 (baseline)
embedding:
  quantization: fp32
```

```yaml
# FP16 (~2x faster)
embedding:
  quantization: fp16
```

```yaml
# INT8 (~4x faster)
embedding:
  quantization: int8
```

Questions to explore:
- How much speed improvement?
- How much quality degradation?
- What's the sweet spot for your use case?
```yaml
# Try each strategy
retrieval:
  strategy: basic  # Then: multi_query, compression, rerank
```

Questions to explore:
- Which strategy gives best precision?
- Which is fastest?
- Do advanced strategies justify the added complexity?
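The evaluator already records `avg_retrieval_time_ms`, but for quick ad-hoc speed comparisons between strategies a tiny timing harness is enough. The stand-in retriever below is illustrative; in practice you would pass the pipeline's configured retrieval call.

```python
import time

def time_retrieval(retrieve, queries, repeats=5):
    """Average wall-clock latency, in ms per query, of a retrieval callable."""
    start = time.perf_counter()
    for _ in range(repeats):
        for q in queries:
            retrieve(q)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / (repeats * len(queries))

# Stand-in retriever: substitute the strategy under test from the pipeline
corpus = [f"chunk {i}" for i in range(1000)]
latency_ms = time_retrieval(lambda q: [c for c in corpus if q in c], ["5", "42"])
print(f"{latency_ms:.3f} ms/query")
```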
```
kvkk-rag-pipeline/
├── data/                          # Put your KVKK PDFs here
├── embeddings/                    # Downloaded embedding models
├── evaluation/
│   └── questions.yaml             # Test questions
├── experiments/
│   ├── default_config.yaml        # Default configuration
│   └── results/                   # Evaluation results (JSON)
├── notebooks/                     # Jupyter notebooks for exploration
├── src/
│   ├── config.py                  # Configuration management
│   ├── pipeline.py                # Main pipeline orchestrator
│   ├── document_processing/
│   │   ├── loader.py              # PDF loading
│   │   └── chunker.py             # Text chunking strategies
│   ├── embeddings/
│   │   └── embedding_manager.py   # Embeddings with quantization
│   ├── vector_stores/
│   │   └── vector_store_manager.py  # FAISS and Chroma
│   ├── retrieval/
│   │   └── retrieval_manager.py   # Retrieval strategies
│   ├── llm/
│   │   └── llm_manager.py         # Ollama integration
│   └── evaluation/
│       └── evaluator.py           # Evaluation framework
├── vector_stores/                 # Persisted vector stores
├── main.py                        # Main entry point
├── requirements.txt
└── README.md
```
After running experiments, check `experiments/results/*.json`:

```json
{
  "config_name": "chunk_size_512",
  "retrieval_metrics": {
    "avg_retrieval_time_ms": 15.3,
    "avg_keyword_match_rate": 0.85,
    "avg_docs_retrieved": 5.0
  },
  "generation_metrics": {
    "avg_generation_time_s": 3.2,
    "avg_keyword_match_rate": 0.78,
    "avg_answer_length": 234
  }
}
```

Key metrics to compare:
- Retrieval time: Lower is better (faster retrieval)
- Keyword match rate: Higher is better (more relevant retrieval/generation)
- Generation time: Lower is better, but quality matters more
- Answer length: Longer is not necessarily better; check quality manually
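The keyword match rate is simple to reproduce by hand. A plausible minimal version, assuming case-insensitive substring matching against expected keywords from `evaluation/questions.yaml` (the actual implementation lives in `src/evaluation/evaluator.py` and may differ in details):

```python
def keyword_match_rate(text, keywords):
    """Fraction of expected keywords that appear in the text (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)

answer = "KVKK, kişisel verilerin korunması hakkında bir kanundur."
rate = keyword_match_rate(answer, ["kişisel veri", "kanun", "GDPR"])
print(rate)  # 2 of 3 keywords found -> 0.666...
```

Substring matching is crude for an agglutinative language like Turkish (suffixes can hide a stem), which is one reason to spot-check answers manually rather than trusting the metric alone.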
- Run default config first
- Understand the baseline performance
- Change ONE parameter at a time
- Compare results
- Week 1: Chunking experiments
- Week 2: Embedding comparisons
- Week 3: Retrieval strategies
- Week 4: LLM quantization
Document your findings:
- What worked well?
- What surprised you?
- What tradeoffs did you discover?
The code is heavily documented. Read through:
- `src/config.py` - See all available options
- `src/pipeline.py` - Understand the flow
- Individual modules - Deep dive into each component
Create a notebook for interactive exploration:

```python
from pathlib import Path
from src.config import ExperimentConfig
from src.pipeline import RAGPipeline

# Load config
config = ExperimentConfig.from_yaml(Path("experiments/default_config.yaml"))

# Create pipeline
pipeline = RAGPipeline(config)
pipeline.run_full_pipeline()

# Query
answer = pipeline.query("KVKK nedir?")
print(answer)

# Inspect retrievals
pipeline.retrieval_manager.print_retrieval_results("Veri sorumlusu kimdir?")
```

Create a script to test multiple configurations:
```python
from pathlib import Path
from src.config import ExperimentConfig, ChunkingConfig
from src.pipeline import RAGPipeline

chunk_sizes = [256, 512, 1024]
results = []

for size in chunk_sizes:
    config = ExperimentConfig.from_yaml(Path("experiments/default_config.yaml"))
    config.chunking.chunk_size = size
    config.experiment_name = f"chunk_{size}"

    pipeline = RAGPipeline(config)
    pipeline.run_full_pipeline()
    result = pipeline.evaluate()
    results.append(result)

# Compare all results
from src.evaluation import RAGEvaluator
evaluator = RAGEvaluator(Path("experiments/results"))
evaluator.compare_configurations(results)
```

- Make sure Ollama is installed: https://ollama.ai
- Start Ollama service
- Pull a model:

  ```bash
  ollama pull llama3:8b
  ```
- Add PDF files to the `data/` directory
- Check that the file pattern in the config matches your files
- Reduce batch size in embedding config
- Use smaller embedding model
- Reduce chunk size
- Use INT8 quantization
- Use FAISS instead of Chroma
- Enable quantization (FP16 or INT8)
- Reduce `top_k` in the retrieval config
- Use a smaller embedding model
This is a learning project. Feel free to:
- Add new chunking strategies
- Integrate new embedding models
- Implement additional retrieval techniques
- Add evaluation metrics
- Improve documentation
MIT License - Use freely for learning and experimentation.
- LangChain for the RAG framework
- Ollama for local LLM serving
- Sentence Transformers for embeddings
- The Turkish NLP community for Turkish models
Happy Learning! Remember: the goal is to understand, not just to run. Take time to experiment, observe, and learn from each configuration change.