A comprehensive, fully local RAG (Retrieval-Augmented Generation) pipeline for experimenting with different architectural parameters and understanding how each component affects performance. Built with Turkish legal text (KVKK, Turkey's Personal Data Protection Law) as the knowledge base.
This project is designed as a learning platform to:
- Understand RAG architecture deeply through hands-on experimentation
- Compare different embedding models (multilingual vs Turkish-specific)
- Experiment with quantization levels (FP32, FP16, INT8) and measure speed vs quality tradeoffs
- Test various chunking strategies and their impact on retrieval
- Compare retrieval techniques (basic, multi-query, compression, reranking)
- Evaluate local LLM performance with different quantization levels
- Compare RAG vs non-RAG approaches
```
KVKK RAG Pipeline
─────────────────

1. Document Loading (PDF → Pages)
   └─ PyPDFLoader with metadata preservation

2. Chunking (Pages → Chunks)
   ├─ Character Splitting (fixed size)
   ├─ Recursive Splitting (semantic-aware)
   └─ Semantic Chunking (embedding-based)

3. Embeddings (Chunks → Vectors)
   ├─ Multilingual: E5-base, Paraphrase-multilingual
   ├─ Turkish: Turkish-BERT, Turkish-NLI
   └─ Quantization: FP32 / FP16 / INT8

4. Vector Store (Indexing)
   ├─ FAISS (fast, in-memory)
   └─ Chroma (persistent, hybrid search)

5. Retrieval (Query → Relevant Chunks)
   ├─ Basic similarity search
   ├─ Multi-query (multiple query variations)
   ├─ Contextual compression
   ├─ Reranking
   └─ Hybrid search (keyword + semantic)

6. Generation (Chunks + Query → Answer)
   └─ Local LLM via Ollama:
      ├─ Llama 3.1 8B
      ├─ Mistral 7B
      └─ Qwen 2.5 7B

7. Evaluation
   ├─ Retrieval metrics (precision, latency)
   ├─ Generation quality (keyword matching)
   └─ Baseline comparison (RAG vs full context)
```
- No API costs - everything runs on your machine
- Complete privacy - no data leaves your computer
- Works offline after initial model downloads
- Easy parameter switching via YAML configs
- Mix and match components
- Systematic experimentation
- Retrieval speed and quality
- Generation performance
- Memory usage tracking
- Quantization impact measurement
- Clear code with extensive documentation
- Architectural decision explanations
- Comparative analysis tools
- Python 3.10+
- Ollama installed and running
- Clone or navigate to the project directory:

  ```bash
  cd kvkk-rag-pipeline
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Ollama and pull a model:

  ```bash
  # Install from https://ollama.ai
  # Then pull a model (if you don't have llama3 already):
  ollama pull llama3:8b
  # Or llama3.1 if you prefer: ollama pull llama3.1:8b
  ```

- Add your KVKK PDF files to the `data/` directory
```bash
python main.py
```

This will:
- Load PDFs from the `data/` directory
- Chunk documents using the recursive strategy (512 chars, 50 overlap)
- Create embeddings with multilingual-e5-base (FP32)
- Build FAISS vector store
- Initialize Llama 3.1 8B via Ollama
- Run evaluation on test questions
```bash
python main.py --interactive
```

Ask questions about KVKK interactively.
```bash
python main.py --compare-embeddings
```

Benchmarks different quantization levels for embedding models.
```bash
python main.py --check-ollama
```

All experiments are configured via YAML files in `experiments/`. Here's what you can configure:
```yaml
document:
  data_dir: data
  file_pattern: "*.pdf"

chunking:
  strategy: recursive  # Options: character, recursive, semantic
  chunk_size: 512      # 256, 512, 1024, etc.
  chunk_overlap: 50    # 0, 50, 100, 200, etc.
```

Why this matters: chunk size affects retrieval precision. Smaller chunks are more precise but carry less context; larger chunks carry more context but retrieve less precisely.
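To make the strategies concrete, here is a minimal pure-Python sketch of what recursive splitting does (the pipeline presumably uses LangChain's `RecursiveCharacterTextSplitter` for the real thing): try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. The function name and separator list are illustrative.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse with finer
    separators only for pieces still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(part) > chunk_size:
                    # A single piece is still too big: recurse with finer separators
                    chunks.extend(recursive_split(part, chunk_size, separators))
                    current = ""
                else:
                    current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

paragraphs = "Madde 1 - Amaç.\n\n" + "Bu Kanunun amacı kişisel verilerin işlenmesidir. " * 20
chunks = recursive_split(paragraphs, chunk_size=120)
# The short first paragraph stays whole; the long run is split on spaces
```

This is why recursive splitting is "semantic-aware": paragraph and line boundaries are preserved whenever the budget allows, unlike plain character splitting.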
```yaml
embedding:
  model_name: intfloat/multilingual-e5-base
  # Options:
  # - intfloat/multilingual-e5-base (multilingual)
  # - sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  # - dbmdz/bert-base-turkish-cased (Turkish-specific)
  # - emrecan/bert-base-turkish-cased-mean-nli-stsb-tr
  quantization: fp32  # Options: fp32, fp16, int8
  batch_size: 32
```

Why this matters:
- Multilingual models work well across languages but may miss Turkish nuances
- Turkish-specific models may perform better on Turkish legal text
- Quantization trades quality for speed: FP16 is ~2x faster, INT8 is ~4x faster
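To see why INT8 buys roughly a 4x storage win with little quality loss, here is a numpy illustration of symmetric int8 quantization applied to embedding vectors. This is a sketch of the storage-side idea only; the pipeline's `embedding_manager` applies quantization at the model level, and the random vectors below stand in for real embeddings.

```python
import numpy as np

def quantize_int8(vectors):
    """Symmetric per-vector int8 quantization: scale each row into [-127, 127]."""
    scales = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    q = np.round(vectors / scales).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 384)).astype(np.float32)  # stand-in for real embeddings

q, scales = quantize_int8(emb)
approx = dequantize(q, scales)

# int8 storage is 4x smaller than fp32, and cosine similarities barely move
print(f"fp32: {cos(emb[0], emb[1]):.4f}  int8: {cos(approx[0], approx[1]):.4f}")
```

The per-element rounding error is at most half a quantization step, which is tiny relative to a 384-dimensional vector's norm; that is why similarity rankings survive quantization so well.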
```yaml
vector_store:
  store_type: faiss  # Options: faiss, chroma
  persist_dir: vector_stores
```

Why this matters:
- FAISS: Faster, in-memory, best for experimentation
- Chroma: Persistent, supports hybrid search, better for production
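Whichever store you pick, the core operation is the same: find the chunk embeddings closest to the query embedding. A brute-force numpy version of the search that FAISS accelerates (illustrative names and random data, not the pipeline's API):

```python
import numpy as np

def top_k_search(query, index, k=5):
    """Return (row indices, scores) of the k most cosine-similar rows of `index`."""
    q = query / np.linalg.norm(query)
    docs = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = docs @ q                 # cosine similarity against every chunk
    top = np.argsort(-scores)[:k]     # exhaustive scan; FAISS speeds this up with optimized indexes
    return top, scores[top]

rng = np.random.default_rng(1)
chunk_vectors = rng.normal(size=(1000, 384)).astype(np.float32)  # pretend chunk embeddings
query = chunk_vectors[42] + 0.01 * rng.normal(size=384).astype(np.float32)

idx, scores = top_k_search(query, chunk_vectors, k=5)
print(idx[0])  # chunk 42, the vector we perturbed, ranks first
```

At KVKK scale (hundreds of chunks), even this brute-force scan is fast; the store choice matters more for persistence and hybrid-search features than raw speed.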
```yaml
retrieval:
  strategy: basic  # Options: basic, multi_query, compression, rerank, hybrid
  top_k: 5         # Number of chunks to retrieve
```

Why this matters:
- Basic: Fast, simple similarity search
- Multi-query: Generates query variations, better recall
- Compression: Retrieves more, compresses to relevant, better precision
- Rerank: Retrieve many, rerank, keep best
- Hybrid: Combines keyword + semantic search (Chroma only)
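The multi-query idea in miniature: retrieve once per phrasing of the question, then merge the ranked lists while deduplicating. In the real pipeline the variations come from the LLM and `retrieve` is the vector-store search; both are stand-ins here.

```python
def multi_query_retrieve(variations, retrieve, top_k=5):
    """Retrieve once per query variation and merge the ranked lists
    round-robin, deduplicating by document id."""
    results = [retrieve(q) for q in variations]
    seen, merged = set(), []
    rank = 0
    while len(merged) < top_k and any(rank < len(r) for r in results):
        for r in results:
            if rank < len(r) and r[rank] not in seen:
                seen.add(r[rank])
                merged.append(r[rank])
        rank += 1
    return merged[:top_k]

# Toy ranked results for two phrasings of the same question
fake_results = {
    "KVKK nedir?": ["c1", "c2", "c3"],
    "Kişisel verilerin korunması kanunu nedir?": ["c2", "c4", "c5"],
}
docs = multi_query_retrieve(list(fake_results), fake_results.get, top_k=4)
print(docs)  # ['c1', 'c2', 'c4', 'c3']
```

The recall benefit comes from `c4`: only the second phrasing surfaces it, so a single-query retriever would have missed it.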
```yaml
llm:
  model: llama3:8b  # Options: llama3:8b, llama3.1:8b, mistral:7b, qwen2.5:7b
  temperature: 0.0  # 0 = deterministic, higher = more creative
  max_tokens: 1024
```

Why this matters:
- Llama 3.1: Balanced, good general performance
- Mistral: Fast, efficient
- Qwen 2.5: Excellent for non-English languages
- Temperature: 0 for factual answers, 0.7-1.0 for creative generation
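Under the hood, generation reduces to a single call to Ollama's local HTTP API. A sketch of the underlying `/api/generate` payload so you can see where these config values land (`num_predict` is Ollama's max-tokens option; the prompt template here is illustrative, not the pipeline's actual one):

```python
import json
import urllib.request

def build_ollama_request(question, context, model="llama3:8b",
                         temperature=0.0, max_tokens=1024):
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }

payload = build_ollama_request("KVKK nedir?", "<retrieved chunks go here>")
body = json.dumps(payload).encode()

# Sending it requires Ollama running on its default port:
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with Ollama running
```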
```yaml
baseline:
  enabled: true
  full_context: true  # Use full document instead of RAG
```

Why this matters: comparing RAG against passing the full document shows whether chunking and retrieval actually help.
Create configs with different chunk sizes.

`experiments/chunk_256.yaml`:

```yaml
experiment_name: chunk_size_256
chunking:
  chunk_size: 256
  chunk_overlap: 25
```

`experiments/chunk_1024.yaml`:

```yaml
experiment_name: chunk_size_1024
chunking:
  chunk_size: 1024
  chunk_overlap: 100
```

Run both:

```bash
python main.py --config experiments/chunk_256.yaml
python main.py --config experiments/chunk_1024.yaml
```

Compare the results in `experiments/results/`.
Questions to explore:
- How does chunk size affect retrieval precision?
- Does overlap improve context preservation?
- What's the optimal size for legal text?
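On the overlap question, a minimal fixed-size splitter shows the mechanism: each chunk re-includes the tail of the previous one, so text cut at a chunk boundary still appears whole in at least one chunk. This is illustrative code, not the pipeline's splitter.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Fixed-size character chunks where each chunk re-includes the
    last `overlap` characters of the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 10  # 100 characters
chunks = chunk_with_overlap(text, chunk_size=40, overlap=10)
# chunks[1] starts with the last 10 characters of chunks[0]
```

The cost is storage and duplicated retrievals: with overlap 10 and step 30, every character is embedded in up to two chunks.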
Test multilingual vs Turkish-specific:

```yaml
# Config 1: Multilingual
embedding:
  model_name: intfloat/multilingual-e5-base
  quantization: fp32
```

```yaml
# Config 2: Turkish
embedding:
  model_name: dbmdz/bert-base-turkish-cased
  quantization: fp32
```

Questions to explore:
- Does a Turkish-specific model improve retrieval for KVKK?
- How much better (if at all)?
- Is the improvement worth the reduced flexibility?
Test speed vs quality tradeoffs:

```bash
python main.py --compare-embeddings
```

Or create configs:

```yaml
# FP32 (baseline)
embedding:
  quantization: fp32
```

```yaml
# FP16 (~2x faster)
embedding:
  quantization: fp16
```

```yaml
# INT8 (~4x faster)
embedding:
  quantization: int8
```

Questions to explore:
- How much speed improvement?
- How much quality degradation?
- What's the sweet spot for your use case?
```yaml
# Try each strategy
retrieval:
  strategy: basic  # Then: multi_query, compression, rerank
```

Questions to explore:
- Which strategy gives best precision?
- Which is fastest?
- Do advanced strategies justify the added complexity?
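The evaluator already records `avg_retrieval_time_ms`, but for quick ad-hoc speed comparisons between strategies a tiny timing harness is enough. The stand-in retriever below is illustrative; in practice you would pass the pipeline's configured retrieval call.

```python
import time

def time_retrieval(retrieve, queries, repeats=5):
    """Average wall-clock latency, in ms per query, of a retrieval callable."""
    start = time.perf_counter()
    for _ in range(repeats):
        for q in queries:
            retrieve(q)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / (repeats * len(queries))

# Stand-in retriever: substitute the strategy under test from the pipeline
corpus = [f"chunk {i}" for i in range(1000)]
latency_ms = time_retrieval(lambda q: [c for c in corpus if q in c], ["5", "42"])
print(f"{latency_ms:.3f} ms/query")
```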
```
kvkk-rag-pipeline/
├── data/                          # Put your KVKK PDFs here
├── embeddings/                    # Downloaded embedding models
├── evaluation/
│   └── questions.yaml             # Test questions
├── experiments/
│   ├── default_config.yaml        # Default configuration
│   └── results/                   # Evaluation results (JSON)
├── notebooks/                     # Jupyter notebooks for exploration
├── src/
│   ├── config.py                  # Configuration management
│   ├── pipeline.py                # Main pipeline orchestrator
│   ├── document_processing/
│   │   ├── loader.py              # PDF loading
│   │   └── chunker.py             # Text chunking strategies
│   ├── embeddings/
│   │   └── embedding_manager.py   # Embeddings with quantization
│   ├── vector_stores/
│   │   └── vector_store_manager.py  # FAISS and Chroma
│   ├── retrieval/
│   │   └── retrieval_manager.py   # Retrieval strategies
│   ├── llm/
│   │   └── llm_manager.py         # Ollama integration
│   └── evaluation/
│       └── evaluator.py           # Evaluation framework
├── vector_stores/                 # Persisted vector stores
├── main.py                        # Main entry point
├── requirements.txt
└── README.md
```
After running experiments, check `experiments/results/*.json`:

```json
{
  "config_name": "chunk_size_512",
  "retrieval_metrics": {
    "avg_retrieval_time_ms": 15.3,
    "avg_keyword_match_rate": 0.85,
    "avg_docs_retrieved": 5.0
  },
  "generation_metrics": {
    "avg_generation_time_s": 3.2,
    "avg_keyword_match_rate": 0.78,
    "avg_answer_length": 234
  }
}
```

Key metrics to compare:
- Retrieval time: Lower is better (faster retrieval)
- Keyword match rate: Higher is better (more relevant retrieval/generation)
- Generation time: Lower is better, but quality matters more
- Answer length: Longer is not necessarily better; check quality manually
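The keyword match rate is simple to reproduce by hand. A plausible minimal version, assuming case-insensitive substring matching against expected keywords from `evaluation/questions.yaml` (the actual implementation lives in `src/evaluation/evaluator.py` and may differ in details):

```python
def keyword_match_rate(text, keywords):
    """Fraction of expected keywords that appear in the text (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)

answer = "KVKK, kişisel verilerin korunması hakkında bir kanundur."
rate = keyword_match_rate(answer, ["kişisel veri", "kanun", "GDPR"])
print(rate)  # 2 of 3 keywords found -> 0.666...
```

Substring matching is crude for an agglutinative language like Turkish (suffixes can hide a stem), which is one reason to spot-check answers manually rather than trusting the metric alone.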
- Run default config first
- Understand the baseline performance
- Change ONE parameter at a time
- Compare results
- Week 1: Chunking experiments
- Week 2: Embedding comparisons
- Week 3: Retrieval strategies
- Week 4: LLM quantization
Document your findings:
- What worked well?
- What surprised you?
- What tradeoffs did you discover?
The code is heavily documented. Read through:
- `src/config.py` - See all available options
- `src/pipeline.py` - Understand the flow
- Individual modules - Deep dive into each component
Create a notebook for interactive exploration:

```python
from pathlib import Path
from src.config import ExperimentConfig
from src.pipeline import RAGPipeline

# Load config
config = ExperimentConfig.from_yaml(Path("experiments/default_config.yaml"))

# Create pipeline
pipeline = RAGPipeline(config)
pipeline.run_full_pipeline()

# Query
answer = pipeline.query("KVKK nedir?")
print(answer)

# Inspect retrievals
pipeline.retrieval_manager.print_retrieval_results("Veri sorumlusu kimdir?")
```

Create a script to test multiple configurations:
```python
from pathlib import Path
from src.config import ExperimentConfig, ChunkingConfig
from src.pipeline import RAGPipeline

chunk_sizes = [256, 512, 1024]
results = []

for size in chunk_sizes:
    config = ExperimentConfig.from_yaml(Path("experiments/default_config.yaml"))
    config.chunking.chunk_size = size
    config.experiment_name = f"chunk_{size}"

    pipeline = RAGPipeline(config)
    pipeline.run_full_pipeline()
    result = pipeline.evaluate()
    results.append(result)

# Compare all results
from src.evaluation import RAGEvaluator
evaluator = RAGEvaluator(Path("experiments/results"))
evaluator.compare_configurations(results)
```

- Make sure Ollama is installed: https://ollama.ai
- Start Ollama service
- Pull a model:

  ```bash
  ollama pull llama3:8b
  ```
- Add PDF files to the `data/` directory
- Check that the file pattern in the config matches your files
- Reduce batch size in embedding config
- Use smaller embedding model
- Reduce chunk size
- Use INT8 quantization
- Use FAISS instead of Chroma
- Enable quantization (FP16 or INT8)
- Reduce `top_k` in the retrieval config
- Use a smaller embedding model
This is a learning project. Feel free to:
- Add new chunking strategies
- Integrate new embedding models
- Implement additional retrieval techniques
- Add evaluation metrics
- Improve documentation
MIT License - Use freely for learning and experimentation.
- LangChain for the RAG framework
- Ollama for local LLM serving
- Sentence Transformers for embeddings
- The Turkish NLP community for Turkish models
Happy Learning! Remember: the goal is to understand, not just to run. Take time to experiment, observe, and learn from each configuration change.