
RAG Eval Harness

Tiny, reproducible evaluation harness for RAG systems (golden set + metrics).

What it does

  • Runs an evaluation over a golden dataset (JSONL)
  • Computes retrieval metrics: recall@k, MRR
  • Produces a shareable report: report.json, report.md (+ report.png)

This is meant to be a lightweight “regression test” for RAG: run it before and after changes to see whether retrieval got better or worse.

Quickstart

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Minimal metrics output
python -m rag_eval_harness.cli run --dataset data/golden.sample.jsonl --k 5

# Generate a report pack
python -m rag_eval_harness.cli run --dataset data/golden.sample.jsonl --k 5 --report-dir reports/latest

Example output:

{"recall@k": 1.0, "mrr": 0.75, "n": 5}

How to interpret metrics

  • recall@k: did the expected chunk(s) show up in the top-k results?
  • MRR (mean reciprocal rank): on average across the dataset, how high was the first relevant chunk ranked? (higher is better)
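
As a minimal sketch, here is how both metrics can be computed in Python, assuming each example carries a list of gold chunk IDs and a ranked list of retrieved chunk IDs (the dataset format below). The function names are illustrative, not the harness's actual API; the reported MRR is the mean of the per-query reciprocal ranks.

def recall_at_k(gold, retrieved, k):
    """Fraction of gold chunks that appear in the top-k retrieved chunks."""
    top_k = set(retrieved[:k])
    return sum(1 for chunk in gold if chunk in top_k) / len(gold)

def reciprocal_rank(gold, retrieved):
    """1/rank of the first relevant chunk; 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in gold:
            return 1.0 / rank
    return 0.0

# The gold chunk is ranked second: recall@5 still hits, but the rank costs us.
print(recall_at_k(["docA#3"], ["docX#1", "docA#3"], k=5))  # 1.0
print(reciprocal_rank(["docA#3"], ["docX#1", "docA#3"]))   # 0.5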

Dataset format

Each line is a JSON object:

{"id":"q1","question":"...","gold_chunks":["docA#3"],"retrieved":["docA#3","docX#1"]}

Notes:

  • This harness is intentionally vector-DB agnostic.
  • Your ingestion/retrieval pipeline should write the retrieved field, so runs can be scored deterministically (a sketch follows below).
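
A sketch of what writing that field might look like; the file paths and the retrieve stub are hypothetical placeholders, not part of this repo:

import json

def retrieve(question: str, top_k: int = 5) -> list[str]:
    # Placeholder: swap in your vector-DB / retriever query here and
    # return ranked chunk IDs such as "docA#3".
    raise NotImplementedError

with open("data/golden.jsonl") as src, open("data/golden.scored.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        record["retrieved"] = retrieve(record["question"], top_k=5)
        dst.write(json.dumps(record) + "\n")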

CI

The included GitHub Action runs a smoke evaluation on each push/PR.
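
For reference, a smoke workflow along these lines might look like the following (a hypothetical sketch; the workflow actually shipped in this repo may differ):

# Hypothetical sketch, e.g. .github/workflows/eval.yml
name: rag-eval-smoke
on: [push, pull_request]
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python -m rag_eval_harness.cli run --dataset data/golden.sample.jsonl --k 5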
