Tiny, reproducible evaluation harness for RAG systems (golden set + metrics).
- Runs a golden dataset (JSONL)
- Computes retrieval metrics: recall@k, MRR
- Produces a shareable report: `report.json`, `report.md` (+ `report.png`)
This is meant to be a lightweight “regression test” for RAG: run it before/after changes to know if retrieval got better or worse.
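As a sketch of that before/after workflow, the snippet below diffs two `report.json` files. It assumes the report uses the same keys as the example output shown further down (`recall@k`, `mrr`); the script itself is not shipped with the harness:

```python
import json
import sys

def compare_reports(before_path, after_path, keys=("recall@k", "mrr")):
    """Print metric deltas between two report.json files (key names assumed)."""
    with open(before_path) as f:
        before = json.load(f)
    with open(after_path) as f:
        after = json.load(f)
    for key in keys:
        delta = after[key] - before[key]
        print(f"{key}: {before[key]:.3f} -> {after[key]:.3f} ({delta:+.3f})")

if __name__ == "__main__":
    # e.g. python compare_reports.py reports/before/report.json reports/after/report.json
    compare_reports(sys.argv[1], sys.argv[2])
```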
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
```bash
# Minimal metrics output
python -m rag_eval_harness.cli run --dataset data/golden.sample.jsonl --k 5

# Generate a report pack
python -m rag_eval_harness.cli run --dataset data/golden.sample.jsonl --k 5 --report-dir reports/latest
```

Example output:
{"recall@k": 1.0, "mrr": 0.75, "n": 5}- recall@k: did the expected chunk(s) show up in the top‑k results?
- MRR: how high was the first relevant chunk ranked? (higher is better)
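To sanity-check the numbers by hand, the metric definitions reduce to a few lines of Python. This is a minimal sketch for a single query, not necessarily the harness's internal implementation:

```python
def recall_at_k(gold_chunks, retrieved, k):
    """Fraction of gold chunks that appear in the top-k retrieved results."""
    if not gold_chunks:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for chunk in gold_chunks if chunk in top_k) / len(gold_chunks)

def reciprocal_rank(gold_chunks, retrieved):
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in gold_chunks:
            return 1.0 / rank
    return 0.0

# Gold chunk ranked second: recall@5 = 1.0, reciprocal rank = 0.5
print(recall_at_k(["docA#3"], ["docX#1", "docA#3"], k=5))  # 1.0
print(reciprocal_rank(["docA#3"], ["docX#1", "docA#3"]))   # 0.5
```

MRR is then the mean of these reciprocal ranks across all queries in the dataset.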
Each line of the golden dataset is a JSON object:
{"id":"q1","question":"...","gold_chunks":["docA#3"],"retrieved":["docA#3","docX#1"]}Notes:
- This harness is intentionally vector-DB agnostic.
- Your ingestion/retrieval pipeline should write `retrieved` so we can score deterministically (see the sketch after this list).
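As a sketch of what that could look like, the hypothetical helper below reads a golden JSONL file, runs each question through a `retrieve(question, k)` function that you supply, and writes the `retrieved` chunk ids back out. The function name and output path are placeholders, not part of the harness:

```python
import json

def attach_retrieved(in_path, out_path, retrieve, k=5):
    """Add top-k retrieved chunk ids to each record of a golden JSONL file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            # `retrieve` is your own retrieval function returning chunk ids like "docA#3"
            record["retrieved"] = retrieve(record["question"], k)
            dst.write(json.dumps(record) + "\n")

# Example usage with your retriever of choice:
# attach_retrieved("data/golden.sample.jsonl", "data/golden.retrieved.jsonl",
#                  retrieve=lambda q, k: my_index.search(q, top_k=k))
```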
The included GitHub Action runs a smoke evaluation on each push/PR.