WikiQA is a tool for generating synthetic question–answer datasets using Wikipedia and Large Language Models (LLMs). It was developed to support the evaluation of Retrieval-Augmented Generation (RAG) systems, particularly the rag-revamped evaluator linked below.
Each dataset entry belongs to one of several cognitive and reasoning categories, enabling targeted evaluation of RAG models:
- ✅ Factual – objective, verifiable facts.
- 🔗 Multi-Hop – multi-step reasoning or combined facts.
- 🧠 Semantic – interpretation and meaning of concepts.
- ⚙️ Logical Reasoning – applying formal rules or laws.
- 💡 Creative Thinking – open-ended or hypothetical reasoning.
- 📏 Problem-Solving – applying formulas or methods to compute results.
- ⚖️ Ethical & Philosophical – moral or conceptual reflection on science.
Each question type is designed to stress different aspects of retrieval and generation in RAG systems.
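The exact output format is defined by the tool itself; the following is a minimal sketch of what a generated entry might look like, with field names chosen for illustration rather than taken from WikiQA's actual schema:

```python
# Illustrative sketch of a generated QA entry (field names are assumptions
# for clarity, not WikiQA's actual output schema).
from dataclasses import dataclass, asdict
import json


@dataclass
class QAEntry:
    question: str      # generated question
    answer: str        # reference (ground-truth) answer
    category: str      # e.g. "factual", "multi-hop", "semantic", ...
    source_title: str  # Wikipedia article the entry was derived from
    context: str       # passage the answer is grounded in


entry = QAEntry(
    question="In which year were the field equations of general relativity published?",
    answer="1915",
    category="factual",
    source_title="General relativity",
    context="Einstein published the field equations of general relativity in 1915.",
)
print(json.dumps(asdict(entry), indent=2))
```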
Although WikiQA itself only generates datasets, its design is guided by standard RAG evaluation metrics (see Key Metrics and Evaluation Methods for RAG).
| Retrieval Metric | Measures | Description |
|---|---|---|
| Precision | Relevance of retrieved docs | Fraction of retrieved documents that are relevant |
| Recall | Coverage of relevant docs | Fraction of relevant documents that were retrieved |
| Hit Rate | Top-result success | % of queries retrieving ≥1 relevant doc in top-k |
| MRR (Mean Reciprocal Rank) | Top result position | Measures how high the first relevant doc ranks |
| NDCG | Ranking quality | Evaluates both relevance and order of retrieved docs |
| MAP (Mean Average Precision) | Overall retrieval accuracy | Averages precision over all relevant docs and queries |
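For reference, the simpler retrieval metrics above have standard per-query definitions. The sketch below implements them directly (standard formulations, not code from the WikiQA repository); corpus-level scores are obtained by averaging over queries:

```python
# Minimal reference implementations of per-query retrieval metrics
# (standard definitions; not taken from the WikiQA codebase).
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k if k else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant) if relevant else 0.0


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document appears in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Example: one query, three retrieved documents, one of them relevant.
retrieved = ["doc_3", "doc_7", "doc_1"]
relevant = {"doc_7"}
print(precision_at_k(retrieved, relevant, k=3))  # 0.333...
print(recall_at_k(retrieved, relevant, k=3))     # 1.0
print(hit_at_k(retrieved, relevant, k=3))        # 1.0
print(reciprocal_rank(retrieved, relevant))      # 0.5
```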
| Generation Metric | Measures | Example |
|---|---|---|
| Faithfulness | Factual consistency with the retrieved context | "Einstein was born in Germany on March 14, 1879" is faithful only if the retrieved passages support it |
| Answer Relevance | How directly the answer addresses the question | For "What is the capital of France?", answering "Paris" is highly relevant |
| Answer Correctness | Alignment with the ground truth | The generated answer matches the reference answer in the dataset |
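These answer-quality metrics are usually scored with an LLM judge or a dedicated evaluation framework rather than string matching. As a rough, purely illustrative baseline, answer correctness can be approximated by token-level F1 overlap with the reference answer; this is a sketch of that proxy, not the evaluator's actual method:

```python
# Token-overlap F1 as a simple proxy for answer correctness
# (illustrative baseline only; faithfulness and relevance are
# typically scored with an LLM judge, not string overlap).
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1(
    "Einstein was born in Ulm, Germany in 1879.",
    "Albert Einstein was born in Ulm, Germany, on 14 March 1879.",
))
```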
This tool can be used to:
- Build synthetic QA datasets for RAG benchmark testing.
- Evaluate the retrieval and generation quality of LLM-based systems.
- Train or fine-tune retrieval models on domain-specific scientific content.
- 🔗 RAG Evaluator: humankernel/rag-revamped
- 🧾 Undergraduate Thesis: humankernel/thesis
