WikiQA is a tool for generating synthetic question–answer datasets using Wikipedia and Large Language Models (LLMs). It was developed to support the evaluation of Retrieval-Augmented Generation (RAG) systems, particularly the rag-revamped evaluator linked below.
Each dataset entry belongs to one of several cognitive and reasoning categories, enabling targeted evaluation of RAG models:
- ✅ Factual – objective, verifiable facts.
- 🔗 Multi-Hop – multi-step reasoning or combined facts.
- 🧠 Semantic – interpretation and meaning of concepts.
- ⚙️ Logical Reasoning – applying formal rules or laws.
- 💡 Creative Thinking – open-ended or hypothetical reasoning.
- 📏 Problem-Solving – applying formulas or methods to compute results.
- ⚖️ Ethical & Philosophical – moral or conceptual reflection on science.
Each question type is designed to stress different aspects of retrieval and generation in RAG systems.
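The exact output format is defined by the tool itself; the following is a minimal sketch of what a generated entry might look like, with field names chosen for illustration rather than taken from WikiQA's actual schema:

```python
# Illustrative sketch of a generated QA entry (field names are assumptions
# for clarity, not WikiQA's actual output schema).
from dataclasses import dataclass, asdict
import json


@dataclass
class QAEntry:
    question: str      # generated question
    answer: str        # reference (ground-truth) answer
    category: str      # e.g. "factual", "multi-hop", "semantic", ...
    source_title: str  # Wikipedia article the entry was derived from
    context: str       # passage the answer is grounded in


entry = QAEntry(
    question="In which year were the field equations of general relativity published?",
    answer="1915",
    category="factual",
    source_title="General relativity",
    context="Einstein published the field equations of general relativity in 1915.",
)
print(json.dumps(asdict(entry), indent=2))
```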
Although WikiQA itself only generates datasets, its design is guided by standard RAG evaluation metrics (see Key Metrics and Evaluation Methods for RAG).
| Retrieval Metric | Measures | Description |
|---|---|---|
| Precision | Relevance of retrieved docs | Fraction of retrieved documents that are relevant |
| Recall | Coverage of relevant docs | Fraction of relevant documents that were retrieved |
| Hit Rate | Top-result success | % of queries retrieving ≥1 relevant doc in top-k |
| MRR (Mean Reciprocal Rank) | Top result position | Measures how high the first relevant doc ranks |
| NDCG | Ranking quality | Evaluates both relevance and order of retrieved docs |
| MAP (Mean Average Precision) | Overall retrieval accuracy | Averages precision over all relevant docs and queries |
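For reference, the simpler retrieval metrics above have standard per-query definitions. The sketch below implements them directly (standard formulations, not code from the WikiQA repository); corpus-level scores are obtained by averaging over queries:

```python
# Minimal reference implementations of per-query retrieval metrics
# (standard definitions; not taken from the WikiQA codebase).
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k if k else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant) if relevant else 0.0


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document appears in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Example: one query, three retrieved documents, one of them relevant.
retrieved = ["doc_3", "doc_7", "doc_1"]
relevant = {"doc_7"}
print(precision_at_k(retrieved, relevant, k=3))  # 0.333...
print(recall_at_k(retrieved, relevant, k=3))     # 1.0
print(hit_at_k(retrieved, relevant, k=3))        # 1.0
print(reciprocal_rank(retrieved, relevant))      # 0.5
```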
| Generation Metric | Measures | Example |
|---|---|---|
| Faithfulness | Factual consistency with the retrieved context | "Einstein was born in Germany on March 14, 1879" is faithful only if the retrieved passages support it |
| Answer Relevance | How directly the answer addresses the question | For "What is the capital of France?", answering "Paris" is highly relevant |
| Answer Correctness | Alignment with the ground truth | The generated answer matches the reference answer in the dataset |
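These answer-quality metrics are usually scored with an LLM judge or a dedicated evaluation framework rather than string matching. As a rough, purely illustrative baseline, answer correctness can be approximated by token-level F1 overlap with the reference answer; this is a sketch of that proxy, not the evaluator's actual method:

```python
# Token-overlap F1 as a simple proxy for answer correctness
# (illustrative baseline only; faithfulness and relevance are
# typically scored with an LLM judge, not string overlap).
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1(
    "Einstein was born in Ulm, Germany in 1879.",
    "Albert Einstein was born in Ulm, Germany, on 14 March 1879.",
))
```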
This tool can be used to:
- Build synthetic QA datasets for RAG benchmark testing.
- Evaluate the retrieval and generation quality of LLM-based systems.
- Train or fine-tune retrieval models on domain-specific scientific content.
- 🔗 RAG Evaluator: humankernel/rag-revamped
- 🧾 Undergraduate Thesis: humankernel/thesis
