🧠 WikiQA Dataset Creator


WikiQA is a tool for generating synthetic question–answer datasets using Wikipedia and Large Language Models (LLMs). It was developed to support the evaluation of Retrieval-Augmented Generation (RAG) systems, particularly this RAG evaluator.
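
A minimal sketch of how such a generation loop could look. The packages, model name, and prompt below are illustrative assumptions, not the repository's actual implementation:

```python
# Illustrative sketch only -- package names, model, and prompt are assumptions,
# not this repository's actual code.
import json
import wikipedia           # pip install wikipedia
from openai import OpenAI  # any OpenAI-compatible LLM endpoint

client = OpenAI()

def make_qa_pair(topic: str, question_type: str = "factual") -> dict:
    # Fetch a grounding passage from Wikipedia.
    page = wikipedia.page(topic, auto_suggest=False)
    context = page.summary
    # Ask the LLM for one question/answer pair grounded in that passage.
    prompt = (
        f"Using only the context below, write one {question_type} question "
        "and its answer as JSON with keys 'question' and 'answer'.\n\n"
        f"Context:\n{context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    qa = json.loads(resp.choices[0].message.content)
    qa.update({"topic": topic, "type": question_type, "context": context})
    return qa

print(make_qa_pair("Pythagorean theorem", "problem-solving"))
```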

📚 Selected Wikipedia Topics

🧮 Mathematics

💻 Computer Science

🧬 Biology

⚛️ Physics

🌍 General Topics

🧩 Question Types

Each dataset entry belongs to one of several cognitive and reasoning categories, enabling targeted evaluation of RAG models:

  1. Factual – objective, verifiable facts.
  2. 🔗 Multi-Hop – multi-step reasoning or combined facts.
  3. 🧠 Semantic – interpretation and meaning of concepts.
  4. ⚙️ Logical Reasoning – applying formal rules or laws.
  5. 💡 Creative Thinking – open-ended or hypothetical reasoning.
  6. 📏 Problem-Solving – applying formulas or methods to compute results.
  7. ⚖️ Ethical & Philosophical – moral or conceptual reflection on science.

Each question type is designed to stress different aspects of retrieval and generation in RAG systems.
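
For illustration, a single generated entry might look like the following. The field names and values are hypothetical, not the tool's exact output schema:

```python
# Hypothetical dataset entry -- field names are illustrative, not the exact schema.
entry = {
    "topic": "General relativity",
    "type": "multi-hop",  # factual, multi-hop, semantic, logical,
                          # creative, problem-solving, or ethical
    "question": "Which expedition confirmed the light-bending prediction "
                "of general relativity, and in what year?",
    "answer": "The Eddington eclipse expedition of 1919.",
    "context": "The Wikipedia passage(s) the pair was grounded in.",
}
```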

📊 Evaluation Metrics

Although WikiQA only generates datasets, it is designed around RAG evaluation metrics (see Key Metrics and Evaluation Methods for RAG).

🔍 Retrieval Metrics

| Metric | Measures | Description |
| --- | --- | --- |
| Precision | Relevance of retrieved docs | Fraction of retrieved documents that are relevant |
| Recall | Coverage of relevant docs | Fraction of relevant documents that were retrieved |
| Hit Rate | Top-result success | % of queries retrieving ≥1 relevant doc in the top-k |
| MRR (Mean Reciprocal Rank) | Top-result position | How high the first relevant doc ranks |
| NDCG | Ranking quality | Evaluates both relevance and order of retrieved docs |
| MAP (Mean Average Precision) | Overall retrieval accuracy | Averages precision over all relevant docs and queries |
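
As a reference point, the rank-based metrics above can be computed per query from the ordered list of retrieved document IDs. A minimal sketch (not code from this repository):

```python
# Reference implementations of per-query retrieval metrics (illustrative, not from this repo).
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / max(len(relevant), 1)

def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # 1 if at least one relevant doc appears in the top-k, else 0.
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant doc; 0 if none is retrieved.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Averaging hit_rate_at_k and reciprocal_rank over all queries gives Hit Rate and MRR.
```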

✍️ Generation Metrics

| Metric | Measures | Example |
| --- | --- | --- |
| Faithfulness | Factual consistency with the retrieved context | “Einstein was born in Germany on March 14, 1879.” |
| Answer Relevance | How well the answer fits the question | Adds missing but relevant info, e.g. France → “Paris” |
| Answer Correctness | Alignment with the ground truth | Matches the true reference answer accurately |
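
Generation metrics such as faithfulness are usually scored with an LLM judge. A minimal sketch, assuming an OpenAI-compatible client; the prompt and 0–1 scale are illustrative, not the evaluator's actual rubric:

```python
# Illustrative LLM-as-judge faithfulness check -- not the evaluator's actual prompt.
from openai import OpenAI

client = OpenAI()

def faithfulness(answer: str, context: str, model: str = "gpt-4o-mini") -> float:
    prompt = (
        "Rate from 0 to 1 how well every claim in the ANSWER is supported by the "
        "CONTEXT. Reply with only the number.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.choices[0].message.content.strip())
```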

⚙️ Example Use Case

This tool can be used to:

  • Build synthetic QA datasets for RAG benchmark testing.
  • Evaluate the retrieval and generation quality of LLM-based systems.
  • Train or fine-tune retrieval models on domain-specific scientific content.

🧠 Related Projects
