A comprehensive, production-grade SaaS platform for evaluating Retrieval-Augmented Generation (RAG) systems. Built for correctness, retrieval quality, hallucination detection, and observability.
EvRAG helps you evaluate your RAG pipelines with:
- Comprehensive Retrieval Metrics: Recall@K, Precision@K, MRR, MAP, Hit Rate, Coverage
- Generation Quality Assessment: Faithfulness, Answer Relevance, Context Utilization, Semantic Similarity
- Multi-Signal Hallucination Detection: LLM-as-Judge, Citation Check, Embedding Drift
- Visual Dashboards: Interactive charts and per-query breakdowns
- Run Comparison: Track improvements across evaluation runs
- Extensible Architecture: Easy to add new metrics and evaluators
```
EvRAG/
├── backend/                  # FastAPI backend
│   ├── app/
│   │   ├── api/              # API routes
│   │   ├── evaluation/       # Core evaluation logic
│   │   │   ├── retrieval/    # Retrieval metrics
│   │   │   ├── generation/   # Generation metrics
│   │   │   └── hallucination/ # Hallucination detection
│   │   ├── db/               # Database models
│   │   ├── services/         # Business logic
│   │   └── rag/              # RAG pipeline interfaces
│   └── requirements.txt
│
└── frontend/                 # Next.js frontend
    ├── app/                  # Pages
    ├── components/           # UI components
    └── lib/                  # API client & utils
```
```bash
cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up database
createdb evrag

# Configure environment
cp .env.example .env
# Edit .env with your settings

# Run server
uvicorn app.main:app --reload
```

Backend runs at: http://localhost:8000
```bash
cd frontend

# Install dependencies
npm install

# Configure environment
cp .env.local.example .env.local
# Edit .env.local if needed

# Run dev server
npm run dev
```

Frontend runs at: http://localhost:3000
| Metric | Description | Formula |
|---|---|---|
| Recall@K | Fraction of relevant docs in top K | relevant_in_topK / total_relevant |
| Precision@K | Fraction of top K that are relevant | relevant_in_topK / K |
| MRR | Mean Reciprocal Rank | mean(1 / rank_of_first_relevant) |
| MAP | Mean Average Precision | mean(Σ(Precision@k × relevance@k) / total_relevant) |
| Hit Rate | At least one relevant doc found | 1 if any relevant else 0 |
| Coverage | Fraction of ground-truth docs retrieved | GT_docs_retrieved / total_GT_docs |
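To make the formulas concrete, here is a minimal per-query sketch; the function and key names are illustrative, not the exact ones in `backend/app/evaluation/retrieval/`. Reciprocal rank and average precision are computed per query and then averaged across the dataset to give MRR and MAP.

```python
# Illustrative per-query retrieval metrics (names are hypothetical).
def retrieval_metrics(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> dict:
    top_k = retrieved_ids[:k]
    relevant_in_top_k = sum(1 for doc_id in top_k if doc_id in relevant_ids)

    # Recall@K and Precision@K
    recall_at_k = relevant_in_top_k / len(relevant_ids) if relevant_ids else 0.0
    precision_at_k = relevant_in_top_k / k if k else 0.0

    # Reciprocal rank of the first relevant document (mean over queries -> MRR)
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank
            break

    # Average precision (mean over queries -> MAP)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    average_precision = precision_sum / len(relevant_ids) if relevant_ids else 0.0

    return {
        "recall@k": recall_at_k,
        "precision@k": precision_at_k,
        "reciprocal_rank": reciprocal_rank,
        "average_precision": average_precision,
        "hit_rate": 1.0 if relevant_in_top_k > 0 else 0.0,
        "coverage": (sum(1 for d in relevant_ids if d in retrieved_ids) / len(relevant_ids))
        if relevant_ids
        else 0.0,
    }
```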
| Metric | Description |
|---|---|
| Faithfulness | Answer grounded in retrieved context |
| Answer Relevance | Answer addresses the query |
| Context Utilization | Answer uses retrieved information |
| Semantic Similarity | Similarity to ground truth answer |
| ROUGE-L | Longest common subsequence F-measure |
| F1 Score | Token-level precision & recall |
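As an illustration of two of these metrics, here is a sketch of token-level F1 and embedding-based semantic similarity using SentenceTransformers. The model name is an assumption; EvRAG's actual implementation may use a different model and tokenization.

```python
from collections import Counter

from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption, not necessarily what EvRAG ships with.
_model = SentenceTransformer("all-MiniLM-L6-v2")

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not pred_tokens or not truth_tokens:
        return 0.0
    overlap = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def semantic_similarity(answer: str, ground_truth_answer: str) -> float:
    """Cosine similarity between sentence embeddings of answer and ground truth."""
    embeddings = _model.encode([answer, ground_truth_answer])
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```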
Multi-Signal Approach:

- LLM-as-Judge (40% weight)
  - Uses GPT to identify unsupported claims
  - Falls back to rule-based detection
- Citation Check (35% weight)
  - Each answer sentence must map to context
  - Reports uncited spans
- Embedding Drift (25% weight)
  - Semantic distance between answer and context
  - High drift indicates hallucination

Output:
- Aggregated hallucination score (0-1)
- Highlighted hallucinated text spans
- Severity classification
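As a sketch of how the weighted combination could look: the weights match the list above, but the function name and severity cutoffs are illustrative assumptions, not EvRAG's actual values.

```python
# Hypothetical aggregation of the three signals; the actual module in
# backend/app/evaluation/hallucination/ may differ. Severity cutoffs are assumptions.
SIGNAL_WEIGHTS = {"llm_judge": 0.40, "citation_check": 0.35, "embedding_drift": 0.25}

def aggregate_hallucination(signals: dict[str, float]) -> dict:
    """Combine per-signal scores (each in [0, 1]) into a weighted score and severity label."""
    score = sum(weight * signals.get(name, 0.0) for name, weight in SIGNAL_WEIGHTS.items())
    if score < 0.3:
        severity = "low"
    elif score < 0.6:
        severity = "medium"
    else:
        severity = "high"
    return {"hallucination_score": round(score, 3), "severity": severity}

# Example: judge flags some claims, citations mostly present, low drift
print(aggregate_hallucination({"llm_judge": 0.5, "citation_check": 0.2, "embedding_drift": 0.1}))
```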
JSON:

```json
{
  "items": [
    {
      "query": "What is retrieval-augmented generation?",
      "ground_truth_docs": ["doc1", "doc2"],
      "ground_truth_answer": "RAG combines retrieval with generation..."
    }
  ]
}
```

CSV:

```csv
query,ground_truth_docs,ground_truth_answer
"What is RAG?","[""doc1"", ""doc2""]","RAG is a technique..."
```

JSONL:

```jsonl
{"query": "What is RAG?", "ground_truth_docs": ["doc1"], "ground_truth_answer": "..."}
{"query": "How does RAG work?", "ground_truth_docs": ["doc2"], "ground_truth_answer": "..."}
```
Your RAG endpoint should accept:

```json
{
  "query": "user query text",
  "top_k": 5
}
```

And return:
```json
{
  "retrieved_docs": [
    {"id": "doc1", "text": "document content..."},
    {"id": "doc2", "text": "more content..."}
  ],
  "generated_answer": "The answer is..."
}
```

A minimal endpoint stub matching this contract is sketched after the feature lists below.

- Real-time evaluation progress
- Aggregate metrics visualization
- Per-query breakdown with drill-down
- Highlighted unsupported claims
- Multi-signal confidence scores
- Citation coverage tracking
- Side-by-side metrics
- Delta calculations
- Trend analysis
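Here is the minimal, illustrative FastAPI stub of the endpoint contract described above. It is not part of EvRAG; the route path `/rag` and the canned documents are placeholders for your own retrieval and generation logic.

```python
# Illustrative stub of the expected RAG endpoint contract -- not part of EvRAG.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RAGRequest(BaseModel):
    query: str
    top_k: int = 5

class RetrievedDoc(BaseModel):
    id: str
    text: str

class RAGResponse(BaseModel):
    retrieved_docs: list[RetrievedDoc]
    generated_answer: str

@app.post("/rag", response_model=RAGResponse)
def answer(request: RAGRequest) -> RAGResponse:
    # Replace with real retrieval + generation; this just returns canned docs.
    docs = [
        RetrievedDoc(id=f"doc{i}", text="document content...")
        for i in range(1, request.top_k + 1)
    ]
    return RAGResponse(retrieved_docs=docs, generated_answer="The answer is...")
```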
Backend:
- FastAPI (async API framework)
- PostgreSQL (data storage)
- SQLAlchemy (ORM)
- SentenceTransformers (embeddings)
- OpenAI (optional LLM judge)
Frontend:
- Next.js 14 (React framework)
- Tailwind CSS (styling)
- shadcn/ui (UI components)
- Recharts (data visualization)
1. Upload Dataset
   - CSV/JSON/JSONL with queries and ground truth
   - Validates schema on upload
2. Create Evaluation Run
   - Connect RAG API endpoint
   - Configure parameters
3. Run Evaluation
   - Async processing in background
   - Real-time progress tracking
4. View Results
   - Comprehensive metrics dashboard
   - Per-query analysis
   - Hallucination detection results
5. Compare Runs
   - Track improvements over time
   - Identify regressions
Use the mock RAG pipeline for testing without a real RAG system:

```
# In evaluation creation, leave rag_endpoint empty
# The system will use MockRAGPipeline
```

Security notes:
- No authentication implemented (add as needed)
- CORS configured for localhost (update for production)
- SQL injection protected via SQLAlchemy
- Input validation via Pydantic
Backend (Docker):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Frontend:

```bash
npm run build
# Deploy dist/ folder
```

This is a production-ready foundation. To extend:
- Add new metrics: Create new files in `backend/app/evaluation/`
- Custom evaluators: Implement in `evaluation/modules`
- UI enhancements: Add components in `frontend/components/`
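The exact evaluator interface lives in the backend code. Purely as an illustrative shape (the name, signature, and logic below are hypothetical), a new metric is essentially a function from a query, the retrieved documents, and the generated answer to a score:

```python
# Hypothetical shape of a custom metric -- the real interface in
# backend/app/evaluation/ may look different.
def context_overlap(query: str, retrieved_docs: list[dict], answer: str) -> float:
    """Fraction of answer tokens that appear anywhere in the retrieved context."""
    context = " ".join(doc["text"] for doc in retrieved_docs).lower()
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    return sum(1 for tok in answer_tokens if tok in context) / len(answer_tokens)
```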
MIT License - feel free to use in your projects
- Authentication & multi-tenancy
- Billing integration
- More LLM providers (Anthropic, Cohere)
- Batch evaluation API
- Custom metric definitions
- Export reports (PDF/CSV)
- Webhooks for run completion
- A/B testing framework
- Correctness over speed: Accurate metrics are critical
- Extensibility: Easy to add new evaluators
- Production-ready: Clean architecture, error handling, logging
- No unnecessary abstractions: Pragmatic code structure
- Observable: Track everything that matters
Built with ❤️ for the RAG community