
LLM Testing with DeepEval & Ollama

Testing framework for evaluating Large Language Models (LLMs) using local models and DeepEval metrics. Includes comprehensive RAG evaluation with JSON output and interactive HTML report generation.

Python · DeepEval · RAGAS · Hugging Face · Ollama · OpenAI · ChromaDB


🚀 Tech Stack & Technologies

Core Languages & Frameworks

  • Python 3.8+ - Primary programming language
  • DeepEval - LLM evaluation framework with custom metrics
  • RAGAS - RAG (Retrieval-Augmented Generation) evaluation toolkit
  • Hugging Face Transformers/Evaluate - NLP model inference and traditional metrics

Local LLM Infrastructure

  • Ollama - Local LLM serving and inference engine
  • ChromaDB - Vector database for embeddings and retrieval
  • LangChain - Framework for building LLM applications

Cloud & API Services

  • OpenAI API - GPT-4 for premium evaluation metrics
  • Wikipedia API - Knowledge retrieval for RAG testing

Models Used

  • Generation Models: llama3.2:3b, deepseek-r1:8b
  • Evaluation Models: GPT-4, deepseek-r1:8b, gemma2:2b
  • NLP Models: BART, RoBERTa, DistilBERT variants

Development Tools

  • pip - Python package management
  • python-dotenv - Environment variable management
  • VS Code - Primary IDE for development

Quick Setup

  1. Create and activate a virtual environment:

    python -m venv venv
    .\venv\Scripts\Activate.ps1   # Windows PowerShell
    source venv/bin/activate      # macOS/Linux
  2. Install dependencies:

    pip install -r requirements.txt
  3. Create .env file:

    OPENAI_API_KEY=your_openai_api_key_here
    
  4. Ensure Ollama is installed and running, then pull the required models:

    ollama pull llama3.2:3b        # Generation model
    ollama pull deepseek-r1:8b     # Evaluation model
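
To confirm the Ollama server is reachable before running any tests, a quick sanity check (not part of the repo) is to query Ollama's local REST endpoint and list the pulled models:

# verify_ollama.py - hypothetical helper, assumes Ollama's default port 11434
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]

# llama3.2:3b and deepseek-r1:8b should appear in this list after the pulls above
print([m["name"] for m in models])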

Project Structure

learn_llmtesting_2025/
├── config/                             # Configuration files
│   └── models.json                     # Model configurations
│
├── utils/                              # Shared utilities and HTML report generator
│   ├── __init__.py
│   ├── config.py                       # Configuration utilities
│   ├── local_llm_ollama_setup.py       # Ollama setup and management
│   ├── create_vector_db.py             # Vector database creation
│   ├── wikipedia_retriever.py          # Wikipedia data retrieval
│   └── generate_html_report.py         # HTML report generator
│
├── deepeval_tests_openai/              # Hybrid: Local generation + OpenAI evaluation
│   ├── __init__.py
│   ├── deepeval_geval.py
│   ├── deepeval_answer_relevancy.py
│   ├── deepeval_bias.py
│   └── deepeval_faithfulness.py
│
├── deepeval_tests_localruns/           # Completely local: Ollama only
│   ├── __init__.py
│   ├── deepeval_geval.py
│   ├── deepeval_answer_relevancy.py
│   ├── deepeval_answer_relevancy_multipletestcases.py
│   ├── deepeval_rag.py
│   └── deepeval_rag_localllm.py
│
├── rag_system_tests/                   # Advanced RAG evaluation frameworks
│   ├── deepeval_rag_validation.py      # DeepEval Goldens RAG evaluation
│   └── ragas_rag_validation.py         # RAGAS comprehensive RAG evaluation
│
├── ragas_tests/                        # RAGAS individual metric tests (local)
│   ├── __init__.py
│   ├── ragas_llmcontextrecall.py
│   ├── ragas_noisesensitivity.py
│   └── ragas_non_llmmetric.py
│
├── ragas_tests_openai/                 # RAGAS individual metric tests (OpenAI)
│   ├── ragas_aspectcritic_openai.py
│   └── ragas_response_relevancy.py
│
├── huggingface_tests/                  # Hugging Face Evaluate framework tests
│   ├── hf_exactmatch.py
│   ├── hf_exactmatch_custom.py
│   ├── hf_f1_custom.py
│   ├── hf_modelaccuracy.py
│   └── hf_modelaccuracy_custom.py
│
├── huggingface_transformers/           # Hugging Face Transformers examples
│   ├── ner.py                          # Named Entity Recognition
│   ├── sentimentanalysis.py            # Sentiment Analysis
│   ├── sentimentanalysis_evaluate.py   # Sentiment Analysis with evaluation
│   ├── textsummarization.py            # Text Summarization
│   └── zeroshotclassification.py       # Zero-shot Classification
│
├── models_tests/                       # Model testing examples
│   ├── sentimentanalysis.py
│   └── textsummarization.py
│
├── wikipedia_chroma_db/                # ChromaDB vector database
│   ├── chroma.sqlite3
│   └── b3fe227c-8aee-443d-8113-9f25926c8a85/
│
├── README.md
├── QUICK_REFERENCE.md
├── requirements.txt
├── metrics_documentation.html          # Interactive metrics documentation
├── deepeval_rag_evaluation_with_20251028_211047_report.html  # RAG evaluation report
└── deepeval_rag_evaluation_with_20251028_211047_report.json  # RAG evaluation data

🔗 Hybrid Tests (Local Generation + Cloud Evaluation)

Response Generation: Local Ollama | Evaluation: OpenAI GPT-4

DeepEval Hybrid Framework

Overview

DeepEval Hybrid Framework combines local LLM generation with cloud-based OpenAI evaluation for production-grade metrics while maintaining cost efficiency.
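
A minimal sketch of the hybrid pattern, not copied from the repo's scripts: the answer is generated through the local Ollama REST API, wrapped in a DeepEval LLMTestCase, and scored by an OpenAI-judged metric (requires OPENAI_API_KEY from .env). The prompt, threshold, and model names are illustrative.

# Hybrid sketch: local generation, OpenAI-judged evaluation (illustrative)
import requests
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

question = "What is the capital of France?"

# 1) Generate locally with Ollama's REST API (assumes the default port 11434)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": question, "stream": False},
    timeout=120,
)
answer = resp.json()["response"]

# 2) Evaluate with an OpenAI judge via DeepEval
metric = AnswerRelevancyMetric(threshold=0.5, model="gpt-4")  # judge model
test_case = LLMTestCase(input=question, actual_output=answer)
metric.measure(test_case)
print(metric.score, metric.reason)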

1. deepeval_tests_openai/deepeval_geval.py

  • Purpose: Test GEval metric with different thresholds using OpenAI evaluation
  • Tests: 4 tests with thresholds 1.0, 0.8, 0.5, 0.0
  • Expected: Tests with higher thresholds fail, threshold=0.0 passes
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4
  • Run: python -m deepeval_tests_openai.deepeval_geval

2. deepeval_tests_openai/deepeval_answer_relevancy.py

  • Purpose: Test if answers are relevant to questions using OpenAI evaluation
  • Tests:
    • France capital → ✅ PASS (direct answer)
    • FIFA 2099 → ✅ PASS (contextually relevant)
    • Pizza answer to a France question → ❌ FAIL (irrelevant)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4
  • Run: python -m deepeval_tests_openai.deepeval_answer_relevancy

3. deepeval_tests_openai/deepeval_bias.py

  • Purpose: Detect gender, racial, political bias using OpenAI evaluation
  • Tests: Describe doctor, nurse, teacher, Indian accent speaker
  • Scoring: 0 = NO BIAS ✅ | > 0.5 = BIAS ❌
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4
  • Run: python -m deepeval_tests_openai.deepeval_bias

4. deepeval_tests_openai/deepeval_faithfulness.py

  • Purpose: Check factual consistency with retrieval context using OpenAI evaluation
  • Tests:
    • Faithful output (LLM-generated) → ✅ PASS (consistent with context)
    • Factually incorrect output → ❌ FAIL (contradicts context)
    • Partially faithful output → Depends on threshold
    • Higher threshold test → Stricter evaluation
  • Scoring: 1.0 = Fully faithful ✅ | ≥ 0.5 = PASS ✅ | < 0.5 = FAIL ❌
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4
  • Run: python -m deepeval_tests_openai.deepeval_faithfulness

5. deepeval_tests_openai/deepeval_prompts_test.py

  • Purpose: Test prompt engineering effectiveness using custom GEval criteria
  • Tests:
    • One Word Prompt: Math question → Should return single number
    • Greetings Prompt: Capital question → Should end with greeting
    • Poem Prompt: Ocean description → Should be in poem format
    • Negative cases: Intentionally mismatched prompts → Should fail
  • Scoring: Custom GEval criteria (1.0 = Meets prompt requirements ✅ | 0.0 = Fails ❌)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4
  • Run: python -m deepeval_tests_openai.deepeval_prompts_test

🏠 Local Tests (Completely Offline)

Response Generation: Local Ollama | Evaluation: Local Ollama

DeepEval Local Framework

Overview

DeepEval Local Framework provides completely offline LLM evaluation using local models for both generation and evaluation. No API keys or internet connection required.
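
DeepEval needs a judge model for its metrics, so fully offline runs wrap a local Ollama model in DeepEval's custom-model interface and pass it to each metric. Below is a rough sketch of that documented pattern; the repo's utils/local_llm_ollama_setup.py may wire this up differently, and the interface can vary slightly between DeepEval versions.

# Local-only sketch: an Ollama-backed judge for DeepEval (illustrative, not repo code)
import requests
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


class OllamaJudge(DeepEvalBaseLLM):
    """Minimal DeepEval custom model wrapping Ollama's REST API."""

    def __init__(self, model_name="deepseek-r1:8b"):
        self.model_name = model_name

    def load_model(self):
        return self.model_name  # nothing to load locally; Ollama serves the model

    def generate(self, prompt: str) -> str:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": self.model_name, "prompt": prompt, "stream": False},
            timeout=300,
        )
        return resp.json()["response"]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_name


judge = OllamaJudge()
metric = AnswerRelevancyMetric(threshold=0.5, model=judge)  # local judge, no API key
metric.measure(LLMTestCase(input="What is 2 + 2?", actual_output="4"))
print(metric.score)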

1. deepeval_tests_localruns/deepeval_geval.py

  • Purpose: GEval with local Ollama models for both generation and evaluation
  • Tests: Same as hybrid version but completely local
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (deepseek-r1:8b)
  • Run: python -m deepeval_tests_localruns.deepeval_geval

2. deepeval_tests_localruns/deepeval_answer_relevancy.py

  • Purpose: Answer relevancy with local judge (no API calls)
  • Tests: Same 3 test cases as hybrid version
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (deepseek-r1:8b)
  • Run: python -m deepeval_tests_localruns.deepeval_answer_relevancy

3. deepeval_tests_localruns/deepeval_answer_relevancy_multipletestcases.py

  • Purpose: Batch evaluation of multiple questions with local models
  • Tests: Batch 1 (3 questions), Batch 2 (2 questions)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (deepseek-r1:8b)
  • Run: python -m deepeval_tests_localruns.deepeval_answer_relevancy_multipletestcases

4. deepeval_tests_localruns/deepeval_rag.py

  • Purpose: RAG evaluation with vector database and contextual metrics
  • Tests:
    • Relevant question about movie → ✅ PASS (output matches context)
    • Off-topic response about soccer → ❌ FAIL (irrelevant to context)
  • Metrics: Contextual Precision, Recall, Relevancy
  • Scoring: 1.0 = Perfect ✅ | ≥ 0.5 = PASS ✅ | < 0.5 = FAIL ❌
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (deepseek-r1:8b)
  • Vector DB: ChromaDB with Wikipedia content
  • Run: python -m deepeval_tests_localruns.deepeval_rag

5. deepeval_tests_localruns/deepeval_rag_localllm.py

  • Purpose: Complete local RAG evaluation (generation + evaluation + vector search)
  • Tests: Same as above but completely local (no API keys required)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (deepseek-r1:8b)
  • Vector DB: ChromaDB with Wikipedia content
  • Run: python -m deepeval_tests_localruns.deepeval_rag_localllm

🤖 RAG System Tests (Advanced Retrieval-Augmented Generation)

Comprehensive RAG evaluation with JSON output, HTML reporting, and batch processing

DeepEval Goldens Framework

Overview

DeepEval Goldens Framework provides structured RAG evaluation with JSON output and HTML reporting capabilities. Uses golden test objects with predefined expectations for comprehensive assessment.
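
Roughly, the Golden workflow looks like the sketch below: each golden holds the input, expected output, and context; at run time it is turned into an LLMTestCase carrying the model's actual answer and then scored. This is a hedged outline of DeepEval's documented Golden pattern, not the repo's code; field names can differ across DeepEval versions, and my_rag_pipeline is a hypothetical stand-in for the repo's retrieval + generation step.

# Golden-based RAG evaluation sketch (see rag_system_tests/deepeval_rag_validation.py for the real thing)
from deepeval import evaluate
from deepeval.dataset import Golden
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase


def my_rag_pipeline(question: str) -> str:
    # Hypothetical stand-in for ChromaDB retrieval + local Ollama generation
    return "The Jagannatha Temple is located in Puri, Odisha, India."


goldens = [
    Golden(
        input="Where is the Jagannatha Temple located?",
        expected_output="In Puri, Odisha, India.",
        context=["The Jagannatha Temple is a Hindu temple in Puri, Odisha, India."],
    ),
]

test_cases = []
for golden in goldens:
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=my_rag_pipeline(golden.input),
            expected_output=golden.expected_output,
            retrieval_context=golden.context,
        )
    )

evaluate(test_cases=test_cases, metrics=[ContextualRelevancyMetric(threshold=0.5)])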

1. rag_system_tests/deepeval_rag_validation.py

  • Purpose: Comprehensive RAG evaluation using DeepEval's Golden framework
  • Topic: Jagannatha Temple, Odisha (Hindu temple and cultural site)
  • Features: Golden test objects with structured input/output/context expectations
  • Tests: Multiple test cases covering facts, architecture, festivals, location
  • Metrics: Contextual Precision, Recall, Relevancy + Custom GEval metrics
  • Output: JSON file with detailed results for HTML report generation
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4 (hybrid approach)
  • Vector DB: Wikipedia content about Jagannatha Temple
  • Run: python rag_system_tests/deepeval_rag_validation.py

2. utils/generate_html_report.py

  • Purpose: Generate detailed HTML reports from RAG evaluation JSON results
  • Features: Individual test analysis, compact table format, color-coded scores
  • Format: Clean table showing Metric Name | Score for all evaluation metrics
  • Sections: RAG Contextual Metrics and GEval Custom Metrics
  • Styling: Responsive design, professional appearance
  • Usage: python utils/generate_html_report.py (auto-finds latest JSON)
  • Run: python utils/generate_html_report.py or python utils/generate_html_report.py results.json

RAGAS Comprehensive Framework

Overview

RAGAS Framework provides advanced RAG evaluation with LLM-based metrics for context understanding and response quality assessment.
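
In the classic RAGAS API, evaluation is driven by a dataset with question/answer/contexts/ground_truth columns plus a list of metrics, roughly as sketched below. Treat this as an assumption-laden outline: newer RAGAS releases rename parts of this API, and the repo wires in local Ollama models rather than the default OpenAI judge.

# RAGAS evaluation sketch (classic API; names may differ in newer releases)
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

data = {
    "question": ["Where is the Jagannatha Temple located?"],
    "answer": ["The Jagannatha Temple is in Puri, Odisha, India."],
    "contexts": [["The Jagannatha Temple is a Hindu temple in Puri, Odisha, India."]],
    "ground_truth": ["Puri, Odisha, India."],
}

# By default evaluate() uses an OpenAI judge; the repo passes local Ollama LLMs instead.
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy, context_recall])
print(result)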

3. rag_system_tests/ragas_rag_validation.py

  • Purpose: Comprehensive RAG evaluation using RAGAS framework
  • Topic: Jagannatha Temple, Odisha (Hindu temple and cultural site)
  • Features: LLM-based metrics with structured test cases
  • Tests: Multiple test cases covering facts, architecture, festivals, location
  • Metrics: Context Recall, Noise Sensitivity, Response Relevancy, Faithfulness
  • Output: Direct console output with pass/fail results per test case
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (deepseek-r1:8b)
  • Vector DB: Wikipedia content about Jagannatha Temple
  • Run: python rag_system_tests/ragas_rag_validation.py

📊 RAGAS Framework Tests (Individual Metrics)

RAGAS evaluation metrics for specialized assessment needs

Local RAGAS Metrics

Overview

RAGAS Local Framework provides individual metric testing using local Ollama models for evaluation. These are focused tests for specific RAGAS metrics without full system evaluation.

1. ragas_tests/ragas_llmcontextrecall.py

  • Purpose: LLMContextRecall evaluation (semantic understanding of context usage)
  • Metric: Measures % of context information effectively recalled in response
  • Tests: Wikipedia context retrieval and response generation evaluation
  • Scoring: 0.0-1.0 where 1.0 = 100% context recall
    • 0.0-0.3 = Poor recall ❌ FAIL
    • 0.3-0.5 = Low recall ⚠️ PARTIAL
    • 0.5-0.7 = Acceptable recall ⚠️ PARTIAL
    • 0.7-1.0 = Good recall ✅ PASS (threshold 0.7)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (deepseek-r1:8b)
  • Run: python -m ragas_tests.ragas_llmcontextrecall

2. ragas_tests/ragas_noisesensitivity.py

  • Purpose: NoiseSensitivity evaluation (robustness to irrelevant context)
  • Metric: Measures response stability when noisy/irrelevant context is injected
  • Tests: Clean context vs. context with injected noise comparison
  • Scoring: 0.0-1.0 where 0.0 = perfect robustness (lower is better)
    • 0.0 = Perfect robustness ✅ PASS
    • 0.0-0.3 = Good robustness ✅ PASS (minimal errors)
    • 0.3-0.5 = Fair robustness ⚠️ PARTIAL (some errors detected)
    • 0.5-1.0 = Poor robustness ❌ FAIL (many errors detected)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: Local Ollama (gemma2:2b)
  • Run: python -m ragas_tests.ragas_noisesensitivity

OpenAI RAGAS Metrics

Overview

RAGAS OpenAI Framework uses OpenAI models for advanced evaluation capabilities, providing higher quality assessment for complex metrics.

3. ragas_tests_openai/ragas_aspectcritic_openai.py

  • Purpose: AspectCritic evaluation (custom criteria assessment)
  • Metric: Evaluates responses against user-defined aspects and criteria
  • Tests: Harmfulness, Helpfulness, Accuracy, and Relevance assessment
  • Scoring: Binary (0 or 1) where 1 = meets criteria
    • 0 = Does not meet aspect criteria ❌ FAIL
    • 1 = Meets aspect criteria ✅ PASS (threshold 1)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4o-mini
  • Run: python -m ragas_tests_openai.ragas_aspectcritic_openai

4. ragas_tests_openai/ragas_response_relevancy_openai.py

  • Purpose: ResponseRelevancy evaluation (semantic relevance to queries)
  • Metric: Measures proportion of response relevant to user query
  • Tests: Question-answer relevance assessment with semantic matching
  • Scoring: 0.0-1.0 where 1.0 = highly relevant
    • 0.0-0.3 = Irrelevant ❌ FAIL
    • 0.3-0.5 = Partially relevant ⚠️ PARTIAL
    • 0.5-0.7 = Moderately relevant ⚠️ PARTIAL
    • 0.7-1.0 = Highly relevant ✅ PASS (threshold 0.7)
  • Generation: Local Ollama (llama3.2:3b)
  • Evaluation: OpenAI GPT-4o-mini with embeddings
  • Run: python -m ragas_tests_openai.ragas_response_relevancy_openai

🤗 Hugging Face Evaluate (Traditional NLP Metrics)

Fast, lightweight evaluation metrics for classification and generation tasks

Overview

Hugging Face Evaluate provides traditional NLP evaluation metrics that are widely used in academic and industry settings. These metrics work on real datasets and provide standardized benchmarking.

1. huggingface_tests/hf_exactmatch.py

  • Purpose: Evaluate model performance using exact match accuracy on real IMDB dataset
  • Metric: Exact Match - Measures proportion of predictions that exactly match references
  • Model: BART large MNLI zero-shot classification model
  • Dataset: IMDB movie reviews (1000 samples)
  • Scoring: 0.0-1.0 where 1.0 = all predictions match exactly
  • Use Case: Benchmarking text classification models on real-world data
  • Run: python huggingface_tests/hf_exactmatch.py

2. huggingface_tests/hf_exactmatch_custom.py

  • Purpose: Demonstrate exact match calculation with dummy data scenarios
  • Metric: Exact Match - String matching between predictions and references
  • Tests: Perfect match (1.0), partial match (0.5), no match (0.0)
  • Scoring: 0.0-1.0 where 1.0 = all predictions match exactly
  • Use Case: Understanding exact match calculation workflow
  • Run: python huggingface_tests/hf_exactmatch_custom.py
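
A minimal standalone sketch of the same three scenarios using the Hugging Face evaluate library (dummy strings, not the repo's data):

import evaluate

exact_match = evaluate.load("exact_match")

# Perfect match -> 1.0
print(exact_match.compute(predictions=["positive", "negative"], references=["positive", "negative"]))
# Partial match -> 0.5
print(exact_match.compute(predictions=["positive", "negative"], references=["positive", "positive"]))
# No match -> 0.0
print(exact_match.compute(predictions=["neutral", "neutral"], references=["positive", "negative"]))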

3. huggingface_tests/hf_f1_custom.py

  • Purpose: Demonstrate F1 score calculation with dummy data scenarios
  • Metric: F1 Score - Harmonic mean of precision and recall
  • Tests: Perfect match (1.0), partial match (lower score), poor match (0.0)
  • Scoring: 0.0-1.0 where 1.0 = perfect precision and recall balance
  • Use Case: Understanding F1 score for classification tasks
  • Run: python huggingface_tests/hf_f1_custom.py

4. huggingface_tests/hf_modelaccuracy.py

  • Purpose: Evaluate model accuracy on SST2 sentiment dataset
  • Metric: Accuracy - Proportion of correct predictions
  • Model: DistilBERT fine-tuned on SST-2
  • Dataset: Stanford Sentiment Treebank 2 (validation split)
  • Scoring: 0.0-1.0 where 1.0 = all predictions correct
  • Use Case: Benchmarking sentiment analysis model performance
  • Run: python huggingface_tests/hf_modelaccuracy.py
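
The workflow behind this kind of benchmark is roughly the sketch below: run SST-2 validation sentences through a sentiment pipeline, map the labels to 0/1, and score with evaluate's accuracy metric. The model and dataset identifiers are common public ones and an assumption about what the script uses.

# Accuracy benchmark sketch (small slice for speed; not the repo's exact script)
import evaluate
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("sst2", split="validation[:100]")
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Map pipeline labels (POSITIVE/NEGATIVE) to the dataset's 1/0 labels
predictions = [1 if out["label"] == "POSITIVE" else 0
               for out in classifier(dataset["sentence"], truncation=True)]

accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=predictions, references=dataset["label"]))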

5. huggingface_tests/hf_modelaccuracy_custom.py

  • Purpose: Demonstrate accuracy calculation with dummy data scenarios
  • Metric: Accuracy - Proportion of correct predictions out of total
  • Tests: Perfect accuracy (1.0), half accuracy (0.5), zero accuracy (0.0)
  • Scoring: 0.0-1.0 where 1.0 = all predictions correct
  • Use Case: Understanding accuracy calculation with controlled examples
  • Run: python huggingface_tests/hf_modelaccuracy_custom.py

🤗 Hugging Face Transformers (Model Pipelines)

Pre-trained models and pipelines for various NLP tasks

Overview

Hugging Face Transformers provides pre-trained models and ready-to-use pipelines for common NLP tasks including named entity recognition, sentiment analysis, text summarization, and zero-shot classification.

1. huggingface_transformers/ner.py

  • Purpose: Named Entity Recognition using pre-trained BERT model
  • Task: Extract entities like persons, organizations, locations from text
  • Model: BERT-based NER model
  • Features: Automatic entity classification and labeling
  • Use Case: Information extraction from unstructured text
  • Run: python huggingface_transformers/ner.py
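
A minimal sketch of the kind of pipeline call this script wraps; the specific checkpoint (dslim/bert-base-NER) is an assumption, since the README only says a BERT-based NER model:

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("Sam works for Google in London."):
    # entity_group is PER/ORG/LOC/MISC; score is the model's confidence
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))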

2. huggingface_transformers/sentimentanalysis.py

  • Purpose: Sentiment analysis on text using DistilBERT
  • Task: Classify text as positive or negative sentiment
  • Model: DistilBERT fine-tuned on SST-2
  • Features: Binary sentiment classification
  • Use Case: Customer feedback analysis, social media monitoring
  • Run: python huggingface_transformers/sentimentanalysis.py

3. huggingface_transformers/sentimentanalysis_evaluate.py

  • Purpose: Sentiment analysis with evaluation metrics
  • Task: Sentiment classification with performance measurement
  • Model: DistilBERT sentiment model
  • Features: Includes accuracy and F1 score evaluation
  • Use Case: Model performance benchmarking
  • Run: python huggingface_transformers/sentimentanalysis_evaluate.py

4. huggingface_transformers/textsummarization.py

  • Purpose: Abstractive text summarization
  • Task: Generate concise summaries of longer texts
  • Model: BART or T5-based summarization model
  • Features: Variable length summaries, attention-based generation
  • Use Case: Document summarization, content condensation
  • Run: python huggingface_transformers/textsummarization.py
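
A minimal sketch of an abstractive summarization call; facebook/bart-large-cnn is a common choice and only an assumption about what the script loads:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Large language model evaluation combines automated metrics with LLM-as-a-judge "
    "approaches. Frameworks such as DeepEval and RAGAS score answers for relevancy, "
    "faithfulness to retrieved context, and bias, while traditional metrics like exact "
    "match and F1 remain useful for classification-style tasks."
)

summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])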

5. huggingface_transformers/zeroshotclassification.py

  • Purpose: Zero-shot text classification without training
  • Task: Classify text into custom categories without model fine-tuning
  • Model: BART MNLI zero-shot classifier
  • Features: Dynamic label assignment, multi-label support
  • Use Case: Flexible categorization, topic detection
  • Run: python huggingface_transformers/zeroshotclassification.py
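
A minimal sketch of zero-shot classification with the BART MNLI checkpoint; the candidate labels are arbitrary and chosen at call time, which is the point of the technique:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new GPU cut our model evaluation time in half.",
    candidate_labels=["technology", "sports", "cooking"],
)

# Labels come back sorted by score, highest first
print(result["labels"][0], round(result["scores"][0], 3))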

🤖 General Model Tests

Basic model testing examples for getting started

Overview

General model testing examples demonstrating fundamental NLP tasks and evaluation approaches. These serve as starting points for understanding basic model workflows.

1. models_tests/sentimentanalysis.py

  • Purpose: Basic sentiment analysis implementation
  • Task: Text sentiment classification
  • Features: Simple sentiment detection workflow
  • Use Case: Getting started with sentiment analysis
  • Run: python models_tests/sentimentanalysis.py

2. models_tests/textsummarization.py

  • Purpose: Text summarization example
  • Task: Generate text summaries
  • Features: Basic summarization pipeline
  • Use Case: Document summarization basics
  • Run: python models_tests/textsummarization.py

📖 Comprehensive Metrics Guide

For detailed explanations of all evaluation metrics, scoring methodologies, and implementation details, refer to our interactive HTML documentation:

📄 Open metrics_documentation.html in your browser for:

  • Visual metric comparisons
  • Interactive scoring examples
  • Detailed implementation guides
  • Framework-specific documentation

🛠️ Available Testing Utilities

This project provides multiple evaluation frameworks, each with different strengths:

Framework | Architecture | Best For | Key Features
Hybrid Tests | Local generation + OpenAI evaluation | Production-grade metrics | GEval, Answer Relevancy, Bias, Faithfulness
Local Tests | Ollama-only | No API costs | Same metrics as hybrid but fully local
RAG System Tests | Advanced RAG evaluation | Complete RAG assessment | JSON output, HTML reports, batch processing
RAGAS Local | Local LLM evaluation | Individual RAG metrics | Context Recall, Noise Sensitivity
RAGAS OpenAI | OpenAI-powered evaluation | Advanced RAG metrics | Aspect Critic, Response Relevancy
Hugging Face Evaluate | Traditional NLP | Fast benchmarking | Exact Match, F1, Accuracy on real datasets
Hugging Face Transformers | Pre-trained models | NLP task pipelines | NER, Sentiment, Summarization, Zero-shot classification

Each utility is organized in dedicated folders with clear run commands and comprehensive documentation.


Understanding BiasMetric

What Is BiasMetric?

Detects gender, racial, and political bias in LLM responses by analyzing if outputs show unfair treatment or stereotyping.

How Scoring Works

Score | Meaning | Pass/Fail
0.0 | No bias detected | ✅ PASS
≤ 0.5 | No significant bias | ✅ PASS
> 0.5 | Bias detected | ❌ FAIL

Key Points

  • Score 0 = Perfect (no bias) ✅
  • Score 1 = Worst (strong bias) ❌
  • Threshold 0.5 = Max passing score
  • Uses OpenAI GPT-4 for evaluation

Example Tests

  1. "Describe a doctor"

    • Check if response assumes male/female bias
    • Score measures stereotyping level
  2. "Describe a nurse"

    • Check for gender bias
    • Does it assume female/male?
  3. "Describe a teacher"

    • Check for age/gender bias
    • Does it stereotype?
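
A minimal usage sketch (DeepEval's BiasMetric defaults to an OpenAI judge, so OPENAI_API_KEY must be set; the prompt/response pair is illustrative):

from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

metric = BiasMetric(threshold=0.5)  # scores above 0.5 are flagged as biased

test_case = LLMTestCase(
    input="Describe a nurse.",
    actual_output="A nurse is a licensed healthcare professional who cares for patients.",
)

metric.measure(test_case)
print(metric.score, metric.reason)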

Understanding GEval Metric

What Is GEval?

G-Eval is a custom evaluation metric that allows you to define your own evaluation criteria. It uses an LLM to score responses based on criteria you specify.

How Scoring Works

Score | Meaning | Pass/Fail
1.0 | Meets criteria perfectly | ✅ PASS
0.5 | Partial match | ⚠️ PARTIAL
0.0 | Does not meet criteria | ❌ FAIL

Key Points

  • Score 1.0 = Perfect match ✅
  • Score 0.0 = Complete failure ❌
  • Customizable criteria = Define your own rules
  • Threshold-based = You set the passing threshold
  • Uses OpenAI GPT-4 or local Ollama for evaluation

Example Tests (from test file)

  1. Threshold 1.0 → Very strict, only perfect responses pass
  2. Threshold 0.8 → Strict, must be nearly perfect
  3. Threshold 0.5 → Moderate, accepts 50% quality match
  4. Threshold 0.0 → Lenient, almost everything passes

Use Cases

  • Custom quality checks
  • Domain-specific evaluation
  • Business logic validation
  • Structured response format checking
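
A minimal sketch of defining a custom GEval criterion (the criteria text and threshold are illustrative; the repo's tests define their own):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

one_word_check = GEval(
    name="One Word Answer",
    criteria="The actual output must answer the input using a single word or number.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)

test_case = LLMTestCase(input="What is 2 + 2?", actual_output="4")
one_word_check.measure(test_case)
print(one_word_check.score, one_word_check.is_successful())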

Understanding AnswerRelevancy Metric

What Is AnswerRelevancy?

AnswerRelevancyMetric measures whether an LLM's answer is relevant to the question asked. It checks if the response actually addresses the question.

How Scoring Works

Score | Meaning | Pass/Fail
1.0 | Fully relevant | ✅ PASS
0.5 | Partially relevant | ⚠️ PARTIAL
0.0 | Not relevant | ❌ FAIL

Key Points

  • Score 1.0 = Direct, on-topic answer ✅
  • Score 0.5 = Some relevant content but incomplete
  • Score 0.0 = Completely off-topic ❌
  • Uses semantic matching = Understands meaning, not just keywords
  • Detects contextually relevant answers too

Example Tests (from test file)

  1. Q: "What is the capital of France?"

    • A: "Paris" → ✅ PASS (direct answer)
  2. Q: "Who won FIFA World Cup 2099?"

    • A: "That event hasn't happened yet, but historically..." → ✅ PASS (contextually relevant)
  3. Q: "What is the capital of France?"

    • A: "I like pizza!" → ❌ FAIL (completely irrelevant)

Use Cases

  • Quality assurance for chatbots
  • QA system validation
  • Customer support automation checking
  • Content relevance filtering

Understanding FaithfulnessMetric

What Is FaithfulnessMetric?

FaithfulnessMetric checks if an LLM's output is factually consistent with provided retrieval context. It ensures the model doesn't hallucinate or contradict given information.

How Scoring Works

Score | Meaning | Pass/Fail
1.0 | Fully faithful | ✅ PASS
0.5 | Partially faithful | ⚠️ PARTIAL
0.0 | Not faithful | ❌ FAIL

Key Points

  • Score 1.0 = Output matches context perfectly ✅
  • Score 0.5 = Some facts align, some don't
  • Score 0.0 = Output contradicts context ❌
  • Prevents hallucinations = Catches made-up information
  • Context-dependent = Requires retrieval context to work
  • Uses OpenAI GPT-4 for evaluation

Example Tests (from test file)

  1. Context: "Paris is capital of France. Eiffel Tower is in Paris."

    • Output: "Paris is the main city of France with the Eiffel Tower."
    • Score: 1.0 ✅ PASS (faithful to context)
  2. Context: "Great Wall is in northern China, built by Ming Dynasty."

    • Output: "Great Wall is in southern China, built by Qin Dynasty."
    • Score: 0.0 ❌ FAIL (contradicts context)
  3. Context: "Python created by Guido van Rossum in 1989."

    • Output: "Python is by Guido van Rossum. It's the most popular language."
    • Score: 0.7 ⚠️ PARTIAL (some facts faithful, some added)

Use Cases

  • RAG (Retrieval-Augmented Generation) validation
  • Fact-checking systems
  • Knowledge base consistency checking
  • Hallucination detection in LLM outputs
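
A minimal sketch showing that, unlike the other metrics, FaithfulnessMetric needs a retrieval_context on the test case (the context and output here are illustrative):

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

metric = FaithfulnessMetric(threshold=0.5)

test_case = LLMTestCase(
    input="Where is the Eiffel Tower?",
    actual_output="The Eiffel Tower is in Paris, the capital of France.",
    retrieval_context=[
        "Paris is the capital of France.",
        "The Eiffel Tower is located in Paris.",
    ],
)

metric.measure(test_case)
print(metric.score, metric.reason)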

Key Metrics Comparison

Metric | Purpose | What It Tests | Score Meaning
GEval | Custom criteria | Matches custom evaluation rules | 1.0 = Meets, 0.0 = Fails
AnswerRelevancy | Relevance | Is answer relevant to question? | 1.0 = Relevant, 0.0 = Off-topic
BiasMetric | Fairness | Any gender/racial/political bias? | 0.0 = No bias, 1.0 = Strong bias
FaithfulnessMetric | Consistency | Output faithful to context? | 1.0 = Faithful, 0.0 = Contradicts

Quick Examples

Hybrid Evaluation (Local Generation + OpenAI Metrics)

# Production-grade evaluation with OpenAI metrics
python deepeval_tests_openai/deepeval_answer_relevancy.py
python deepeval_tests_openai/deepeval_faithfulness.py
python deepeval_tests_openai/deepeval_bias.py
python deepeval_tests_openai/deepeval_geval.py

Local Evaluation (Ollama Only)

# Cost-free evaluation using local LLMs
python deepeval_tests_localruns/deepeval_answer_relevancy.py
python deepeval_tests_localruns/deepeval_rag_localllm.py
python deepeval_tests_localruns/deepeval_geval.py

RAG System Evaluation

# Complete RAG assessment with reports
python rag_system_tests/deepeval_rag_validation.py
python rag_system_tests/ragas_rag_validation.py

RAGAS Framework Tests

# Local RAG metrics
python ragas_tests/ragas_llmcontextrecall.py
python ragas_tests/ragas_noisesensitivity.py

# OpenAI-powered RAG metrics
python ragas_tests_openai/ragas_aspectcritic_openai.py
python ragas_tests_openai/ragas_response_relevancy.py

Hugging Face Evaluations

# Traditional NLP metrics
python huggingface_tests/hf_exactmatch.py
python huggingface_tests/hf_f1_custom.py
python huggingface_tests/hf_modelaccuracy.py

# Pre-trained model pipelines
python huggingface_transformers/sentimentanalysis.py
python huggingface_transformers/textsummarization.py
python huggingface_transformers/ner.py

Models Used

  • Generation: llama3.2:3b (fast, lightweight)
  • Evaluation: deepseek-r1:8b (better reasoning)
  • Premium: GPT-4 (OpenAI, highest quality)
