SAF-Eval (Search-Augmented Factuality Evaluator) is a modular Python package for evaluating the factuality of AI-generated responses. Based on academic research, it implements a systematic approach to measuring factual accuracy by breaking down responses into atomic facts and evaluating them against retrieved evidence.
- Modular Pipeline: Extract atomic facts, check relevance, retrieve supporting documents, evaluate factuality
- Self-Containment Processing: Automatically detect and fix non-self-contained facts by adding context
- Few-Shot Learning: Improve fact extraction using domain-specific examples
- Fact Deduplication: Identify and remove similar or duplicate facts to avoid redundant evaluations
- Comprehensive Logging: Detailed logging of the entire evaluation process for analysis and debugging
- Customizable Evaluation: Define your own categories and scoring rubrics
- Provider-Agnostic: Use any LLM provider through a consistent interface
- Flexible Retrieval: Integrate with any document retrieval system
- Comprehensive Metrics: Get detailed factuality scores and evaluations
SAF-Eval requires Python 3.12 or later.
# Clone the repository
git clone https://github.com/chandralegend/saf-eval.git
cd saf-eval
# Install dependencies with Poetry
poetry install
# Or install with pip
pip install saf-eval
import asyncio
import os
from dotenv import load_dotenv
from saf_eval.config import Config, LoggingConfig
from saf_eval.core.pipeline import EvaluationPipeline
from saf_eval.extraction.extractor import FactExtractor
from saf_eval.containment.checker import ContainmentChecker
from saf_eval.llm.providers.openai import OpenAILLM
from saf_eval.retrieval.providers.simple import SimpleRetriever
from saf_eval.evaluation.classifier import FactClassifier
from saf_eval.evaluation.scoring import FactualityScorer
# Load environment variables (for API keys)
load_dotenv()
async def evaluate_response():
    # Initialize components with shared config
    config = Config(
        scoring_rubric={
            "supported": 1.0,
            "contradicted": 0.0,
            "unverifiable": 0.5
        },
        retrieval_config={"top_k": 3},
        logging=LoggingConfig(level="INFO", console=True, file=True, log_dir="./logs")
    )

    llm = OpenAILLM(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))

    # Create a knowledge base
    knowledge_base = {
        "Mount Everest": "Mount Everest is the highest mountain above sea level at 29,032 feet (8,849 meters)."
    }

    # Setup the pipeline with components that share the same config
    pipeline = EvaluationPipeline(
        config=config,
        extractor=FactExtractor(config=config, llm=llm),
        retriever=SimpleRetriever(config=config, knowledge_base=knowledge_base),
        classifier=FactClassifier(config=config, llm=llm),
        scorer=FactualityScorer(config=config),
        containment_checker=ContainmentChecker(config=config, llm=llm)  # Add containment checker
    )

    # Evaluate a response
    response = "Mount Everest, at 29,032 feet, is the tallest mountain on Earth."
    context = "Information about geographical features"

    result = await pipeline.run(response, context)
    print(f"Factuality Score: {result.factuality_score:.2f}")

if __name__ == "__main__":
    asyncio.run(evaluate_response())
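The same pipeline object can be reused to score several responses. The sketch below is a minimal illustration that combines standard asyncio.gather with the pipeline.run(response, context) call from the quick start; the helper name and the batching logic are not part of the library itself.

async def evaluate_many(pipeline, responses, context):
    # Run the quick-start pipeline concurrently over several responses
    results = await asyncio.gather(
        *(pipeline.run(response, context) for response in responses)
    )
    for response, result in zip(responses, results):
        print(f"{result.factuality_score:.2f}  {response[:60]}")
    return results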
SAF-Eval follows a modular architecture with the following key components:
- Core: Pipeline coordination and data models
- Extraction: Breaking down responses into atomic facts
- Containment: Checking and fixing non-self-contained facts
- Relevancy: Assessing relevance of facts to the context
- Retrieval: Finding supporting documents for verification
- Evaluation: Classifying facts and calculating factuality scores
- LLM: Abstraction layer for language model providers
- Utils: Utility functions including deduplication and logging
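For reference, the snippets in this README pull these components from the following modules (the relevancy component runs inside the pipeline and is not imported directly in the examples):

from saf_eval.config import Config, LoggingConfig                # Configuration
from saf_eval.core.pipeline import EvaluationPipeline            # Core
from saf_eval.extraction.extractor import FactExtractor          # Extraction
from saf_eval.containment.checker import ContainmentChecker      # Containment
from saf_eval.retrieval.providers.simple import SimpleRetriever  # Retrieval
from saf_eval.evaluation.classifier import FactClassifier        # Evaluation
from saf_eval.evaluation.scoring import FactualityScorer         # Evaluation
from saf_eval.llm.providers.openai import OpenAILLM              # LLM
from saf_eval.utils.deduplication import deduplicate_facts       # Utils
from saf_eval.utils.logging import get_logger                    # Utils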
The ContainmentChecker helps identify and fix facts that require additional context to be understood:
from saf_eval.containment.checker import ContainmentChecker
from saf_eval.llm.providers.openai import OpenAILLM
# Initialize the components
llm = OpenAILLM(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))
containment_checker = ContainmentChecker(config=config, llm=llm)
# Check if facts are self-contained
checked_facts = await containment_checker.check_containment(facts, response)
# Fix non-self-contained facts by adding context
self_contained_facts = await containment_checker.self_contain_facts(
    checked_facts,
    response,
    context="Optional additional context to help with self-containment"
)
# Example: Converting "He wrote it in 1851" to "Herman Melville wrote Moby Dick in 1851"
The fact extractor can use the surrounding context to improve extraction quality:
extractor = FactExtractor(config=config, llm=llm)
facts = await extractor.extract_facts(
    response="Melville's masterpiece is considered one of the Great American Novels.",
    context="Discussion about the novel Moby Dick by Herman Melville"
)
You can provide examples to improve fact extraction through few-shot learning:
from typing import List, Tuple, Optional
# Define an example provider function
def my_example_provider(response: str, context: Optional[str] = None, **kwargs) -> List[Tuple[str, List[str]]]:
    """Provide domain-specific examples for fact extraction."""
    # Return a list of (example_text, [fact1, fact2, ...]) tuples
    return [
        (
            "The Eiffel Tower was completed in 1889 and stands at 330 meters tall.",
            ["The Eiffel Tower was completed in 1889.", "The Eiffel Tower is 330 meters tall."]
        ),
        # More examples...
    ]
# Use the example provider in the extractor
extractor = FactExtractor(
    config=config,
    llm=llm,
    example_provider=my_example_provider
)

# Extract facts with examples to guide the LLM
facts = await extractor.extract_facts(
    response="The Golden Gate Bridge was completed in 1937.",
    context="Information about famous structures"
)
Example providers can dramatically improve extraction quality by demonstrating the desired level of granularity and format.
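Because the provider receives both the response and the context, it can also branch on the context to supply domain-specific examples. The sketch below reuses the signature shown above; the keyword check and the example data are purely illustrative.

def domain_aware_example_provider(response: str, context: Optional[str] = None, **kwargs) -> List[Tuple[str, List[str]]]:
    """Pick extraction examples based on the evaluation context."""
    if context and "structure" in context.lower():
        # Examples tailored to architectural/engineering content
        return [
            (
                "The Brooklyn Bridge opened in 1883 and spans the East River.",
                ["The Brooklyn Bridge opened in 1883.", "The Brooklyn Bridge spans the East River."]
            )
        ]
    # Fall back to the general-purpose examples defined earlier
    return my_example_provider(response, context, **kwargs)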
You can implement your own retrieval system:
from typing import List
from saf_eval.core.models import AtomicFact, RetrievedDocument
from saf_eval.retrieval.base import RetrieverBase
class MyCustomRetriever(RetrieverBase):
    async def retrieve(self, fact: AtomicFact, **kwargs) -> List[RetrievedDocument]:
        documents: List[RetrievedDocument] = []
        # Implement your retrieval logic here: query your search backend
        # with the fact's text and append a RetrievedDocument for each hit
        # ...
        return documents
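The custom retriever then takes the place of SimpleRetriever when assembling the pipeline. The snippet below assumes your subclass accepts the shared config the way the built-in retriever does; adjust the constructor to whatever state your backend needs.

pipeline = EvaluationPipeline(
    config=config,
    extractor=FactExtractor(config=config, llm=llm),
    retriever=MyCustomRetriever(config=config),
    classifier=FactClassifier(config=config, llm=llm),
    scorer=FactualityScorer(config=config)
)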
Customize how facts are classified:
from saf_eval.evaluation.classifier import FactClassifier
classifier = FactClassifier(
    llm=my_llm,
    categories=["accurate", "partially_accurate", "inaccurate", "uncertain"]
)
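When you use custom categories, it makes sense to keep the scoring rubric in your Config aligned with the same keys so each category has a weight (see the configuration section below). The weights here are illustrative choices, not values prescribed by the library.

config = Config(
    scoring_rubric={
        "accurate": 1.0,
        "partially_accurate": 0.5,
        "inaccurate": 0.0,
        "uncertain": 0.25
    }
)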
The pipeline automatically deduplicates similar facts to avoid redundant evaluations:
from saf_eval.utils.deduplication import deduplicate_facts
from typing import List
from saf_eval.core.models import AtomicFact
# Default deduplication is already included in the pipeline, but can be customized:
def my_custom_deduplication(facts: List[AtomicFact]) -> List[AtomicFact]:
    # Custom logic to identify and merge similar facts
    # ...
    return deduplicated_facts
# Use custom deduplication in the pipeline
pipeline = EvaluationPipeline(
    # ...other arguments...
    deduplication_fn=my_custom_deduplication
)
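As one concrete possibility, the function below keeps the first fact for each normalized text string and drops exact rephrasings. The .text attribute used to read the fact's wording is an assumption made for illustration; adapt it to however AtomicFact exposes its text.

def normalized_text_deduplication(facts: List[AtomicFact]) -> List[AtomicFact]:
    # Keep the first fact for each normalized text; drop exact duplicates
    seen = set()
    unique_facts = []
    for fact in facts:
        key = " ".join(fact.text.lower().split())  # .text is an assumed attribute name
        if key not in seen:
            seen.add(key)
            unique_facts.append(fact)
    return unique_facts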
SAF-Eval provides detailed logging throughout the evaluation pipeline:
from saf_eval.config import Config, LoggingConfig
from saf_eval.utils.logging import get_logger
# Configure logging in the config object
config = Config(
    # ...other config settings...
    logging=LoggingConfig(
        level="DEBUG",       # Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
        console=True,        # Log to console
        file=True,           # Log to file
        log_dir="./logs",    # Directory for log files
        json_format=True     # Output logs in JSON format for easier parsing
    )
)
# Create a custom logger in any component
logger = get_logger(
    name="my-component",
    level=config.logging.level,
    log_dir=config.logging.log_dir,
    console=config.logging.console,
    file=config.logging.file,
    json_format=config.logging.json_format
)

# Log with structured data
logger.info("Processing fact", {
    "fact_id": "123",
    "fact_text": "Paris is the capital of France",
    "is_self_contained": True
})
The logging system tracks:
- Fact extraction steps and results
- Self-containment checks and fixes
- Deduplication decisions
- Document retrieval statistics
- Classification results
- Overall performance metrics
For a complete example, see examples/logging_example.py.
SAF-Eval uses a unified configuration system. The Config class centralizes all settings:
from saf_eval.config import Config
config = Config(
    # The scoring rubric defines both evaluation categories and their weights
    scoring_rubric={
        "fully_supported": 1.0,
        "partially_supported": 0.6,
        "contradicted": 0.0
    },
    # Additional configuration for retrieval methods
    retrieval_config={
        "top_k": 5,
        "min_relevance": 0.7
    },
    # Custom LLM configuration
    llm_config={
        "temperature": 0.1,
        "max_tokens": 500
    }
)
# Categories are automatically derived from the scoring rubric
print(config.evaluation_categories) # ['fully_supported', 'partially_supported', 'contradicted']
See the examples/ directory for more advanced usage patterns, including self_containment_example.py, which demonstrates how to process non-self-contained facts.
SAF-Eval includes several example scripts to demonstrate key features:
- basic_usage.py: Simple end-to-end evaluation
- custom_retriever.py: Implementing a custom retrieval system
- custom_evaluation.py: Custom evaluation categories and scoring
- self_containment_example.py: Handling non-self-contained facts
- deduplication_example.py: Custom fact deduplication
- few_shot_extraction_example.py: Domain-specific extraction examples
- logging_example.py: Comprehensive logging setup
Contributions are welcome! Please feel free to submit a Pull Request.
This work is inspired by research on evaluating factuality in large language models, particularly:
@misc{wei2024long,
  title={Long-form factuality in large language models},
  author={Wei, Jerry and Yang, Chengrun and Song, Xinying and Lu, Yifeng and Hu, Nathan and Huang, Jie and Tran, Dustin and Peng, Daiyi and Liu, Ruibo and Huang, Da and Du, Cosmo and Le, Quoc V.},
  year={2024},
  url={https://arxiv.org/abs/2403.18802},
}
The approach of decomposing responses into atomic facts and evaluating them against retrieved evidence follows methodologies outlined in the above paper. We are grateful to the authors for their contributions to this field.
This project is licensed under the MIT License - see the LICENSE file for details.