A simple Q&A bot for technical documentation designed to test and compare different LLM evaluation frameworks including DeepEval, LangChain Evaluation, RAGAS, and OpenAI Evals.
This project serves as a testbed for comparing how different evaluation frameworks assess the same RAG (Retrieval-Augmented Generation) system.
- Clone the repository

  ```bash
  git clone https://github.com/LiteObject/eval-framework-sandbox.git
  cd eval-framework-sandbox
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables

  ```bash
  cp .env.example .env
  # Edit .env with your API keys (optional unless running remote evals)
  ```

- Ask a question

  ```bash
  python -m src.main "How do you install the Python requests library?"
  ```

  The bot will print a synthesized answer and list the matching documents.

- Run the unit tests

  ```bash
  pytest
  ```

- (Optional) Try an evaluation framework
  - Update `.env` with the relevant API keys or enable the Ollama flag for a local model (details below).
  - Install extras: `pip install -r requirements.txt` already includes the optional libraries, or run `pip install .[eval]` after an editable install.
  - Use the runner scripts in `evaluations/` as starting points; each script writes its results into `results/`.
The core QA bot already runs fully offline using TF-IDF retrieval. If you also want LangChain's evaluators to call a local Ollama model instead of OpenAI:
- Install Ollama and pull a model, e.g. `ollama pull llama3`.
- Set the following environment variables (via `.env` or your shell):
  - `LANGCHAIN_USE_OLLAMA=true`
  - `OLLAMA_MODEL=llama3` (or any other pulled model)
  - Optionally `OLLAMA_BASE_URL=http://localhost:11434` if you're running Ollama on a non-default host/port.
- Leave `OPENAI_API_KEY` blank; the LangChain evaluator will detect the Ollama flag and use `ChatOllama`.

If `LANGCHAIN_USE_OLLAMA` is false, the evaluator falls back to `ChatOpenAI` and expects a valid `OPENAI_API_KEY` plus `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
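For reference, here is a minimal sketch of what that backend selection can look like, assuming the `langchain-community` and `langchain-openai` packages; the actual evaluator code in `evaluations/` may differ in detail:

```python
# Illustrative sketch only; not the project's actual implementation.
import os

from langchain_community.chat_models import ChatOllama
from langchain_openai import ChatOpenAI


def build_chat_model():
    """Return a local ChatOllama model when the flag is set, otherwise ChatOpenAI."""
    if os.getenv("LANGCHAIN_USE_OLLAMA", "").lower() == "true":
        return ChatOllama(
            model=os.getenv("OLLAMA_MODEL", "llama3"),
            base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        )
    return ChatOpenAI(model=os.getenv("LANGCHAIN_OPENAI_MODEL", "gpt-3.5-turbo"))
```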
These integrations are opt-in. Install the additional dependencies with:
```bash
pip install .[eval]
```

Each runner expects the dataset built from the JSON files `data/questions.json` and `data/ground_truth.json`. The helper below mirrors what the runners use internally:
```python
from pathlib import Path

from evaluations.utils import load_dataset_from_files

dataset = load_dataset_from_files(
    Path("data/questions.json"),
    Path("data/ground_truth.json"),
)
```
For DeepEval:

- Set `DEEPEVAL_API_KEY` in `.env` if you plan to submit results to the hosted DeepEval service (local scoring works without it).
- Run the runner programmatically:

  ```python
  from evaluations.deepeval_runner import DeepEvalRunner

  runner = DeepEvalRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  The report is also written to `results/deepeval_result.json`.
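Independently of the runner, you can sanity-check a single question with DeepEval's own API. The snippet below is a sketch using `LLMTestCase` and `AnswerRelevancyMetric` with made-up input/output strings; note that this metric relies on an LLM judge (OpenAI by default), so it needs `OPENAI_API_KEY`:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Illustrative strings; in practice, use an entry from data/questions.json
# and the bot's actual answer and retrieved context.
test_case = LLMTestCase(
    input="How do you install the Python requests library?",
    actual_output="Run `pip install requests` in your terminal.",
    retrieval_context=["Install requests with `pip install requests`."],
)

metric = AnswerRelevancyMetric(threshold=0.7)  # LLM-judged metric
metric.measure(test_case)
print(metric.score)
```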
For LangChain Evaluation:

- Choose your backend:
  - Remote OpenAI models: set `OPENAI_API_KEY` and optionally `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
  - Local Ollama: set `LANGCHAIN_USE_OLLAMA=true`, `OLLAMA_MODEL`, and optionally `OLLAMA_BASE_URL`; no OpenAI key required.
- Invoke the runner:

  ```python
  from evaluations.langchain_eval_runner import LangChainEvalRunner

  runner = LangChainEvalRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  LangChain will call the configured chat model to grade responses and store the output at `results/langchain_result.json`.
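To see the kind of grading LangChain performs, here is a standalone sketch using its built-in `"qa"` evaluator directly, separate from the project's runner. It assumes the Ollama backend from the local-model section; a `ChatOpenAI` instance works the same way:

```python
from langchain.evaluation import load_evaluator
from langchain_community.chat_models import ChatOllama

# Grade a single prediction against a reference answer.
llm = ChatOllama(model="llama3")
evaluator = load_evaluator("qa", llm=llm)

graded = evaluator.evaluate_strings(
    input="How do you install the Python requests library?",
    prediction="Run `pip install requests`.",
    reference="Install it with `pip install requests`.",
)
print(graded)  # typically includes a verdict such as CORRECT/INCORRECT and a score
```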
For RAGAS:

- Install the `ragas` extras (already included in `.[eval]`). Some metrics call an LLM; set `OPENAI_API_KEY` or configure RAGAS to use a local model before running.
- Evaluate the dataset:

  ```python
  from evaluations.ragas_runner import RagasRunner

  runner = RagasRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  The raw metric results are saved to `results/ragas_result.json`.
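For orientation, a direct RAGAS call looks roughly like the sketch below, independent of the runner. Column names vary between `ragas` versions (e.g. `ground_truth` vs. `ground_truths`), and the default metrics call OpenAI, so treat this as an approximation:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One illustrative record; the project's runner builds this from data/*.json instead.
data = Dataset.from_dict({
    "question": ["How do you install the Python requests library?"],
    "answer": ["Run `pip install requests`."],
    "contexts": [["Install requests with `pip install requests`."]],
    "ground_truth": ["pip install requests"],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)
```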
For OpenAI Evals, this repository only prepares the dataset and relies on OpenAI's CLI for the actual evaluation. Ensure `evals` is installed and `OPENAI_API_KEY` is set, then use `evaluations/openai_eval_runner.py` to export a dataset and follow the OpenAI Evals documentation to launch the experiments with `oaieval`.
Repository layout:

- `data/`: Test questions, ground truth, and source documents
- `src/`: Core Q&A bot implementation
- `evaluations/`: Framework-specific evaluation scripts
- `results/`: Evaluation results and comparisons (gitignored except for `.gitkeep`)
Metrics compared across frameworks include:

- Answer Correctness
- Context Relevance
- Faithfulness
- Answer Similarity
- Response Time
- Hallucination Rate
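Once several runners have written their reports, a small helper such as this hypothetical sketch can line the headline scores up side by side. It assumes each `results/*_result.json` file exposes a top-level `score` field, which may not hold for every framework:

```python
import json
from pathlib import Path

# Collect the headline score from each framework's result file (assumed schema).
for path in sorted(Path("results").glob("*_result.json")):
    report = json.loads(path.read_text())
    print(f"{path.stem:<25} score={report.get('score')}")
```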