A simple Q&A bot for technical documentation designed to test and compare different LLM evaluation frameworks including DeepEval, LangChain Evaluation, RAGAS, and OpenAI Evals.
This project serves as a testbed for comparing how different evaluation frameworks assess the same RAG (Retrieval-Augmented Generation) system.
- Clone the repository

  ```bash
  git clone https://github.com/LiteObject/eval-framework-sandbox.git
  cd eval-framework-sandbox
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables

  ```bash
  cp .env.example .env
  # Edit .env with your API keys (optional unless running remote evals)
  ```

- Ask a question

  ```bash
  python -m src.main "How do you install the Python requests library?"
  ```

  The bot will print a synthesized answer and list the matching documents.

- Run the unit tests

  ```bash
  pytest
  ```

- (Optional) Try an evaluation framework
  - Update `.env` with the relevant API keys or enable the Ollama flag for a local model (details below).
  - Install extras: `pip install -r requirements.txt` already includes the optional libraries, or run `pip install .[eval]` after an editable install.
  - Use the runner scripts in `evaluations/` as starting points; each script writes its results into `results/`.
The core QA bot already runs fully offline using TF-IDF retrieval. If you also want LangChain's evaluators to call a local Ollama model instead of OpenAI:
- Install Ollama and pull a model, e.g. `ollama pull llama3`.
- Set the following environment variables (via `.env` or your shell):
  - `LANGCHAIN_USE_OLLAMA=true`
  - `OLLAMA_MODEL=llama3` (or any other pulled model)
  - Optionally `OLLAMA_BASE_URL=http://localhost:11434` if you're running Ollama on a non-default host/port.
- Leave `OPENAI_API_KEY` blank; the LangChain evaluator will detect the Ollama flag and use `ChatOllama`.

If `LANGCHAIN_USE_OLLAMA` is false, the evaluator falls back to `ChatOpenAI` and expects a valid `OPENAI_API_KEY` plus `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
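For reference, here is a minimal sketch of what that backend selection can look like, assuming the `langchain-community` and `langchain-openai` packages; the actual evaluator code in `evaluations/` may differ in detail:

```python
# Illustrative sketch only; not the project's actual implementation.
import os

from langchain_community.chat_models import ChatOllama
from langchain_openai import ChatOpenAI


def build_chat_model():
    """Return a local ChatOllama model when the flag is set, otherwise ChatOpenAI."""
    if os.getenv("LANGCHAIN_USE_OLLAMA", "").lower() == "true":
        return ChatOllama(
            model=os.getenv("OLLAMA_MODEL", "llama3"),
            base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        )
    return ChatOpenAI(model=os.getenv("LANGCHAIN_OPENAI_MODEL", "gpt-3.5-turbo"))
```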
These integrations are opt-in. Install the additional dependencies with:
```bash
pip install .[eval]
```

Each runner expects the dataset built from the JSON files `data/questions.json` and `data/ground_truth.json`. The helper below mirrors what the runners use internally:
```python
from pathlib import Path

from evaluations.utils import load_dataset_from_files

dataset = load_dataset_from_files(
    Path("data/questions.json"),
    Path("data/ground_truth.json"),
)
```
For DeepEval:

- Set `DEEPEVAL_API_KEY` in `.env` if you plan to submit results to the hosted DeepEval service (local scoring works without it).
- Run the runner programmatically:

  ```python
  from evaluations.deepeval_runner import DeepEvalRunner

  runner = DeepEvalRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  The report is also written to `results/deepeval_result.json`.
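Independently of the runner, you can sanity-check a single question with DeepEval's own API. The snippet below is a sketch using `LLMTestCase` and `AnswerRelevancyMetric` with made-up input/output strings; note that this metric relies on an LLM judge (OpenAI by default), so it needs `OPENAI_API_KEY`:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Illustrative strings; in practice, use an entry from data/questions.json
# and the bot's actual answer and retrieved context.
test_case = LLMTestCase(
    input="How do you install the Python requests library?",
    actual_output="Run `pip install requests` in your terminal.",
    retrieval_context=["Install requests with `pip install requests`."],
)

metric = AnswerRelevancyMetric(threshold=0.7)  # LLM-judged metric
metric.measure(test_case)
print(metric.score)
```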
For LangChain Evaluation:

- Choose your backend:
  - Remote OpenAI models: set `OPENAI_API_KEY` and optionally `LANGCHAIN_OPENAI_MODEL` (defaults to `gpt-3.5-turbo`).
  - Local Ollama: set `LANGCHAIN_USE_OLLAMA=true`, `OLLAMA_MODEL`, and optionally `OLLAMA_BASE_URL`; no OpenAI key required.
- Invoke the runner:

  ```python
  from evaluations.langchain_eval_runner import LangChainEvalRunner

  runner = LangChainEvalRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  LangChain will call the configured chat model to grade responses and store the output at `results/langchain_result.json`.
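To see the kind of grading LangChain performs, here is a standalone sketch using its built-in `"qa"` evaluator directly, separate from the project's runner. It assumes the Ollama backend from the local-model section; a `ChatOpenAI` instance works the same way:

```python
from langchain.evaluation import load_evaluator
from langchain_community.chat_models import ChatOllama

# Grade a single prediction against a reference answer.
llm = ChatOllama(model="llama3")
evaluator = load_evaluator("qa", llm=llm)

graded = evaluator.evaluate_strings(
    input="How do you install the Python requests library?",
    prediction="Run `pip install requests`.",
    reference="Install it with `pip install requests`.",
)
print(graded)  # typically includes a verdict such as CORRECT/INCORRECT and a score
```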
For RAGAS:

- Install the `ragas` extras (already included in `.[eval]`). Some metrics call an LLM; set `OPENAI_API_KEY` or configure RAGAS to use a local model before running.
- Evaluate the dataset:

  ```python
  from evaluations.ragas_runner import RagasRunner

  runner = RagasRunner()
  result = runner.evaluate(dataset)
  print(result.score, result.details)
  ```

  The raw metric results are saved to `results/ragas_result.json`.
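For orientation, a direct RAGAS call looks roughly like the sketch below, independent of the runner. Column names vary between `ragas` versions (e.g. `ground_truth` vs. `ground_truths`), and the default metrics call OpenAI, so treat this as an approximation:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One illustrative record; the project's runner builds this from data/*.json instead.
data = Dataset.from_dict({
    "question": ["How do you install the Python requests library?"],
    "answer": ["Run `pip install requests`."],
    "contexts": [["Install requests with `pip install requests`."]],
    "ground_truth": ["pip install requests"],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)
```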
For OpenAI Evals, this repository only prepares the dataset and relies on OpenAI's CLI for the actual evaluation. Ensure `evals` is installed and `OPENAI_API_KEY` is set, then use `evaluations/openai_eval_runner.py` to export a dataset and follow the OpenAI Evals documentation to launch the experiments with `oaieval`.
Repository layout:

- `data/`: Test questions, ground truth, and source documents
- `src/`: Core Q&A bot implementation
- `evaluations/`: Framework-specific evaluation scripts
- `results/`: Evaluation results and comparisons (gitignored except for `.gitkeep`)
Metrics compared across frameworks include:

- Answer Correctness
- Context Relevance
- Faithfulness
- Answer Similarity
- Response Time
- Hallucination Rate
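Once several runners have written their reports, a small helper such as this hypothetical sketch can line the headline scores up side by side. It assumes each `results/*_result.json` file exposes a top-level `score` field, which may not hold for every framework:

```python
import json
from pathlib import Path

# Collect the headline score from each framework's result file (assumed schema).
for path in sorted(Path("results").glob("*_result.json")):
    report = json.loads(path.read_text())
    print(f"{path.stem:<25} score={report.get('score')}")
```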