Evaluizer is an interface for evaluating and optimizing LLM prompts. It allows you to visualize outputs against datasets, manually annotate results, and run automated evaluations using both LLM judges and deterministic functions. It features GEPA (Genetic-Pareto), an optimization engine that iteratively evolves your prompts to maximize evaluation scores through reflective feedback loops.
You can run Evaluizer either with Docker Compose (recommended) or by setting up the services locally.
Docker Compose is the quickest way to get up and running:
- Configure Environment: copy the example environment file and add your API keys (e.g., OpenAI, Anthropic).

  ```bash
  cp .env.example .env
  # Edit .env with your favorite editor
  ```

- Build and Run:

  ```bash
  docker-compose up --build
  ```
The application will be available at http://localhost:3000.
For development, you can run the backend and frontend individually.
The backend is a FastAPI application that uses uv for dependency management.
```bash
cd backend

# Install dependencies
uv sync

# Run the server
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

The API will be running at http://localhost:8000.
The frontend is a Vite React app.
```bash
cd frontend

# Install dependencies
npm install

# Run the development server
npm run dev
```

The frontend will be running at http://localhost:3000.
Start by uploading a CSV file containing your dataset. The columns in your CSV will be available as variables for your prompts.
Use the Prompt Editor to configure the System Prompt and select a column to serve as the User Message. You can define variables in the system prompt using mustache syntax (e.g., `{{variable}}`), which will be populated from your CSV columns.
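To make the substitution concrete, here is a minimal, hypothetical sketch of how a `{{variable}}` placeholder maps to a CSV column. The column names and the `render` helper are illustrative only, not Evaluizer internals:

```python
# Illustrative only: how {{variable}} placeholders map to CSV columns.
# The column names and helper below are hypothetical, not Evaluizer internals.
import re

SYSTEM_PROMPT = "You are a support agent for {{product}}. Answer in {{language}}."

def render(template: str, row: dict) -> str:
    # Replace each {{column_name}} with the value from that row of the CSV.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row.get(m.group(1), "")), template)

row = {"product": "Evaluizer", "language": "English", "user_message": "How do I upload a CSV?"}
print(render(SYSTEM_PROMPT, row))
# -> You are a support agent for Evaluizer. Answer in English.
```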
Evaluizer automatically versions your prompts. You can view the history of changes, revert to previous versions, and manage configuration settings for the generator model (e.g., temperature, max tokens).
The Data Table view allows you to see your CSV dataset alongside the generated outputs from your prompts. You can compare outputs across different prompt versions.
Evaluizer supports three types of evaluations:
You can manually review generated outputs by providing:
- Binary Feedback: Thumbs up (1) or Thumbs down (0).
- Text Feedback: Detailed notes on why an output was good or bad.
Note: Currently, human annotations are not used as a signal for the GEPA optimizer.
Configure "Judge" prompts that act as evaluators. These judges take the input, the generated output, and their own system prompt to produce a score.
For deterministic scoring, you can use Python-based function evaluations. These are implemented as plugins in the `evaluations/` directory (see `evaluations/README.md` for details on creating custom plugins).
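As a hypothetical illustration of what a deterministic eval does, a function eval is just a pure function that maps a dataset row and a generated output to a score. The name and signature below are made up for illustration; the real plugin interface is the one documented in `evaluations/README.md`:

```python
# Hypothetical example of a deterministic function eval; the name and signature
# are illustrative, not the plugin interface defined in evaluations/README.md.
def length_budget_eval(row: dict, output: str) -> float:
    """Score 1.0 if the output stays under a length budget, else 0.0."""
    return 1.0 if len(output) <= 500 else 0.0
```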
GEPA (Genetic-Pareto) is an evolutionary optimization algorithm based on Reflective Prompt Evolution. It uses a reflective approach where it:
- Generates outputs using the current prompt.
- Evaluates them using your configured Judges and Function Evals.
- Reflects on the feedback to propose improved prompt variations.
- Iteratively evolves the prompt to maximize the combined score (a simplified sketch of this loop follows below).
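The sketch below is a deliberately simplified, hypothetical rendering of that loop, not Evaluizer's implementation: `generate`, `evaluators`, and `reflect` are placeholder callables, and the real algorithm maintains a pool of candidate prompts with Pareto-based selection rather than a single best prompt.

```python
# Simplified, hypothetical sketch of the reflective optimization loop.
# generate, evaluators, and reflect are placeholder callables, not Evaluizer APIs.
from statistics import mean

def optimize(base_prompt, dataset, generate, evaluators, reflect, max_iterations=10):
    def score(prompt):
        outputs = [generate(prompt, row) for row in dataset]        # 1. generate outputs
        per_row = [mean(ev(row, out) for ev in evaluators)          # 2. combine judge and
                   for row, out in zip(dataset, outputs)]           #    function-eval scores
        feedback = list(zip(dataset, outputs, per_row))             #    keep traces for reflection
        return mean(per_row), feedback

    best_prompt = base_prompt
    best_score, feedback = score(best_prompt)
    for _ in range(max_iterations):
        candidate = reflect(best_prompt, feedback)                  # 3. reflect and propose a variant
        cand_score, cand_feedback = score(candidate)
        if cand_score > best_score:                                 # 4. keep it only if it improves
            best_prompt, best_score, feedback = candidate, cand_score, cand_feedback
    return best_prompt
```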
To run GEPA, you need:
- A Base Prompt to start from.
- A Dataset (CSV) for training and validation.
- Evaluation Signals (at least one Judge or Function Eval) to define what "good" looks like.
Configure the optimization parameters (max iterations, reflection model) in the Optimizer Panel.
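Tying the pieces together, a hypothetical invocation of the `optimize` sketch above might look like the following. Every callable is still a placeholder, and `max_iterations` plus the `reflect` step correspond to the Optimizer Panel's max iterations and reflection model settings:

```python
# Hypothetical wiring of the optimize() sketch above; placeholder callables only.
rows = [{"product": "Evaluizer", "language": "English"}]

best = optimize(
    base_prompt="You are a support agent for {{product}}. Answer in {{language}}.",
    dataset=rows,
    generate=lambda prompt, row: "generated output",            # call the generator model here
    evaluators=[lambda row, out: float(len(out) > 0)],          # judges / function evals
    reflect=lambda prompt, feedback: prompt + " Be concise.",   # call the reflection model here
    max_iterations=10,                                          # mirrors the Optimizer Panel setting
)
```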
Contributions welcome! Feel free to submit a PR.
Todo list:
- Better processing (parallel vs. sequential)
- Meta-prompt agent