
Evaluizer

Evaluizer Logo

Evaluizer is an interface for evaluating and optimizing LLM prompts. It allows you to visualize outputs against datasets, manually annotate results, and run automated evaluations using both LLM judges and deterministic functions. It features GEPA (Genetic-Pareto), an optimization engine that iteratively evolves your prompts to maximize evaluation scores through reflective feedback loops.

Setup and Install

You can run Evaluizer either with Docker Compose (recommended) or by setting up the services locally.

Option 1: Docker Compose

This is the quickest way to get up and running.

  1. Configure Environment: Copy the example environment file and add your API keys (e.g., OpenAI, Anthropic).

    cp .env.example .env
    # Edit .env with your favorite editor
  2. Build and Run

    docker-compose up --build

The application will be available at http://localhost:3000.

Option 2: Local Development

For development, you can run the backend and frontend individually.

Backend

The backend is a FastAPI application that uses uv for dependency management.

cd backend

# Install dependencies
uv sync

# Run the server
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The API will be running at http://localhost:8000.

Frontend

The frontend is a Vite React app.

cd frontend

# Install dependencies
npm install

# Run the development server
npm run dev

The frontend will be running at http://localhost:3000.

Usage

Uploading Data

Start by uploading a CSV file containing your dataset. The columns in your CSV will be available as variables for your prompts.

Creating Prompts

Use the Prompt Editor to configure the System Prompt and select a column to serve as the User Message. You can define variables in the system prompt using mustache syntax (e.g., {{variable}}) which will be populated from your CSV columns.
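For illustration, suppose your CSV has columns `question` and `context` (hypothetical names). The sketch below shows how a `{{context}}` placeholder in the system prompt would be filled from each row; it is a minimal stand-in for the substitution Evaluizer performs, not its actual implementation.

```python
import csv
import re

# Hypothetical system prompt referencing a CSV column via mustache syntax.
SYSTEM_PROMPT = "Answer the question using only this context: {{context}}"

def render(template: str, row: dict) -> str:
    """Replace {{variable}} placeholders with values from a CSV row."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: row.get(m.group(1), ""), template)

# data.csv is an assumed file with columns: question,context
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        system_message = render(SYSTEM_PROMPT, row)  # {{context}} filled from the row
        user_message = row["question"]               # column selected as the User Message
```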

Prompt Versioning and Config

Evaluizer automatically versions your prompts. You can view the history of changes, revert to previous versions, and manage configuration settings for the generator model (e.g., temperature, max tokens).
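As a rough illustration, the generator configuration covers settings like the following (field names here are assumptions for the example, not Evaluizer's exact schema):

```python
# Illustrative generator settings; field names are assumptions, not Evaluizer's schema.
generator_config = {
    "model": "gpt-4o-mini",  # example model identifier
    "temperature": 0.2,      # lower values make outputs more deterministic
    "max_tokens": 512,       # cap on generated output length
}
```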

Visualizing

The Data Table view allows you to see your CSV dataset alongside the generated outputs from your prompts. You can compare outputs across different prompt versions.

Evals

Evaluizer supports three types of evaluations:

Annotating (Human Eval)

You can manually review generated outputs by providing:

  • Binary Feedback: Thumbs up (1) or Thumbs down (0).
  • Text Feedback: Detailed notes on why an output was good or bad.

Note: Currently, human annotations are not used as a signal for the GEPA optimizer.

Making LLM-as-a-Judge Evals

Configure "Judge" prompts that act as evaluators. These judges take the input, the generated output, and their own system prompt to produce a score.
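As a hypothetical example (not a prompt shipped with Evaluizer), a judge prompt usually spells out the scoring criteria and the expected score format:

```python
# Hypothetical judge prompt; the criteria and scale are up to you.
JUDGE_SYSTEM_PROMPT = """You are grading an LLM response.
Given the INPUT and the OUTPUT, score the OUTPUT from 0.0 to 1.0 for factual
accuracy and relevance to the INPUT. Respond with only the numeric score."""

def build_judge_messages(input_text: str, output_text: str) -> list[dict]:
    """Assemble the messages a judge model would receive (illustrative layout)."""
    return [
        {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
        {"role": "user", "content": f"INPUT:\n{input_text}\n\nOUTPUT:\n{output_text}"},
    ]
```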

Making Function Evals

For deterministic scoring, you can use Python-based function evaluations. These are implemented as plugins in the evaluations/ directory. (See evaluations/README.md for details on creating custom plugins).
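The real plugin interface is described in evaluations/README.md; conceptually, a function eval is just deterministic Python that maps an input/output pair to a score, for example:

```python
# Conceptual examples only; see evaluations/README.md for the actual plugin interface.
def exact_match(expected: str, output: str) -> float:
    """Return 1.0 if the generated output matches the expected answer exactly."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def within_length(output: str, limit: int = 500) -> float:
    """Return 1.0 if the output stays under a character limit, else 0.0."""
    return 1.0 if len(output) <= limit else 0.0
```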

Optimizing

What is GEPA?

GEPA (Genetic-Pareto) is an evolutionary optimization algorithm based on Reflective Prompt Evolution. It uses a reflective approach where it (see the sketch after this list):

  1. Generates outputs using the current prompt.
  2. Evaluates them using your configured Judges and Function Evals.
  3. Reflects on the feedback to propose improved prompt variations.
  4. Iteratively evolves the prompt to maximize the combined score.
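A heavily simplified sketch of that loop is below. The names generate_fn, judge_fns, and reflect_fn are placeholders for your generator model, configured evals, and reflection model; the real GEPA implementation also maintains a Pareto frontier of candidate prompts rather than a single best prompt.

```python
from statistics import mean

# Simplified, illustrative sketch of the reflective loop; not GEPA's actual API.
def optimize(base_prompt, dataset, generate_fn, judge_fns, reflect_fn, max_iters=10):
    prompt, best_prompt, best_score = base_prompt, base_prompt, float("-inf")
    for _ in range(max_iters):
        outputs = [generate_fn(prompt, row) for row in dataset]        # 1. generate
        scores = [mean(judge(row, out) for judge in judge_fns)         # 2. evaluate
                  for row, out in zip(dataset, outputs)]
        if mean(scores) > best_score:
            best_prompt, best_score = prompt, mean(scores)
        feedback = list(zip(dataset, outputs, scores))                 # 3. reflect on feedback
        prompt = reflect_fn(prompt, feedback)                          #    and propose a revision
    return best_prompt                                                 # 4. best evolved prompt
```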

Config and Recs

To run GEPA, you need:

  1. A Base Prompt to start from.
  2. A Dataset (CSV) for training and validation.
  3. Evaluation Signals (at least one Judge or Function Eval) to define what "good" looks like.

Configure the optimization parameters (max iterations, reflection model) in the Optimizer Panel.
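As a rough example of what you would set there (field names are illustrative, not Evaluizer's exact schema):

```python
# Illustrative optimizer settings; names are assumptions, not Evaluizer's schema.
optimizer_config = {
    "max_iterations": 10,          # number of reflective evolution rounds
    "reflection_model": "gpt-4o",  # model used to critique outputs and propose prompt edits
}
```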

Contributing

Contributions welcome! Feel free to submit a PR.

Todo list:

  • Better processing (parallel vs sequential)
  • Meta prompt agent

License

MIT
