Evaluizer is an interface for evaluating and optimizing LLM prompts. It allows you to visualize outputs against datasets, manually annotate results, and run automated evaluations using both LLM judges and deterministic functions. It features GEPA (Genetic-Pareto), an optimization engine that iteratively evolves your prompts to maximize evaluation scores through reflective feedback loops.
You can run Evaluizer either with Docker Compose (recommended) or by setting up the services locally.
Docker Compose is the quickest way to get up and running:
- Configure Environment: copy the example environment file and add your API keys (e.g., OpenAI, Anthropic).

  ```bash
  cp .env.example .env
  # Edit .env with your favorite editor
  ```

- Build and Run:

  ```bash
  docker-compose up --build
  ```
The application will be available at http://localhost:3000.
For development, you can run the backend and frontend individually.
The backend is a FastAPI application that uses uv for dependency management.
```bash
cd backend

# Install dependencies
uv sync

# Run the server
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

The API will be running at http://localhost:8000.
The frontend is a Vite React app.
```bash
cd frontend

# Install dependencies
npm install

# Run the development server
npm run dev
```

The frontend will be running at http://localhost:3000.
Start by uploading a CSV file containing your dataset. The columns in your CSV will be available as variables for your prompts.
Use the Prompt Editor to configure the System Prompt and select a column to serve as the User Message. You can define variables in the system prompt using mustache syntax (e.g., `{{variable}}`), which will be populated from your CSV columns.
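To make the substitution concrete, here is a minimal, hypothetical sketch of how a `{{variable}}` placeholder maps to a CSV column. The column names and the `render` helper are illustrative only, not Evaluizer internals:

```python
# Illustrative only: how {{variable}} placeholders map to CSV columns.
# The column names and helper below are hypothetical, not Evaluizer internals.
import re

SYSTEM_PROMPT = "You are a support agent for {{product}}. Answer in {{language}}."

def render(template: str, row: dict) -> str:
    # Replace each {{column_name}} with the value from that row of the CSV.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row.get(m.group(1), "")), template)

row = {"product": "Evaluizer", "language": "English", "user_message": "How do I upload a CSV?"}
print(render(SYSTEM_PROMPT, row))
# -> You are a support agent for Evaluizer. Answer in English.
```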
Evaluizer automatically versions your prompts. You can view the history of changes, revert to previous versions, and manage configuration settings for the generator model (e.g., temperature, max tokens).
The Data Table view allows you to see your CSV dataset alongside the generated outputs from your prompts. You can compare outputs across different prompt versions.
Evaluizer supports three types of evaluations:
You can manually review generated outputs by providing:
- Binary Feedback: Thumbs up (1) or Thumbs down (0).
- Text Feedback: Detailed notes on why an output was good or bad.
Note: Currently, human annotations are not used as a signal for the GEPA optimizer.
Configure "Judge" prompts that act as evaluators. These judges take the input, the generated output, and their own system prompt to produce a score.
For deterministic scoring, you can use Python-based function evaluations. These are implemented as plugins in the `evaluations/` directory (see `evaluations/README.md` for details on creating custom plugins).
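As a hypothetical illustration of what a deterministic eval does, a function eval is just a pure function that maps a dataset row and a generated output to a score. The name and signature below are made up for illustration; the real plugin interface is the one documented in `evaluations/README.md`:

```python
# Hypothetical example of a deterministic function eval; the name and signature
# are illustrative, not the plugin interface defined in evaluations/README.md.
def length_budget_eval(row: dict, output: str) -> float:
    """Score 1.0 if the output stays under a length budget, else 0.0."""
    return 1.0 if len(output) <= 500 else 0.0
```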
GEPA (Genetic-Pareto) is an evolutionary optimization algorithm based on Reflective Prompt Evolution. It uses a reflective approach where it:
- Generates outputs using the current prompt.
- Evaluates them using your configured Judges and Function Evals.
- Reflects on the feedback to propose improved prompt variations.
- Iteratively evolves the prompt to maximize the combined score (a simplified sketch of this loop follows below).
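The sketch below is a deliberately simplified, hypothetical rendering of that loop, not Evaluizer's implementation: `generate`, `evaluators`, and `reflect` are placeholder callables, and the real algorithm maintains a pool of candidate prompts with Pareto-based selection rather than a single best prompt.

```python
# Simplified, hypothetical sketch of the reflective optimization loop.
# generate, evaluators, and reflect are placeholder callables, not Evaluizer APIs.
from statistics import mean

def optimize(base_prompt, dataset, generate, evaluators, reflect, max_iterations=10):
    def score(prompt):
        outputs = [generate(prompt, row) for row in dataset]        # 1. generate outputs
        per_row = [mean(ev(row, out) for ev in evaluators)          # 2. combine judge and
                   for row, out in zip(dataset, outputs)]           #    function-eval scores
        feedback = list(zip(dataset, outputs, per_row))             #    keep traces for reflection
        return mean(per_row), feedback

    best_prompt = base_prompt
    best_score, feedback = score(best_prompt)
    for _ in range(max_iterations):
        candidate = reflect(best_prompt, feedback)                  # 3. reflect and propose a variant
        cand_score, cand_feedback = score(candidate)
        if cand_score > best_score:                                 # 4. keep it only if it improves
            best_prompt, best_score, feedback = candidate, cand_score, cand_feedback
    return best_prompt
```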
To run GEPA, you need:
- A Base Prompt to start from.
- A Dataset (CSV) for training and validation.
- Evaluation Signals (at least one Judge or Function Eval) to define what "good" looks like.
Configure the optimization parameters (max iterations, reflection model) in the Optimizer Panel.
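Tying the pieces together, a hypothetical invocation of the `optimize` sketch above might look like the following. Every callable is still a placeholder, and `max_iterations` plus the `reflect` step correspond to the Optimizer Panel's max iterations and reflection model settings:

```python
# Hypothetical wiring of the optimize() sketch above; placeholder callables only.
rows = [{"product": "Evaluizer", "language": "English"}]

best = optimize(
    base_prompt="You are a support agent for {{product}}. Answer in {{language}}.",
    dataset=rows,
    generate=lambda prompt, row: "generated output",            # call the generator model here
    evaluators=[lambda row, out: float(len(out) > 0)],          # judges / function evals
    reflect=lambda prompt, feedback: prompt + " Be concise.",   # call the reflection model here
    max_iterations=10,                                          # mirrors the Optimizer Panel setting
)
```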
Contributions welcome! Feel free to submit a PR.
Todo list:
- Better processing (parallel vs. sequential)
- Meta-prompt agent