290 changes: 188 additions & 102 deletions website/docs/concepts/evaluation.mdx

Large diffs are not rendered by default.

Binary file added website/docs/img/evaluation/custom-eval-list.png
Binary file modified website/docs/img/evaluation/monitor-dashboard.png
Binary file modified website/docs/img/evaluation/run-logs.png
Binary file added website/docs/img/evaluation/span-scores-tab.png
187 changes: 187 additions & 0 deletions website/docs/tutorials/custom-evaluators.mdx
@@ -0,0 +1,187 @@
---
sidebar_position: 3
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Custom Evaluators

This tutorial walks you through creating custom evaluators in the AMP Console. Custom evaluators let you define domain-specific quality checks using Python code or LLM judge prompt templates.

## Prerequisites

- A running AMP instance (see [Quick Start](../getting-started/quick-start.mdx))
- An agent registered in AMP with an active environment
- Familiarity with [evaluation concepts](../concepts/evaluation.mdx), especially evaluator types and evaluation levels
- For LLM judge evaluators: an API key for a [supported LLM provider](../concepts/evaluation.mdx#supported-llm-providers)

---

## Navigate to Evaluators

1. Open the AMP Console and select your agent.
2. Click the **Evaluation** tab.
3. Click the **Evaluators** sub-tab to see the evaluators list.
4. Click **Create Evaluator**.

![Evaluators list with Create Evaluator button](../img/evaluation/custom-eval-list.png)

---

## Create a Custom Evaluator

### Step 1: Set Basic Details

1. Enter a **Display Name** (e.g., "Response Format Check" or "Domain Accuracy Judge").
2. The **Identifier** is auto-generated from the display name. You can customize it (must be lowercase with hyphens, 3–128 characters).
3. Add an optional **Description** explaining what this evaluator checks.
4. Select the **Evaluator Type**:
- **Code**: write a Python function with arbitrary evaluation logic (deterministic rules, external API calls, regex matching, statistical analysis, or any combination)
- **LLM-Judge**: write a prompt template that instructs an LLM to score trace quality — use this when evaluation requires subjective judgment (semantic accuracy, domain-specific quality, or nuanced reasoning assessment)

![Basic details form](../img/evaluation/custom-eval-basic-details.png)

### Step 2: Select Evaluation Level

Select the level at which your evaluator operates:

- **Trace**: evaluates the full execution from input to output (`Trace` object)
- **Agent**: evaluates a single agent's steps and decisions (`AgentTrace` object)
- **LLM**: evaluates a single LLM call with messages and response (`LLMSpan` object)

![Evaluation level selection](../img/evaluation/custom-eval-code-details.png)

### Step 3: Write the Evaluation Logic

<Tabs>
<TabItem value="code" label="Code Evaluator" default>

The editor provides a **read-only header** with imports and the function signature (auto-generated from your selected level and config parameters). Write your logic in the **function body** below the header.

Your function must return an `EvalResult`:

- **Score**: `EvalResult(score=0.85, explanation="...")` — score between 0.0 (worst) and 1.0 (best)
- **Skip**: `EvalResult.skip("reason")` — use when the evaluator is not applicable to this input

**Example**: a trace-level evaluator that checks output contains valid JSON:

```python
def evaluate(trace: Trace) -> EvalResult:
    if not trace.output:
        return EvalResult.skip("No output to evaluate")

    import json
    try:
        json.loads(trace.output)
        return EvalResult(score=1.0, explanation="Output is valid JSON")
    except json.JSONDecodeError as e:
        return EvalResult(score=0.0, explanation=f"Invalid JSON: {e}")
```

![Code editor](../img/evaluation/custom-eval-code-editor.png)

:::tip
Use `EvalResult.skip()` instead of returning a score of 0.0 when the evaluator is not applicable. Skipped evaluations are tracked separately and do not affect aggregated scores.
:::

</TabItem>
<TabItem value="llm-judge" label="LLM-Judge Evaluator">

Use placeholders to inject trace data into your prompt. Available placeholders depend on the selected level:

- **Trace level**: `{trace.input}`, `{trace.output}`, `{trace.get_tool_steps()}`, etc.
- **Agent level**: `{agent_trace.input}`, `{agent_trace.output}`, `{agent_trace.get_tool_steps()}`, etc.
- **LLM level**: `{llm_span.input}`, `{llm_span.output}`, etc.

Write only the evaluation criteria — the system automatically wraps your prompt in scoring instructions that tell the LLM to return a structured score and explanation.

**Example**: a trace-level LLM judge for a travel booking agent:

```
You are evaluating a travel booking agent's response.

User query: {trace.input}

Agent response: {trace.output}

Tools used: {trace.get_tool_steps()}

Evaluate whether the agent:
1. Recommended flights that match the user's stated preferences (dates, budget, airline)
2. Provided accurate pricing information consistent with the tool results
3. Included all required booking details (confirmation number, departure time, gate info)

Score 1.0 if all criteria are met, 0.5 if partially met, 0.0 if the response is incorrect or misleading.
```
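To build intuition for how these placeholders resolve, here is a rough Python sketch using format-style attribute access. This is only an illustration: the Console performs the real substitution server-side, and the `trace` stand-in below is hypothetical.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a trace object; field names mirror the
# placeholders used in the prompt template above.
trace = SimpleNamespace(
    input="Find me a flight to Lisbon under $400",
    output="Booked TAP 117 for $385, departing 09:10.",
)

template = (
    "User query: {trace.input}\n"
    "Agent response: {trace.output}"
)

# '{trace.input}' resolves via attribute access on the 'trace' argument.
rendered = template.format(trace=trace)
print(rendered)
```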

![LLM judge prompt editor](../img/evaluation/custom-eval-llm-judge-editor.png)

:::tip
LLM judge evaluators inherit the same **Model**, **Temperature**, and **Criteria** configuration as built-in LLM-as-Judge evaluators. These parameters are configurable when adding the evaluator to a monitor.
:::

</TabItem>
</Tabs>

### Step 4: Use the AI Copilot (Optional)

The editor includes an **AI Copilot Prompt** section — a pre-built, context-aware prompt you can copy and paste into your AI assistant (e.g., ChatGPT, Claude). Describe what you want to evaluate, and the AI will generate the evaluation code or prompt template for you.

### Step 5: Define Configuration Parameters (Optional)

Configuration parameters make your evaluator reusable with different settings across monitors. For example, a content check evaluator might accept a `keywords` parameter so different monitors can check for different terms.

1. Expand the **Config Params** section.
2. Click **Add Parameter**.
3. For each parameter, configure:
- **Key**: a Python identifier (e.g., `min_words`, `required_format`)
- **Type**: string, integer, float, boolean, array, or enum
- **Description**: shown to users when configuring the evaluator in a monitor
- **Default value**: used when not overridden
- **Constraints**: min/max for numbers, allowed values for enum types

In **Code** evaluators, parameters appear as keyword arguments in the function signature (e.g., `threshold: float = 0.5`). In **LLM-Judge** evaluators, parameters are available as `{key}` placeholders in your prompt template (e.g., `{domain}`).

### Step 6: Add Tags and Create

1. Optionally add **Tags** to categorize your evaluator (e.g., `format`, `domain-specific`, `compliance`).
2. Review your configuration.
3. Click **Create Evaluator**.

Your evaluator appears in the evaluators list and can be selected when creating or editing monitors.

---

## Use Custom Evaluators in a Monitor

Once created, custom evaluators appear in the evaluator selection grid alongside built-in evaluators when [creating or editing a monitor](./evaluation-monitors.mdx).

- Code evaluators are tagged with **code**
- LLM judge evaluators are tagged with **llm-judge**
- Your custom tags are also displayed on the evaluator cards

Select and configure custom evaluators the same way as built-in evaluators. Set parameter values, choose the LLM model (for LLM judges), and add them to the monitor.

---

## Edit and Delete Custom Evaluators

### Edit

Click an evaluator in the evaluators list to open it for editing. You can update:

- Display name and description
- Source code or prompt template
- Configuration parameter schema
- Tags

The **identifier** and **evaluation level** cannot be changed after creation.

### Delete

Click the **delete** icon on an evaluator in the list. Deletion is a soft delete: the evaluator disappears from the list, but existing monitor results that reference it are preserved.

:::info
A custom evaluator cannot be deleted while it is referenced by an active monitor. Remove the evaluator from all monitors before deleting it.
:::
88 changes: 73 additions & 15 deletions website/docs/tutorials/evaluation-monitors.mdx
@@ -32,15 +32,15 @@ This tutorial walks you through creating an evaluation monitor, viewing results,
Fill in the monitor configuration:

- **Monitor Title**: A descriptive name for the monitor (e.g., "Production Quality Monitor").
- **Identifier**: Auto-generated from the title. You can customize it (must be lowercase with hyphens, 3–60 characters).
- **Data Collection Type**: Choose one:
  - **Past Traces**: evaluate traces from a specific time window. Set a **Start Time** and **End Time**. The evaluation runs immediately after creation.
  - **Future Traces**: evaluate new traces on a recurring schedule. Set an **interval** in minutes (minimum 5 minutes).

![Monitor details form](../img/evaluation/create-step1.png)

:::tip Choosing a monitor type
Use **Past Traces** when you want to assess historical agent behavior, such as reviewing last week's interactions after a deployment. Use **Future Traces** for ongoing production quality monitoring.
:::

---
@@ -49,11 +49,11 @@ Use **Past Traces** when you want to assess historical agent behavior — for ex

1. Browse the evaluator grid. Each card shows the evaluator name, tags, and a brief description.
2. Click an evaluator card to open its details and configuration.
3. Configure parameters as needed. For example, set `max_latency_ms` for the Latency evaluator, or choose a model for an LLM-as-Judge evaluator.
4. Click **Add Evaluator** to include it in the monitor.
5. Repeat for all evaluators you want to use. You must select at least one.

For a full reference of available evaluators and their parameters, see [Built-in Evaluators](../concepts/evaluation.mdx#built-in-evaluators). You can also create your own (see [Custom Evaluators](./custom-evaluators.mdx)).

![Evaluator selection grid](../img/evaluation/create-step2-evaluators.png)

@@ -73,7 +73,7 @@ If you selected any LLM-as-Judge evaluators, you need to configure at least one
The **model** field on LLM-as-Judge evaluators uses `provider/model` format (e.g., `openai/gpt-4o-mini`, `anthropic/claude-sonnet-4-6`). The available models depend on the providers you have configured.
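As a quick illustration of the `provider/model` convention (a sketch, not tied to any Console API), splitting on the first slash yields the two parts:

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider/model' identifier into its two parts.

    Model names may themselves contain slashes, so split only on
    the first one.
    """
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"Expected 'provider/model', got: {model_id!r}")
    return provider, model

print(split_model_id("openai/gpt-4o-mini"))
print(split_model_id("anthropic/claude-sonnet-4-6"))
```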

:::tip
You only need to add each provider once per monitor. All evaluators using that provider share the same credentials.
:::

![LLM provider configuration](../img/evaluation/llm-provider-config.png)
@@ -97,13 +97,44 @@ After creation, you'll see your monitor in the monitor list. Click a monitor to

The monitor dashboard provides several views of your evaluation results:

- **Time Range Selector**: filter results by Last 24 Hours, Last 3 Days, Last 7 Days, or Last 30 Days. Historical monitors show their fixed trace window instead.
- **Agent Performance Chart**: a radar chart showing mean scores across all evaluators, giving a quick visual summary of agent strengths and weaknesses.
- **Evaluation Summary**: shows the weighted average score and total evaluation count, with **per-level statistics**:
  - **Trace level**: number of traces evaluated, evaluator count, and skip rate
  - **Agent level**: number of agent executions evaluated, evaluator count, and skip rate
  - **LLM level**: number of LLM invocations evaluated, evaluator count, and skip rate

  Only levels with configured evaluators appear in the summary.
- **Run Summary**: latest run status with quick access to run history.
- **Performance by Evaluator**: a time-series chart showing how each evaluator's score trends over time. Useful for spotting regressions or improvements.

![Monitor dashboard](../img/evaluation/monitor-dashboard.png)

### Score Breakdowns

When your monitor includes agent-level or LLM-level evaluators, the dashboard shows additional breakdown tables below the performance chart.

#### Score Breakdown by Agent

A table with one row per agent found in the evaluated traces. Each row shows:

- **Agent name**: the agent's identifier from the trace
- **Evaluator scores**: mean score for each agent-level evaluator, displayed as color-coded percentage chips. A dash (–) indicates the evaluator was skipped for that agent.
- **Count**: the number of agent executions evaluated

This helps you identify which agent in a multi-agent system needs improvement.
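The table's numbers boil down to a group-by-and-average. A rough sketch, assuming a flat list of agent-level score records (the record shape here is illustrative, not the Console's actual schema):

```python
from collections import defaultdict

# Illustrative agent-level score records.
records = [
    {"agent": "flight-search", "evaluator": "accuracy", "score": 0.9},
    {"agent": "flight-search", "evaluator": "accuracy", "score": 0.7},
    {"agent": "booking", "evaluator": "accuracy", "score": 0.5},
]

def breakdown_by_agent(records):
    """Mean score and execution count per (agent, evaluator) pair."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec["agent"], rec["evaluator"])].append(rec["score"])
    return {
        key: {"mean": sum(scores) / len(scores), "count": len(scores)}
        for key, scores in buckets.items()
    }

print(breakdown_by_agent(records))
```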

#### Score Breakdown by Model

A table with one row per LLM model used across the evaluated traces. Each row shows:

- **Model name**: the LLM model identifier (e.g., `gpt-4o`, `claude-sonnet-4-6`)
- **Evaluator scores**: mean score for each LLM-level evaluator
- **Count**: the number of LLM invocations evaluated

This helps you compare quality across different models used by your agents.


### Run History

The dashboard also shows a history of all evaluation runs. Each run displays:
@@ -114,8 +145,8 @@ The dashboard also shows a history of all evaluation runs. Each run displays:

You can take actions on individual runs:

- **Rerun**: re-execute the evaluation run.
- **View Logs**: see detailed execution logs for troubleshooting.

![Run history](../img/evaluation/run-history.png)

@@ -127,6 +158,33 @@ Click **View Logs** on any run to open the log viewer. This displays the applica

---

## View Scores in Trace View

Evaluation scores are also visible directly in the trace view, making it easy to investigate specific agent interactions without switching to the monitor dashboard.

### Score Column in Traces Table

The traces list includes a **Score** column showing the average evaluator score for each trace. Scores are color-coded (green for high scores, red for low), giving you a quick visual indicator of which traces need attention.

![Traces table with score column](../img/evaluation/traces-table-scores.png)

### Scores in Span Details

Click any trace to open the trace timeline. Select a span to view its details panel:

1. **Score chips in the header**: evaluator scores appear as color-coded percentage chips in the span's basic info section, alongside duration, token count, and model information.
2. **Scores tab**: a dedicated tab shows each evaluator's result:
- **Evaluator name**: prefixed with the monitor name when the same evaluator appears in multiple monitors (e.g., `production-monitor / Accuracy`)
- **Score chip**: color-coded percentage (green for high, red for low)
- **Explanation**: markdown-rendered explanation from the evaluator describing why this score was given
- **Skipped evaluators**: shown with a skip reason instead of a score

Trace-level scores appear on the root span. Agent-level and LLM-level scores appear on their respective agent and LLM spans.

![Span details with scores tab](../img/evaluation/span-scores-tab.png)

---

## Start and Suspend a Monitor

This applies to **continuous monitors** only.
@@ -135,7 +193,7 @@ This applies to **continuous monitors** only.
- **Start**: Click the **play** button on a suspended monitor. Evaluation resumes within 60 seconds.

:::info
Historical monitors cannot be started or suspended. They run once when created.
:::

---
@@ -144,7 +202,7 @@ Historical monitors cannot be started or suspended — they run once when create

1. Click the **edit** (pencil) icon in the monitor list actions column.
2. The monitor configuration wizard opens with the current settings.
3. Update the fields you want to change: display name, evaluators, evaluator parameters, LLM provider credentials, interval (for continuous monitors), or time range (for historical monitors).
4. Click **Save** to apply the changes.

:::info
3 changes: 2 additions & 1 deletion website/sidebars.ts
@@ -62,7 +62,8 @@ const sidebars: SidebarsConfig = {
collapsed: false,
items: [
'tutorials/observe-first-agent',
'tutorials/evaluation-monitors',
'tutorials/custom-evaluators'
],
},
{
4 changes: 2 additions & 2 deletions website/versioned_docs/version-v0.9.x/_constants.md
@@ -1,6 +1,6 @@
{/* This file stores constants used across the documentation */}

export const versions = {
latestVersion: 'v0.9.x',
quickStartDockerTag: 'v0.9.0'
};