18 changes: 12 additions & 6 deletions .github/workflows/evaluate-agent.yml
@@ -2,10 +2,10 @@ name: Evaluate Trail Guide Agent

on:
# Uncomment the lines below to enable automatic evaluation on pull requests
# pull_request:
# branches: [main]
# paths:
# - 'src/agents/trail_guide_agent/**'
pull_request:
branches: [main]
paths:
- 'src/agents/trail_guide_agent/**'
workflow_dispatch:

permissions:
@@ -44,9 +44,14 @@ jobs:
env:
AZURE_AI_PROJECT_ENDPOINT: ${{ secrets.AZURE_AI_PROJECT_ENDPOINT }}
MODEL_NAME: ${{ vars.MODEL_NAME || 'gpt-4.1' }}
AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
run: |
python src/evaluators/evaluate_agent.py > evaluation_results.txt
python src/evaluators/evaluate_agent.py > evaluation_results.txt 2>&1 || true
cat evaluation_results.txt
# Fail the step if the script wrote an error marker
grep -q "Evaluation FAILED" evaluation_results.txt && exit 1 || exit 0
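The run step above deliberately swallows the script's exit code (`|| true`) so the log is always surfaced, then fails only when the error marker appears. The same pattern can be exercised outside CI with a minimal shell sketch (file names and the log line are illustrative):

```shell
#!/bin/sh
# Simulate an evaluation script that writes results containing the marker.
printf 'Average score: 4.2\nEvaluation FAILED: groundedness below threshold\n' > results.txt

# Same logic as the workflow step: always show the log,
# then fail only if the error marker is present.
cat results.txt
if grep -q "Evaluation FAILED" results.txt; then
  echo "marker found, failing step"
  status=1
else
  status=0
fi
echo "exit status: $status"
```

Keying the failure off a marker string rather than the script's exit code keeps partial output visible in the PR comment even when the evaluation aborts.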

- name: Comment PR with results
if: github.event_name == 'pull_request'
@@ -55,6 +60,7 @@
script: |
const fs = require('fs');
const results = fs.readFileSync('evaluation_results.txt', 'utf8');
const reportUrl = '${{ steps.run.outputs.report_url }}' || 'Not available';

const body = `## 🎯 Agent Evaluation Results

@@ -69,7 +75,7 @@ jobs:

</details>

📊 [View full results in Azure AI Foundry Portal](${{ steps.run.outputs.report_url }})
📊 [View full results in Azure AI Foundry Portal](${reportUrl})

**Evaluation Criteria:**
- Intent Resolution (score ≥ 3)
16 changes: 16 additions & 0 deletions docs/02-prompt-management.md
@@ -74,6 +74,13 @@ Now you'll use the Azure Developer CLI to deploy all required Azure resources.

Sign in with your Azure credentials when prompted.

> ⚠️ **Important**
> In some environments, the VS Code integrated terminal may crash or close during the interactive login flow.
> If this happens, authenticate using explicit credentials instead:
> ```powershell
> az login --username <your-username> --password <your-password>
> ```

1. Provision resources:

```powershell
@@ -98,6 +105,15 @@ Now you'll use the Azure Developer CLI to deploy all required Azure resources.
azd env get-values > .env
```

> ⚠️ **Important – File Encoding**
>
> After generating the `.env` file, make sure it is saved using **UTF-8** encoding.
>
> In editors like **VS Code**, check the encoding indicator in the bottom-right corner.
> If it shows **UTF-16 LE** (or any encoding other than UTF-8), click it, choose **Save with Encoding**, and select **UTF-8**.
>
> Using the wrong encoding may cause environment variables to be read incorrectly.

This creates a `.env` file in your project root with all the provisioned resource information.
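If you prefer to check and repair the encoding from a script rather than the editor, a small Python sketch like the following detects a UTF-16 byte-order mark and rewrites the file as UTF-8. The `.env` path is an assumption; run it from the project root:

```python
import codecs
from pathlib import Path

def ensure_utf8(path: str) -> bool:
    """Rewrite a UTF-16 (LE or BE) file as UTF-8; return True if it was rewritten."""
    p = Path(path)
    raw = p.read_bytes()
    for bom, enc in ((codecs.BOM_UTF16_LE, "utf-16-le"),
                     (codecs.BOM_UTF16_BE, "utf-16-be")):
        if raw.startswith(bom):
            # Strip the BOM, decode, and write back as plain UTF-8.
            text = raw[len(bom):].decode(enc)
            p.write_bytes(text.encode("utf-8"))
            return True
    return False

if __name__ == "__main__":
    if Path(".env").exists():
        print("rewritten as UTF-8" if ensure_utf8(".env") else "already UTF-8 (or no BOM)")
```

UTF-16 LE output is what Windows PowerShell's `>` redirection historically produces, which is why the generated `.env` may need this fix.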

1. Add the agent configuration to your `.env` file:
16 changes: 16 additions & 0 deletions docs/03-design-optimize-prompts.md
@@ -74,6 +74,13 @@ Now you'll use the Azure Developer CLI to deploy all required Azure resources.

Sign in with your Azure credentials when prompted.

> ⚠️ **Important**
> In some environments, the VS Code integrated terminal may crash or close during the interactive login flow.
> If this happens, authenticate using explicit credentials instead:
> ```powershell
> az login --username <your-username> --password <your-password>
> ```

1. Provision resources:

```powershell
@@ -98,6 +105,15 @@ Now you'll use the Azure Developer CLI to deploy all required Azure resources.
azd env get-values > .env
```

> ⚠️ **Important – File Encoding**
>
> After generating the `.env` file, make sure it is saved using **UTF-8** encoding.
>
> In editors like **VS Code**, check the encoding indicator in the bottom-right corner.
> If it shows **UTF-16 LE** (or any encoding other than UTF-8), click it, choose **Save with Encoding**, and select **UTF-8**.
>
> Using the wrong encoding may cause environment variables to be read incorrectly.

This creates a `.env` file in your project root with all the provisioned resource information.

### Install Python dependencies
99 changes: 71 additions & 28 deletions docs/04-automated-evaluation.md
Expand Up @@ -14,9 +14,9 @@ This exercise takes approximately **40 minutes**.

## Introduction

In this exercise, you'll use Microsoft Foundry's cloud evaluators to automatically assess quality at scale for the Adventure Works Trail Guide Agent. You'll run evaluations against a large test dataset (200 query-response pairs) to validate quality metrics and establish an automated evaluation pipeline for future changes.
In this exercise, you'll use Microsoft Foundry's cloud evaluators to automatically assess quality at scale for the Adventure Works Trail Guide Agent. You'll run evaluations against a large test dataset (89 query-response pairs) to validate quality metrics and establish an automated evaluation pipeline for future changes.

**Scenario**: You're operating the Adventure Works Trail Guide Agent. You want to evaluate it against a large test dataset (200 query-response pairs) to validate quality metrics and establish an automated evaluation pipeline that can scale as your agent evolves.
**Scenario**: You're operating the Adventure Works Trail Guide Agent. You want to evaluate it against a large test dataset (89 query-response pairs) to validate quality metrics and establish an automated evaluation pipeline that can scale as your agent evolves.

You'll use the following evaluation criteria—automated at scale:

@@ -80,6 +80,13 @@ Now you'll use the Azure Developer CLI to deploy all required Azure resources.

Sign in with your Azure credentials when prompted.

> ⚠️ **Important**
> In some environments, the VS Code integrated terminal may crash or close during the interactive login flow.
> If this happens, authenticate using explicit credentials instead:
> ```powershell
> az login --username <your-username> --password <your-password>
> ```

1. Provision resources:

```powershell
@@ -104,6 +111,15 @@ Now you'll use the Azure Developer CLI to deploy all required Azure resources.
azd env get-values > .env
```

> ⚠️ **Important – File Encoding**
>
> After generating the `.env` file, make sure it is saved using **UTF-8** encoding.
>
> In editors like **VS Code**, check the encoding indicator in the bottom-right corner.
> If it shows **UTF-16 LE** (or any encoding other than UTF-8), click it, choose **Save with Encoding**, and select **UTF-8**.
>
> Using the wrong encoding may cause environment variables to be read incorrectly.

This creates a `.env` file in your project root with all the provisioned resource information.

### Install Python dependencies
@@ -159,7 +175,7 @@ Cloud evaluation follows a structured workflow:

### Dataset preparation

The repository includes `data/trail_guide_evaluation_dataset.jsonl` with 200 pre-generated query-response pairs covering diverse hiking scenarios. Each entry includes:
The repository includes `data/trail_guide_evaluation_dataset.jsonl` with 89 pre-generated query-response pairs covering diverse hiking scenarios. Each entry includes:

- `query`: User question
- `response`: Agent-generated answer
@@ -206,7 +222,7 @@ First, examine the prepared dataset structure.
(Get-Content data/trail_guide_evaluation_dataset.jsonl).Count
```

Expected: 200 entries
Expected: 89 entries
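Beyond a raw line count, you may want a quick structural check as well. A small Python sketch, using the `query` and `response` fields described above (the path is the repository's dataset file; extend `required` if your entries carry more fields):

```python
import json
from pathlib import Path

def check_jsonl(path: str, required=("query", "response")) -> int:
    """Count JSONL records and verify each carries the required fields."""
    count = 0
    for lineno, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), 1):
        if not line.strip():
            continue  # tolerate trailing blank lines
        record = json.loads(line)  # raises on malformed JSON
        missing = [f for f in required if f not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        count += 1
    return count

if __name__ == "__main__":
    dataset = "data/trail_guide_evaluation_dataset.jsonl"
    if Path(dataset).exists():
        print(f"{check_jsonl(dataset)} valid entries")
```

Catching a malformed record locally is cheaper than discovering it after the dataset upload step fails in the cloud pipeline.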

### Understand the evaluation pipeline

@@ -219,7 +235,7 @@ The script performs all evaluation steps automatically:
1. **Upload Dataset** - Uploads the JSONL dataset to Microsoft Foundry
2. **Define Evaluation** - Creates evaluation definition with quality evaluators (Intent Resolution, Relevance, Groundedness)
3. **Run Evaluation** - Starts the cloud evaluation run
4. **Poll for Completion** - Waits for evaluation to complete (5-10 minutes for 200 items)
4. **Poll for Completion** - Waits for evaluation to complete (5-10 minutes for 89 items)
5. **Display Results** - Retrieves and shows scoring statistics

This single-script approach makes it easy to run evaluations both locally during development and automatically in CI/CD pipelines.
@@ -279,7 +295,7 @@ Execute the complete evaluation pipeline with one command.
Run ID: run-ghi789rst
Status: running

This may take 5-10 minutes for 200 items...
This may take 5-10 minutes for 89 items...

================================================================================
Step 4: Polling for completion
@@ -297,9 +313,9 @@ Execute the complete evaluation pipeline with one command.
Report URL: https://<account>.services.ai.azure.com/projects/<project>/evaluations/...

Average Scores (1-5 scale, threshold: 3)
Intent Resolution: 4.52 (n=200)
Relevance: 4.41 (n=200)
Groundedness: 4.18 (n=200)
Intent Resolution: 4.52 (n=89)
Relevance: 4.41 (n=89)
Groundedness: 4.18 (n=89)

Pass Rates (score >= 3)
Intent Resolution: 96.0%
@@ -316,7 +332,7 @@ Execute the complete evaluation pipeline with one command.
3. Document key findings and recommendations
```

> **Note**: Evaluation runtime varies based on dataset size and model capacity. 200 items typically takes 5-15 minutes.
> **Note**: Evaluation runtime varies based on dataset size and model capacity. 89 items typically takes 5-15 minutes.
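The average scores and pass rates in the summary above are simple aggregates over per-item scores. A sketch of that computation, with illustrative numbers rather than real dataset results:

```python
def summarize(scores, threshold=3):
    """Return (mean, pass-rate %) for a list of 1-5 evaluator scores."""
    if not scores:
        raise ValueError("no scores to summarize")
    mean = sum(scores) / len(scores)
    pass_rate = 100.0 * sum(1 for s in scores if s >= threshold) / len(scores)
    return round(mean, 2), round(pass_rate, 1)

# Illustrative per-item scores for one evaluator:
mean, pass_rate = summarize([5, 4, 4, 3, 2, 5, 4])
print(f"mean={mean}, pass_rate={pass_rate}%")  # → mean=3.86, pass_rate=85.7%
```

Note the two metrics answer different questions: the mean tracks overall quality drift, while the pass rate against the threshold flags how many individual responses fall below the bar.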

1. **Commit the results file**

@@ -374,27 +390,54 @@ The evaluation script integrates seamlessly into GitHub Actions for automated PR

1. **Configure Azure authentication**

Create a service principal with Foundry project access:
Create a service principal for GitHub Actions:

```powershell
# Create service principal
az ad sp create-for-rbac --name "github-agent-evaluator" `
--role "Azure AI Developer" `
--scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace> `
--sdk-auth
az ad sp create-for-rbac --name "github-agent-evaluator"
```

Configure federated identity for GitHub OIDC:
Note the `appId` value from the output — you will use it in the next steps.

Assign the **Azure AI User** role at the account scope. This role includes the `Microsoft.CognitiveServices/*` wildcard data actions, which cover the `AIServices/agents/write` action required by the Foundry project evaluation API:

```powershell
az role assignment create `
--assignee "<appId>" `
--role "Azure AI User" `
--scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<ai-account-name>"
```

> **Important**: Use the `AZURE_AI_ACCOUNT_NAME` value from your `.env` file as `<ai-account-name>`. The `Azure AI Developer` role is **not sufficient** — it only covers `OpenAI/*`, `SpeechServices/*`, `ContentSafety/*`, and `MaaS/*` data actions, but not `AIServices/agents/write` which the Foundry project API requires.

> **Tip**: If you set the optional `githubActionsPrincipalId` parameter when running `azd up`, the infrastructure deployment will create this role assignment automatically for future environments.

Configure federated identity for GitHub OIDC so the workflow can authenticate without a secret.

Create a file named `federated-credential.json` in your repository root:

```json
{
"name": "github-actions",
"issuer": "https://token.actions.githubusercontent.com",
"subject": "repo:<your-org>/<your-repo>:ref:refs/heads/main",
"audiences": ["api://AzureADTokenExchange"]
}
```

> **Note**: Replace `<your-org>/<your-repo>` with your exact GitHub username and repository name. The subject is case-sensitive and must match exactly.

Register the federated credential using the file:

```powershell
az ad app federated-credential create `
--id <app-id> `
--parameters '{
"name": "github-actions",
"issuer": "https://token.actions.githubusercontent.com",
"subject": "repo:<your-org>/<your-repo>:ref:refs/heads/main",
"audiences": ["api://AzureADTokenExchange"]
}'
--id "<appId>" `
--parameters @federated-credential.json
```

Once the credential is created successfully, delete the file. It contains no secrets, but there is no reason to keep it in the repository:

```powershell
Remove-Item federated-credential.json
```

1. **Review the PR evaluation workflow**
@@ -483,7 +526,7 @@ Document your findings and create an analysis report.

## Evaluation Summary

Evaluated: 200 test cases
Evaluated: 89 test cases
Time: ~10 minutes
Scoring: GPT-4.1 as LLM judge (1-5 scale)

@@ -521,7 +564,7 @@ Document your findings and create an analysis report.

- **Scales** to hundreds/thousands of items efficiently
- **Consistent** scoring criteria across all evaluations
- **Fast** turnaround (10 minutes for 200 items)
- **Fast** turnaround (10 minutes for 89 items)
- **Repeatable** and trackable over time
- **CI/CD ready** for integration into deployment pipelines
- **Detailed reasoning** provided for each score
@@ -580,7 +623,7 @@ Compare evaluation results between GPT-4.1 and GPT-4.1-mini to understand qualit

### Run evaluation on GPT-4.1-mini responses

1. Generate 200 responses from GPT-4.1-mini for the same queries.
1. Generate 89 responses from GPT-4.1-mini for the same queries.

1. Run cloud evaluation on both sets.

@@ -610,7 +653,7 @@ Create `experiments/automated/model_comparison.md` with:

**Resolution**:
- Run `az login` to refresh Azure credentials
- Verify you have **Azure AI User** role on the Foundry project
- Verify the service principal has the **Azure AI User** role at the Cognitive Services account scope. This role carries the `Microsoft.CognitiveServices/*` wildcard data actions required for `AIServices/agents/write`; `Azure AI Developer` alone is **not sufficient**
- Check `AZURE_AI_PROJECT_ENDPOINT` in `.env` file is correct and includes `/api/projects/<project>`

### Evaluator scoring seems inconsistent