Merged
22 commits
8307420
Add agent responses JSON, refactor Bicep resources, enable hosted age…
v-vfarias Feb 26, 2026
c9c85df
Merge pull request #1 from v-vfarias/experiment
v-vfarias Feb 26, 2026
494fc13
Add baseline agent responses JSON for hiking and camping guidance
v-vfarias Feb 26, 2026
fa7441d
Configure agent to use v4 optimized-concise prompt
v-vfarias Feb 26, 2026
5dcb42f
Complete optimized-concise experiment with evaluation
v-vfarias Feb 26, 2026
5739715
Configure agent to use GPT-4.1-mini model with v4 prompt
v-vfarias Feb 26, 2026
0140949
Complete GPT-4.1-mini experiment with evaluation
v-vfarias Feb 26, 2026
61a6d1d
Merge pull request #2 from v-vfarias/try/gpt41mini
v-vfarias Feb 26, 2026
8fa496e
Merge pull request #3 from v-vfarias/experiment
v-vfarias Feb 26, 2026
6a559a0
Remove obsolete agent responses and evaluation files from baseline an…
v-vfarias Feb 26, 2026
42d1ea3
Merge branch 'main' of https://github.com/v-vfarias/mslearn-genaiops
v-vfarias Feb 26, 2026
cdaf955
Enable automated PR evaluations
v-vfarias Feb 26, 2026
aa32091
Enable automated PR evaluations
v-vfarias Feb 26, 2026
f99d6b2
Update model configuration and dataset references
v-vfarias Feb 26, 2026
aeda550
Merge branch 'main' of https://github.com/v-vfarias/mslearn-genaiops
v-vfarias Feb 26, 2026
3e7c8ff
test: Trigger evaluation workflow
v-vfarias Feb 26, 2026
4151ba5
Refactor agent responses and evaluation scripts
v-vfarias Feb 27, 2026
cf600c6
Updated evaluation_results.txt
v-vfarias Feb 27, 2026
da09adc
Updated yml for testing manually
v-vfarias Feb 27, 2026
e9087bd
refactor: Improve error reporting and remove debug prints in evaluati…
v-vfarias Feb 27, 2026
15e8ce7
Clean-up files prior to merge
v-vfarias Feb 27, 2026
a1807af
Updated the file to old version
v-vfarias Feb 27, 2026
2 changes: 1 addition & 1 deletion .github/workflows/evaluate-agent.yml
@@ -82,4 +82,4 @@ jobs:
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
});
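The fragment above is the tail of the workflow's `actions/github-script` step, which posts the evaluation summary as a PR comment. As a rough illustration, the sketch below builds the equivalent GitHub REST request (the endpoint is the real Issues API "create comment" route; the owner/repo/PR-number values are placeholders, not taken from this PR):

```python
# Hedged sketch: construct the REST call the github-script step effectively
# makes when commenting on a PR. Endpoint shape is from the GitHub Issues
# API; the concrete values below are illustrative placeholders.
import json

def build_comment_request(owner: str, repo: str, issue_number: int, body: str):
    # PR comments are posted through the Issues API comment endpoint.
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}/comments"
    payload = json.dumps({"body": body})
    return url, payload

url, payload = build_comment_request("v-vfarias", "mslearn-genaiops", 1, "Evaluation passed")
print(url)
# https://api.github.com/repos/v-vfarias/mslearn-genaiops/issues/1/comments
```

In the actual workflow, `context.repo.owner` and `context.repo.repo` supply the owner and repo automatically, so the step needs no hard-coded values.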
4 changes: 2 additions & 2 deletions docs/02-prompt-management.md
@@ -103,8 +103,8 @@ Now you'll use the Azure Developer CLI to deploy all required Azure resources.
1. Add the agent configuration to your `.env` file:

```
AGENT_NAME=trail-guide
MODEL_NAME=gpt-4.1
AGENT_NAME="trail-guide"
MODEL_NAME="gpt-4.1"
```
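The change above quotes the `.env` values. Typical `.env` loaders strip one layer of matching quotes, so both forms resolve to the same value; quoting just makes values with spaces or special characters safe. A minimal stand-alone sketch of that behavior (not the lab's actual loader):

```python
# Minimal sketch of .env-style parsing: strip one layer of matching quotes,
# so AGENT_NAME=trail-guide and AGENT_NAME="trail-guide" yield the same value.
def parse_env_line(line: str) -> tuple[str, str]:
    key, _, value = line.strip().partition("=")
    value = value.strip()
    # Remove surrounding single or double quotes if they match.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        value = value[1:-1]
    return key.strip(), value

print(parse_env_line('AGENT_NAME="trail-guide"'))  # ('AGENT_NAME', 'trail-guide')
print(parse_env_line('MODEL_NAME=gpt-4.1'))        # ('MODEL_NAME', 'gpt-4.1')
```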

### Install Python dependencies
8 changes: 4 additions & 4 deletions docs/03-design-optimize-prompts.md
@@ -127,8 +127,8 @@ With your Azure resources deployed, install the required Python packages.
Open the `.env` file in your repository root and add:

```
AGENT_NAME=trail-guide
MODEL_NAME=gpt-4.1
AGENT_NAME="trail-guide"
MODEL_NAME="gpt-4.1"
```

## Understand the experimental workflow
@@ -232,7 +232,7 @@ The baseline provides:

Create your baseline evaluation scores.

1. Create `experiments/baseline/evaluation.csv`:
1. Check whether the file was created, or create `experiments/baseline/evaluation.csv`:

```csv
test_prompt,agent_response_excerpt,intent_resolution,relevance,groundedness,comments
@@ -404,7 +404,7 @@ Review the agent responses and create an evaluation CSV with quality scores.
New-Item experiments/optimized-concise/evaluation.csv
```

1. Open the file in VS Code and add the CSV header and scores:
1. Open the file in VS Code and verify or add the CSV header and scores:

```csv
test_prompt,agent_response_excerpt,intent_resolution,relevance,groundedness,comments
10 changes: 10 additions & 0 deletions docs/04-automated-evaluation.md
@@ -318,6 +318,16 @@ Execute the complete evaluation pipeline with one command.

> **Note**: Evaluation runtime varies based on dataset size and model capacity. 200 items typically takes 5-15 minutes.

1. **Commit the results file**

The script writes a summary to `evaluation_results.txt` in your project root. Commit this file so the GitHub Actions workflow can read it when it runs on your PR:

```powershell
git add evaluation_results.txt
git commit -m "Add evaluation results"
git push
```

### Automate with GitHub Actions

The evaluation script integrates seamlessly into GitHub Actions for automated PR evaluations.
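Since the workflow reads the committed `evaluation_results.txt`, a CI step needs to extract fields like the status and report URL from that plain-text summary. A hypothetical sketch of such a parser (the line formats are assumed from the log shown in this PR; this is not the lab's actual script):

```python
# Hypothetical sketch: pull the status and report URL out of the committed
# evaluation_results.txt so a workflow step can surface them in a PR comment.
# Line formats ("Status:", "Report URL :") are assumed from the sample log.
def summarize_results(text: str) -> dict:
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Report URL"):
            summary["report_url"] = line.split(":", 1)[1].strip()
        elif line.startswith("Status:"):
            summary["status"] = line.split(":", 1)[1].strip()
    return summary

sample = "Status: completed\nReport URL : https://ai.azure.com/example"
print(summarize_results(sample))
# {'status': 'completed', 'report_url': 'https://ai.azure.com/example'}
```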
75 changes: 75 additions & 0 deletions evaluation_results.txt
@@ -0,0 +1,75 @@
================================================================================
Trail Guide Agent - Cloud Evaluation
================================================================================

Configuration:
Project: https://ai-account-u2favsrdpp24k.services.ai.azure.com/api/projects/ai-project-dev-tester
Model: gpt-4.1
Dataset: trail-guide-evaluation-dataset (v1)

================================================================================
Step 1: Uploading evaluation dataset
================================================================================

Dataset: trail_guide_evaluation_dataset.jsonl
Uploading...

Dataset version 1 already exists in Foundry.
Retrieving existing dataset ID...
✓ Using existing dataset
Dataset ID: azureai://accounts/ai-account-u2favsrdpp24k/projects/ai-project-dev-tester/data/trail-guide-evaluation-dataset/versions/1

================================================================================
Step 2: Creating evaluation definition
================================================================================

Configuration:
Judge Model: gpt-4.1
Evaluators: Intent Resolution, Relevance, Groundedness

Creating evaluation...

✓ Evaluation definition created
Evaluation ID: eval_7d81e353fce74ab3a51105fac7f59ed8

================================================================================
Step 3: Running cloud evaluation
================================================================================

✓ Evaluation run started
Run ID: evalrun_31596933faa44251a62247465b5d8fbd
Status: in_progress

This may take 5-10 minutes for 200 items...

================================================================================
Step 4: Polling for completion
================================================================================
[3675s] Status: in_progress

✓ Evaluation completed in 3694 seconds

================================================================================
Step 5: Retrieving results
================================================================================

Evaluation Summary
Report URL: https://ai.azure.com/nextgen/r/ZDCVYALRRAyvc0GrsS4tgA,rg-dev-tester,,ai-account-u2favsrdpp24k,ai-project-dev-tester/build/evaluations/eval_7d81e353fce74ab3a51105fac7f59ed8/run/evalrun_31596933faa44251a62247465b5d8fbd

[DEBUG] First item type : <class 'openai.types.evals.runs.output_item_list_response.OutputItemListResponse'>
[DEBUG] First item attrs: ['construct', 'copy', 'created_at', 'datasource_item', 'datasource_item_id', 'dict', 'eval_id', 'from_orm', 'id', 'json', 'model_computed_fields', 'model_config', 'model_construct', 'model_copy', 'model_dump', 'model_dump_json', 'model_extra', 'model_fields', 'model_fields_set', 'model_json_schema', 'model_parametrized_name', 'model_post_init', 'model_rebuild', 'model_validate', 'model_validate_json', 'model_validate_strings', 'object', 'parse_file', 'parse_obj', 'parse_raw', 'results', 'run_id', 'sample', 'schema', 'schema_json', 'status', 'to_dict', 'to_json', 'update_forward_refs', 'validate']
[DEBUG] item.__dict__: {'id': '1', 'created_at': 1772201992, 'datasource_item': {'query': 'What essential gear do I need for a summer day hike?', 'response': 'For a summer day hike, essential gear includes: proper hiking boots with good ankle support, moisture-wicking clothing in layers, a daypack (20-30L), 2 liters of water, high-energy snacks, sun protection (hat, sunglasses, sunscreen SPF 30+), a basic first aid kit, map and compass or GPS device, headlamp with extra batteries, and a whistle for emergencies. Always check the weather forecast and trail conditions before heading out.', 'ground_truth': 'Essential day hike gear includes footwear, water, food, sun protection, navigation tools, first aid, and emergency supplies.'}, 'datasource_item_id': 0, 'eval_id': 'eval_7d81e353fce74ab3a51105fac7f59ed8', 'object': 'eval.run.output_item', 'results': [Result(name='intent_resolution', passed=True, score=5.0, sample={'usage': {'prompt_tokens': 1984, 'completion_tokens': 61, 'total_tokens': 2045}, 'finish_reason': 'stop', 'model': 'gpt-4.1-2025-04-14', 'input': [{'role': 'user', 'content': '{"query": "What essential gear do I need for a summer day hike?", "response": "For a summer day hike, essential gear includes: proper hiking boots with good ankle support, moisture-wicking clothing in layers, a daypack (20-30L), 2 liters of water, high-energy snacks, sun protection (hat, sunglasses, sunscreen SPF 30+), a basic first aid kit, map and compass or GPS device, headlamp with extra batteries, and a whistle for emergencies. Always check the weather forecast and trail conditions before heading out.", "tool_definitions": null}'}], 'output': [{'role': 'assistant', 'content': '{\n "explanation": "The user asked for essential gear for a summer day hike. 
The agent provided a thorough, accurate, and relevant list, including safety, hydration, navigation, and sun protection, fully resolving the user\'s intent with no notable omissions.",\n "score": 5\n}'}]}, type='azure_ai_evaluator', metric='intent_resolution', label='pass', reason="The user asked for essential gear for a summer day hike. The agent provided a thorough, accurate, and relevant list, including safety, hydration, navigation, and sun protection, fully resolving the user's intent with no notable omissions.", threshold=3), Result(name='relevance', passed=True, score=5.0, sample={'usage': {'prompt_tokens': 1678, 'completion_tokens': 63, 'total_tokens': 1741}, 'finish_reason': 'stop', 'model': 'gpt-4.1-2025-04-14', 'input': [{'role': 'user', 'content': '{"query": "What essential gear do I need for a summer day hike?", "response": "For a summer day hike, essential gear includes: proper hiking boots with good ankle support, moisture-wicking clothing in layers, a daypack (20-30L), 2 liters of water, high-energy snacks, sun protection (hat, sunglasses, sunscreen SPF 30+), a basic first aid kit, map and compass or GPS device, headlamp with extra batteries, and a whistle for emergencies. Always check the weather forecast and trail conditions before heading out."}'}], 'output': [{'role': 'assistant', 'content': '{\n "explanation": "The response provides a thorough list of essential gear for a summer day hike, including clothing, hydration, navigation, safety, and sun protection. It also adds practical advice about checking weather and trail conditions, offering both completeness and useful context.",\n "score": 5\n}'}]}, type='azure_ai_evaluator', metric='relevance', label='pass', reason='The response provides a thorough list of essential gear for a summer day hike, including clothing, hydration, navigation, safety, and sun protection. 
It also adds practical advice about checking weather and trail conditions, offering both completeness and useful context.', threshold=3), Result(name='groundedness', passed=True, score=5.0, sample={'usage': {'prompt_tokens': 1571, 'completion_tokens': 106, 'total_tokens': 1677}, 'finish_reason': 'stop', 'model': 'gpt-4.1-2025-04-14', 'input': [{'role': 'user', 'content': '{"query": "What essential gear do I need for a summer day hike?", "response": "For a summer day hike, essential gear includes: proper hiking boots with good ankle support, moisture-wicking clothing in layers, a daypack (20-30L), 2 liters of water, high-energy snacks, sun protection (hat, sunglasses, sunscreen SPF 30+), a basic first aid kit, map and compass or GPS device, headlamp with extra batteries, and a whistle for emergencies. Always check the weather forecast and trail conditions before heading out.", "context": "For a summer day hike, essential gear includes: proper hiking boots with good ankle support, moisture-wicking clothing in layers, a daypack (20-30L), 2 liters of water, high-energy snacks, sun protection (hat, sunglasses, sunscreen SPF 30+), a basic first aid kit, map and compass or GPS device, headlamp with extra batteries, and a whistle for emergencies. Always check the weather forecast and trail conditions before heading out."}'}], 'output': [{'role': 'assistant', 'content': "<S0>Let's think step by step: The query asks for the essential gear needed for a summer day hike. The context provides a detailed list of essential gear and some additional advice. The response repeats the context almost verbatim, listing all the gear and advice without omitting or adding any information. 
There are no inaccuracies or missing details.</S0>\n<S1>The response is fully accurate, complete, and directly grounded in the provided context, addressing the query thoroughly.</S1>\n<S2>5</S2>"}]}, type='azure_ai_evaluator', metric='groundedness', label='pass', reason='The response is fully accurate, complete, and directly grounded in the provided context, addressing the query thoroughly.', threshold=3)], 'run_id': 'evalrun_31596933faa44251a62247465b5d8fbd', 'sample': Sample(error=None, finish_reason='stop', input=[SampleInput(content='{"query": "What essential gear do I need for a summer day hike?", "response": "For a summer day hike, essential gear includes: proper hiking boots with good ankle support, moisture-wicking clothing in layers, a daypack (20-30L), 2 liters of water, high-energy snacks, sun protection (hat, sunglasses, sunscreen SPF 30+), a basic first aid kit, map and compass or GPS device, headlamp with extra batteries, and a whistle for emergencies. Always check the weather forecast and trail conditions before heading out.", "context": "For a summer day hike, essential gear includes: proper hiking boots with good ankle support, moisture-wicking clothing in layers, a daypack (20-30L), 2 liters of water, high-energy snacks, sun protection (hat, sunglasses, sunscreen SPF 30+), a basic first aid kit, map and compass or GPS device, headlamp with extra batteries, and a whistle for emergencies. Always check the weather forecast and trail conditions before heading out."}', role='user')], max_completion_tokens=None, model='gpt-4.1-2025-04-14', output=[SampleOutput(content="<S0>Let's think step by step: The query asks for the essential gear needed for a summer day hike. The context provides a detailed list of essential gear and some additional advice. The response repeats the context almost verbatim, listing all the gear and advice without omitting or adding any information. 
There are no inaccuracies or missing details.</S0>\n<S1>The response is fully accurate, complete, and directly grounded in the provided context, addressing the query thoroughly.</S1>\n<S2>5</S2>", role='assistant')], seed=None, temperature=None, top_p=None, usage=SampleUsage(cached_tokens=None, completion_tokens=106, prompt_tokens=1571, total_tokens=1677)), 'status': 'completed'}
================================================================================
Trail Guide Agent - Evaluation Results
================================================================================

Report URL : https://ai.azure.com/nextgen/r/ZDCVYALRRAyvc0GrsS4tgA,rg-dev-tester,,ai-account-u2favsrdpp24k,ai-project-dev-tester/build/evaluations/eval_7d81e353fce74ab3a51105fac7f59ed8/run/eva

================================================================================
Cloud evaluation complete
================================================================================

Next steps:
1. Review detailed results in Azure AI Foundry portal
2. Analyze patterns in successful and failed evaluations
3. Commit evaluation_results.txt and push so the PR workflow can use it
6 changes: 0 additions & 6 deletions experiments/baseline/evaluation.csv

This file was deleted.

6 changes: 0 additions & 6 deletions experiments/gpt41mini/evaluation.csv

This file was deleted.

6 changes: 0 additions & 6 deletions experiments/optimized-concise/evaluation.csv

This file was deleted.
