Replies: 2 comments
Here is the diff I am seeing
----------------------------- Captured stdout call -----------------------------
Summary: `EvalStatus.FAILED` for Metric: `tool_trajectory_avg_score`. Expected threshold: `0.5`, actual value: `0.0`.
+----+-------------------+---------+-------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| | eval_status | score | threshold | prompt | expected_response | actual_response | expected_tool_calls | actual_tool_calls |
+====+===================+=========+=============+===========================+===========================+===========================+===========================+===========================+
| 0 | EvalStatus.FAILED | 0 | 0.5 | what is https://github.co | This repository contains | This repository is the | id=None | id='adk-18c8583a-2d21-4de |
| | | | | m/VandanaJn/chatbot- | a backend service for a | backend for a chatbot. It | args={'github_url': 'http | 9-a9e4-a860368cbae8' |
| | | | | backend about | chatbot. The key | contains an `app.py`, | s://github.com/VandanaJn/ | args={'github_url': 'http |
| | | | | | components are: - | which is likely the main | chatbot-backend'} name='e | s://github.com/VandanaJn/ |
| | | | | | **app.py**: The main | application entry point, | xtract_owner_and_repo' | chatbot-backend'} name='e |
| | | | | | entry point for the | and | partial_args=None | xtract_owner_and_repo' |
| | | | | | service. - **milvus_integ | `milvus_integration.py`, | will_continue=None | partial_args=None |
| | | | | | ration.py**: A module | which suggests it | id=None | will_continue=None id='ad |
| | | | | | designed to connect to a | integrates with the | args={'agent_name': 'code | k-72ffc278-3c98-4ef3-baf4 |
| | | | | | Milvus vector database, | Milvus vector database | _architecture_agent'} | -8a4553450277' |
| | | | | | which is used for vector | for its operations. | name='transfer_to_agent' | args={'github_url': 'http |
| | | | | | storage and similarity | | partial_args=None | s://github.com/VandanaJn/ |
| | | | | | search. | | will_continue=None | chatbot-backend', |
| | | | | | | | id=None | 'original_user_question': |
| | | | | | | | args={'repo_name': | 'what is https://github.c |
| | | | | | | | 'chatbot-backend', | om/VandanaJn/chatbot- |
| | | | | | | | 'owner': 'VandanaJn', | backend about', |
| | | | | | | | 'max_depth': 2} | 'agent_name': 'code_archi |
| | | | | | | | name='get_repo_structure' | tecture_agent', 'owner': |
| | | | | | | | partial_args=None | 'VandanaJn', 'repo': |
| | | | | | | | will_continue=None | 'chatbot-backend'} |
| | | | | | | | | name='transfer_to_agent' |
| | | | | | | | | partial_args=None |
| | | | | | | | | will_continue=None id='ad |
| | | | | | | | | k-3c4f1994-5ea8-4f99-a945 |
| | | | | | | | | -2e1d7eed596f' |
| | | | | | | | | args={'owner': |
| | | | | | | | | 'VandanaJn', 'max_depth': |
| | | | | | | | | 2, 'repo_name': 'chatbot- |
| | | | | | | | | backend'} |
| | | | | | | | | name='get_repo_structure' |
| | | | | | | | | partial_args=None |
| | | | | | | | | will_continue=None |
And this was my json
{
  "eval_set_id": "converted_eval_set_v2",
  "name": "Converted Evaluation Set V2",
  "creation_timestamp": 1678886400.0,
  "eval_cases": [
    {
      "eval_id": "case_2",
      "conversation": [
        {
          "invocation_id": "invocation-case-2",
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "what is https://github.com/VandanaJn/chatbot-backend about"
              }
            ]
          },
          "final_response": {
            "role": "model",
            "parts": [
              {
                "text": "This repository contains a backend service for a chatbot. The key components are: - **app.py**: The main entry point for the service. - **milvus_integration.py**: A module designed to connect to a Milvus vector database, which is used for vector storage and similarity search."
              }
            ]
          },
          "intermediate_data": {
            "tool_uses": [
              {
                "name": "extract_owner_and_repo",
                "args": {
                  "github_url": "https://github.com/VandanaJn/chatbot-backend"
                }
              },
              {
                "name": "transfer_to_agent",
                "args": {
                  "agent_name": "code_architecture_agent",
                  "github_url": "https://github.com/VandanaJn/chatbot-backend",
                  "original_user_question": "what is https://github.com/VandanaJn/chatbot-backend about"
                }
              },
              {
                "name": "get_repo_structure",
                "args": {
                  "repo_name": "chatbot-backend",
                  "owner": "VandanaJn",
                  "max_depth": 2
                }
              }
            ],
            "intermediate_responses": []
          },
          "creation_timestamp": 1678886403.0
        }
      ]
    }
  ]
}
And test config
{
  "criteria": {
    "tool_trajectory_avg_score": 0.5
  }
}
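The bot reply quoted further down says a match type can be set in the evaluation configuration. As a rough sketch of what that might look like, the snippet below builds a config where the criterion is an object instead of a bare threshold. The nested shape and the `threshold` and `match_type` key names are assumptions here and should be checked against the ADK evaluation docs for the installed version; the output filename is only a placeholder.

```python
import json

# ASSUMPTION: the criterion accepts an object with "threshold" and
# "match_type" keys. Verify these key names against the ADK docs.
config = {
    "criteria": {
        "tool_trajectory_avg_score": {
            "threshold": 0.5,
            "match_type": "IN_ORDER",  # assumed value, one of EXACT / IN_ORDER / ANY_ORDER
        }
    }
}

config_text = json.dumps(config, indent=2)
print(config_text)

# To use it, write config_text to the test config file that sits next to
# the eval set (placeholder name):
# with open("test_config.json", "w") as fh:
#     fh.write(config_text)
```

A looser match type would only be expected to help with extra or reordered tool calls. It would not be expected to help when the arguments of a matching call differ.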
Please suggest what I can do. I will need an example, as I have spent quite some time on this. I can't get the automatically generated ids.
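Before chasing the ids, it may be worth checking whether anything else differs. Below is a minimal sketch that compares the two trajectories by tool name and arguments only. The lists are transcribed by hand from the captured output above, with ids left out, and `diff_tool_calls` is a hypothetical helper, not part of ADK. Whether the evaluator itself ignores ids is an assumption that should be confirmed in the trajectory evaluator source.

```python
# Tool calls transcribed from the captured stdout table (ids omitted).
URL = "https://github.com/VandanaJn/chatbot-backend"

expected = [
    {"name": "extract_owner_and_repo", "args": {"github_url": URL}},
    {"name": "transfer_to_agent", "args": {"agent_name": "code_architecture_agent"}},
    {"name": "get_repo_structure",
     "args": {"repo_name": "chatbot-backend", "owner": "VandanaJn", "max_depth": 2}},
]
actual = [
    {"name": "extract_owner_and_repo", "args": {"github_url": URL}},
    {"name": "transfer_to_agent",
     "args": {
         "github_url": URL,
         "original_user_question": "what is " + URL + " about",
         "agent_name": "code_architecture_agent",
         "owner": "VandanaJn",
         "repo": "chatbot-backend",
     }},
    {"name": "get_repo_structure",
     "args": {"owner": "VandanaJn", "max_depth": 2, "repo_name": "chatbot-backend"}},
]


def diff_tool_calls(expected, actual):
    """Hypothetical helper: list differences in name and args, ignoring ids."""
    problems = []
    if len(expected) != len(actual):
        problems.append(f"call count differs: expected {len(expected)}, got {len(actual)}")
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp["name"] != act["name"]:
            problems.append(f"call {i}: name {exp['name']!r} != {act['name']!r}")
            continue
        # Dict key order does not matter; only keys and values are compared.
        for key in sorted(set(exp["args"]) | set(act["args"])):
            if key not in exp["args"]:
                problems.append(f"call {i} ({act['name']}): unexpected arg {key!r}")
            elif key not in act["args"]:
                problems.append(f"call {i} ({act['name']}): missing arg {key!r}")
            elif exp["args"][key] != act["args"][key]:
                problems.append(f"call {i} ({act['name']}): arg {key!r} differs")
    return problems


differences = diff_tool_calls(expected, actual)
for line in differences:
    print(line)
```

With this data, every reported difference is in the `transfer_to_agent` call: the agent passes arguments that the expected column of the captured output does not list. The posted JSON does include `github_url` and `original_user_question`, but even against that version `owner` and `repo` would still be unexpected. The other two calls have identical arguments, just in a different key order.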
On Wednesday, December 3, 2025 3:11 PM, adk-bot wrote (Subject: Re: [google/adk-python] AgentEvaluator.evaluate trajectory avg score fails (Discussion #3809)):
Response from ADK Answering Agent (experimental, answer may be inaccurate)
TLDR: The tool_trajectory_avg_score metric uses an exact match by default, so any difference in tool call arguments, including IDs, will result in a score of 0.0.
________________________________
Hello! I see you're having some trouble with the tool_trajectory_avg_score.
The behavior you're observing is expected because the tool_trajectory_avg_score metric performs an exact match on the tool call trajectory by default [1, 2]. This means that for the score to be higher than 0.0, the tool calls in the actual trajectory must perfectly match the tool calls in the expected trajectory, including the tool name, arguments, and the order of the calls.
In your case, even though the tool calls might look similar, the differing IDs in the arguments are causing the exact match to fail, which results in a score of 0.0 [2].
To resolve this, you have a couple of options:
1. Align IDs in your test data: If the specific ID values are important for your evaluation, ensure that the expected tool calls in your test data have the exact same IDs as the ones generated by your agent.
2. Use a different match type: If the exact order and arguments are not strictly necessary for your evaluation, you can configure the evaluator to use a less strict match type. The available match types are:
* EXACT: (Default) The actual tool calls must be a perfect match with the expected tool calls.
* IN_ORDER: The actual tool calls must contain the expected tool calls in the same order, but other tool calls can occur in between.
* ANY_ORDER: The actual tool calls must contain the expected tool calls in any order.
You can specify the match type in your evaluation configuration [2].
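To make the three match types concrete, here is a small standalone sketch of the semantics as described in the list above. It is not ADK's implementation. The function names, the `call` helper, and the example tools (`search`, `fetch`, `summarize`) are all hypothetical. The averaging step assumes each invocation scores either 1.0 or 0.0, which would explain why a single failing invocation yields exactly 0.0 and not a partial score.

```python
def exact_match(expected, actual):
    """Same calls, same count, same order."""
    return expected == actual


def in_order_match(expected, actual):
    """Expected calls appear in actual as a subsequence; extra calls allowed."""
    remaining = iter(actual)
    # The shared iterator only moves forward, which enforces the ordering.
    return all(any(call == seen for seen in remaining) for call in expected)


def any_order_match(expected, actual):
    """Each expected call is found somewhere in actual; order ignored."""
    unmatched = list(actual)
    for call in expected:
        if call not in unmatched:
            return False
        unmatched.remove(call)
    return True


def average_score(per_invocation_matches):
    """ASSUMPTION: each invocation contributes 1.0 or 0.0; the metric is the mean."""
    scores = [1.0 if matched else 0.0 for matched in per_invocation_matches]
    return sum(scores) / len(scores)


def call(name, **args):
    """Hypothetical helper: build a tool call as a plain dict."""
    return {"name": name, "args": args}


expected = [call("search", query="adk"), call("summarize", max_words=50)]
with_extra = [call("search", query="adk"),
              call("fetch", url="https://example.com"),
              call("summarize", max_words=50)]
reordered = [call("summarize", max_words=50), call("search", query="adk")]
changed_args = [call("search", query="adk"), call("summarize", max_words=80)]

for label, actual in [("with_extra", with_extra),
                      ("reordered", reordered),
                      ("changed_args", changed_args)]:
    print(label,
          exact_match(expected, actual),
          in_order_match(expected, actual),
          any_order_match(expected, actual))
```

Under these semantics an extra call passes IN_ORDER and ANY_ORDER, a reordered trajectory passes only ANY_ORDER, and a call whose arguments changed fails all three. With one invocation in the eval case, `average_score([False])` is 0.0, so there is no partial credit for the calls that did match.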
I hope this helps clarify why you are seeing a score of 0.0!
________________________________
[1] ADK Evaluate Documentation: https://google.github.io/adk-docs/evaluate/
[2] ADK Trajectory Evaluator Source: https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/trajectory_evaluator.py
Has anyone seen behavior where you get an error like "tool_trajectory_avg_score for None Failed. Expected 0.5, but got 0.0." even though, comparing expected vs. actual, everything looks the same except for the ids? Why does it give 0? That doesn't make sense; at the least it should have been a little higher than 0.