
Fix builtin evaluator issues flagged in PR #531 #598

Open
nadheesh wants to merge 1 commit into wso2:main from nadheesh:main

Conversation


@nadheesh nadheesh commented Mar 19, 2026

Summary

Fixes all issues tracked in #547.

SDK (amp-evaluation)

  • Guard zero/negative task.constraints overrides before division in TokenEfficiencyEvaluator and IterationCountEvaluator — prevents ZeroDivisionError when task constraints supply an invalid value
  • Raise TypeError at init time in _init_function_params when a required Param (no default) is not provided, instead of deferring the error silently to call time
  • Replace self._format_success_criteria(task) in InstructionFollowingEvaluator.build_prompt with an equivalent inline expression using chr(10) to stay compatible with Python < 3.12
  • Clarify evaluator_schema.py docs: only comprehension expressions are supported inside {}, not loop statements
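
The division guard in the first bullet could be sketched as follows (hypothetical helper names; the actual guards live in the evaluators in `standard.py`):

```python
def effective_limit(override, configured_default):
    """Fall back to the configured limit when a task-constraint override
    is missing, zero, or negative (mirrors the <= 0 guard described above)."""
    if override is None or override <= 0:
        return configured_default
    return override

def token_efficiency_score(tokens_used, constraint_max_tokens, configured_max_tokens):
    # Dividing by effective_limit() can no longer raise ZeroDivisionError,
    # because the divisor is always a positive number.
    limit = effective_limit(constraint_max_tokens, configured_max_tokens)
    return max(0.0, 1.0 - tokens_used / limit)
```

With a constraint override of 0 and a configured limit of 200, 100 tokens used still produces a valid score instead of a crash.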
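
The fail-fast check for required Params could be sketched like this (simplified, with hypothetical names; the real logic sits in `_init_function_params`):

```python
_NO_DEFAULT = object()  # sentinel standing in for the library's own marker

class Param:
    """Minimal stand-in for a function parameter descriptor."""
    def __init__(self, default=_NO_DEFAULT):
        self.default = default

def init_function_params(descriptors, config):
    """Raise at init time for required Params (no default) not supplied,
    instead of deferring the error silently to call time."""
    missing = [name for name, p in descriptors.items()
               if p.default is _NO_DEFAULT and name not in config]
    if missing:
        raise TypeError(f"missing required parameter(s): {', '.join(missing)}")
```

A descriptor with no default now fails loudly at construction; one with a default stays optional.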

Generator (generate-builtin-evaluators.sh)

  • Handle `return prompt` patterns (an f-string assigned to a variable and then returned) that previously produced an empty Source in the catalog
  • Decode Python escape sequences (\\n → newline) in captured conditional f-string values before inlining into prompt templates — fixes literal \\nTask: and \\nContext: in generated catalog entries
  • Skip inlining conditional sections whose value depends solely on self.* config params; removes the spurious empty Context: line from safety/tone templates
  • Create output directory for LLM_JUDGE_BASE_FILE before writing to prevent failures when --output points to a non-default location
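
The escape-sequence decoding in the second bullet could be done with the standard `unicode_escape` codec (an illustrative sketch, not the script's exact code):

```python
import codecs

# A value captured by regex from Python source carries its escapes as
# literal characters: a real backslash followed by "n", not a newline.
captured = "\\nTask: {task}\\nContext: {context}"

# Decode the escapes into real control characters before inlining the
# value into a prompt template.
decoded = codecs.decode(captured, "unicode_escape")
```

Note that `unicode_escape` assumes Latin-1 for non-ASCII bytes, which is harmless for these ASCII template fragments.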

Generated

  • Regenerate builtin_evaluators.go to reflect all template fixes

Test plan

  • python -m pytest in libs/amp-evaluation — 696 passed
  • go build ./... in agent-manager-service — clean
  • Verify instruction_following catalog entry shows Success criteria: {(...)} expression
  • Verify path_efficiency and safety/tone catalog entries no longer contain \\nTask: or \\nContext:

Summary by CodeRabbit

  • Bug Fixes

    • Fixed token and iteration efficiency evaluators to use configured defaults when constraint values are zero or negative.
    • Corrected prompt formatting for newlines and success criteria display in LLM-judge evaluators.
  • Documentation

    • Clarified that only Python expressions (not loop statements) are supported in prompt templates.
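
As a hypothetical illustration of that rule, combined with the `chr(10)` workaround from the summary (backslashes are not allowed inside f-string braces before Python 3.12):

```python
criteria = ["cite sources", "answer in JSON"]

# Allowed inside {}: any expression, including a comprehension.
# chr(10) stands in for "\n", keeping the f-string valid on Python < 3.12.
prompt = f"Success criteria:{chr(10)}{chr(10).join('- ' + c for c in criteria)}"

# Not allowed inside {}: loop *statements*, e.g.
#   f"{for c in criteria: ...}"   # SyntaxError
```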


coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

This PR hardens evaluators against invalid constraint values by adding zero-and-negative guards for iteration/token limits, improves prompt template extraction in the code generator, enforces required parameter validation during decorator initialization, refines LLM judge prompt formatting, and expands test coverage for fallback scenarios.

Changes

Cohort / File(s): Summary

  • Builtin Evaluator Sources (agent-manager-service/catalog/builtin_evaluators.go)
    Added guards in embedded Python code for the iteration_efficiency and token_efficiency evaluators to reset non-positive constraint values to configured defaults before computing scores. Modified LLM judge prompts (completeness, helpfulness, instruction_following, etc.) to fix newline escaping and update success-criteria rendering logic.
  • Evaluator Generator Script (agent-manager-service/scripts/generate-builtin-evaluators.sh)
    Enhanced _extract_prompt_template() to track intermediate f-string assignments, extract the first (rather than the last) template match, decode escape sequences in conditional blocks, skip self-referential conditionals, and create the output directory before file generation.
  • Evaluation Library, Core & Base (libs/amp-evaluation/src/amp_evaluation/evaluators/base.py, libs/amp-evaluation/src/amp_evaluation/codegen/evaluator_schema.py)
    Added required-parameter validation in _FunctionParamsMixin._init_function_params() to raise TypeError for missing required Param descriptors. Updated guide text to clarify that only Python expressions, not loop statements, are supported in curly-brace expressions.
  • Evaluation Library, Builtin Implementations (libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py, libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/llm_judge.py)
    Added guards in TokenEfficiencyEvaluator and IterationCountEvaluator.evaluate() to reset zero or negative constraint values to configured limits. Inlined success-criteria formatting logic in InstructionFollowingEvaluator.build_prompt() with newline-delimited bullet formatting and fallback text.
  • Test Coverage (libs/amp-evaluation/tests/test_evaluators_builtin_core.py, libs/amp-evaluation/tests/test_llm_as_judge.py)
    Added tests for zero-constraint fallback behavior in the token and iteration efficiency evaluators, plus tests verifying that missing required Param descriptors raise TypeError and that supplied required parameters are properly stored in config.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related issues

  • Fix generated builtin evaluator issues flagged in PR #531 #547: This PR directly addresses the issue by implementing guard clauses for non-positive max_iterations/max_tokens, fixing prompt template extraction and newline escaping, updating success-criteria formatting, and enforcing required parameter validation in the evaluation library.

Poem

🐰 A rabbit's ode to safer bounds,
Where zero guards prevent confounds,
Each template now extracted true,
Required params checked right through,
Prompts escape-free, scores rebound! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 68.75%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check: ❓ Inconclusive. The description provides a clear summary of the key fixes across SDK, generator, and generated code, plus a test plan, but it does not follow the full required template (it is missing Purpose, Goals, Approach, and other sections). Resolution: consider expanding the description with the standard sections (Purpose with issue links, Goals, Approach) for consistency with repository conventions.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed. The title clearly and specifically references fixing issues flagged in PR #531, which directly relates to the core objectives of the changeset, including evaluator guards, error handling, and template fixes.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
agent-manager-service/scripts/generate-builtin-evaluators.sh (1)

170-206: Prefer the returned AST node over matches[0].

You now collect the actual returned JoinedStr in fstrings, but Line 205 still assumes the first triple-quoted regex match is the prompt. A future helper like header = f"""...""" before return prompt will extract the wrong template even though the AST already knows which f-string is returned.
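
One way to act on this suggestion could be sketched as follows (assuming `ast.unparse` output is an acceptable template source; the variable names are hypothetical, not the script's actual code):

```python
import ast
import textwrap

# A function with a decoy f-string assigned before the one it returns.
src = textwrap.dedent('''
    def build_prompt(task):
        header = f"""HEADER"""
        prompt = f"""Evaluate: {task}"""
        return prompt
''')

func = ast.parse(src).body[0]
assigned = {}   # variable name -> JoinedStr assigned to it
returned = None
for node in ast.walk(func):
    if isinstance(node, ast.Assign) and isinstance(node.value, ast.JoinedStr):
        for target in node.targets:
            if isinstance(target, ast.Name):
                assigned[target.id] = node.value
    elif isinstance(node, ast.Return):
        if isinstance(node.value, ast.JoinedStr):
            returned = node.value                    # return f"""..."""
        elif isinstance(node.value, ast.Name):
            returned = assigned.get(node.value.id)   # return prompt

# Prefer the AST-identified f-string; fall back to regex only when None.
template = ast.unparse(returned) if returned is not None else None
```

Here the AST pins the returned `prompt` f-string, so the decoy `header` assignment can never be extracted by mistake.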

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent-manager-service/scripts/generate-builtin-evaluators.sh` around lines
170 - 206, The code currently finds returned f-strings in the AST (fstrings /
assigned_fstrings) but then ignores them and extracts the template by regex from
the raw source (fstring_pattern / matches), which can pick the wrong
triple-quoted string; instead, locate the actual JoinedStr node from fstrings
(the returned AST node from func) and reconstruct the template from that node
(e.g., by mapping its Constant/FormattedValue parts back to source slices or
joining their .s/.value representations) rather than using matches[0]; update
the logic that sets template to prefer the AST-derived fstrings[0] content and
only fall back to regex matches when no suitable AST node exists.
libs/amp-evaluation/tests/test_evaluators_builtin_core.py (1)

483-497: Consider covering -1 as well as 0.

These regressions only exercise the zero case, but the implementation guards <= 0. Parameterizing the tests with 0 and -1 would pin the negative-constraint path too.

Also applies to: 535-551
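
The parametrization the reviewer suggests could look like this (hypothetical test and helper names; the fallback helper stands in for the evaluator under test):

```python
import pytest

def effective_max_tokens(override, configured):
    # mirrors the <= 0 guard under test
    return configured if override is None or override <= 0 else override

@pytest.mark.parametrize("max_tokens_override", [0, -1])
def test_non_positive_constraint_falls_back_to_config(max_tokens_override):
    # both the zero and the negative path fall back to the configured limit
    assert effective_max_tokens(max_tokens_override, 200) == 200
```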

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/amp-evaluation/tests/test_evaluators_builtin_core.py` around lines 483 -
497, Parameterize the failing-edge test to cover both 0 and -1: update
test_zero_task_constraint_falls_back_to_config to use pytest.mark.parametrize
with max_tokens_override values (0, -1), construct
Constraints.model_construct(max_tokens=max_tokens_override) and assert the
evaluator (TokenEfficiencyEvaluator.max_tokens=200) still yields passed True and
score 1.0; apply the same parametric change to the other analogous zero-case
test in the same test module that currently only checks 0 so the
negative-constraint path (<=0) is exercised as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 00361a0b-05aa-475f-8a2e-fdef9333f845

📥 Commits

Reviewing files that changed from the base of the PR and between bff3514 and c1056d0.

📒 Files selected for processing (8)
  • agent-manager-service/catalog/builtin_evaluators.go
  • agent-manager-service/scripts/generate-builtin-evaluators.sh
  • libs/amp-evaluation/src/amp_evaluation/codegen/evaluator_schema.py
  • libs/amp-evaluation/src/amp_evaluation/evaluators/base.py
  • libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/llm_judge.py
  • libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py
  • libs/amp-evaluation/tests/test_evaluators_builtin_core.py
  • libs/amp-evaluation/tests/test_llm_as_judge.py

Comment on lines +742 to +751
        # Raise immediately for required Params (no default) that were not provided
        missing_required = [
            name
            for name, p in self._func_param_descriptors.items()
            if p.default is _NO_DEFAULT and name not in self._func_config
        ]
        if missing_required:
            raise TypeError(
                f"Evaluator function '{func.__name__}' missing required parameter(s): {', '.join(missing_required)}"
            )

⚠️ Potential issue | 🟠 Major

Preserve required function Params when cloning with with_config().

Line 742 now raises before _FunctionParamsMixin.with_config() rehydrates the existing _func_config, so any FunctionEvaluator/FunctionLLMJudge with a required function Param becomes unclonable. Even judge.with_config(criteria="x") still fails because criteria is only merged after construction.

💡 Suggested fix
-        new_instance = type(self)(self.func, name=self.name, **merged_class)
+        new_instance = type(self)(
+            self.func,
+            name=self.name,
+            **merged_class,
+            **self._func_config,
+            **func_overrides,
+        )
         new_instance.description = self.description
         new_instance.tags = list(self.tags)
         new_instance.version = self.version
         if self._aggregations:
             new_instance._aggregations = list(self._aggregations)
-        new_instance._func_config = {**self._func_config, **func_overrides}
         return new_instance
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/amp-evaluation/src/amp_evaluation/evaluators/base.py` around lines 742 -
751, The missing_required check runs before a clone via
_FunctionParamsMixin.with_config() can rehydrate/merge the existing
_func_config, causing required Params to appear missing; update the logic so
that _func_config is fully merged/rehydrated (via
_FunctionParamsMixin.with_config() or the same merge routine) before computing
missing_required from _func_param_descriptors and _func_config (affecting
FunctionEvaluator/FunctionLLMJudge cloning and the error raised referencing
func.__name__); in practice move or defer the missing_required
calculation/TypeError raise until after the config-merge step (or explicitly
perform the merge first) so judge.with_config(...) and similar clones no longer
fail.
