
Fix builtin evaluator issues flagged in PR #531 #598

Open
nadheesh wants to merge 1 commit into wso2:main from nadheesh:main

Conversation


@nadheesh nadheesh commented Mar 19, 2026

Summary

Fixes all issues tracked in #547.

SDK (amp-evaluation)

  • Guard zero/negative task.constraints overrides before division in TokenEfficiencyEvaluator and IterationCountEvaluator — prevents ZeroDivisionError when task constraints supply an invalid value
  • Raise TypeError at init time in _init_function_params when a required Param (no default) is not provided, instead of deferring the error silently to call time
  • Replace self._format_success_criteria(task) in InstructionFollowingEvaluator.build_prompt with an equivalent inline expression using chr(10) to stay compatible with Python < 3.12
  • Clarify evaluator_schema.py docs: only comprehension expressions are supported inside {}, not loop statements
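
The division guard in the first bullet could be sketched as follows (hypothetical helper names; the actual guards live in the evaluators in `standard.py`):

```python
def effective_limit(override, configured_default):
    """Fall back to the configured limit when a task-constraint override
    is missing, zero, or negative (mirrors the <= 0 guard described above)."""
    if override is None or override <= 0:
        return configured_default
    return override

def token_efficiency_score(tokens_used, constraint_max_tokens, configured_max_tokens):
    # Dividing by effective_limit() can no longer raise ZeroDivisionError,
    # because the divisor is always a positive number.
    limit = effective_limit(constraint_max_tokens, configured_max_tokens)
    return max(0.0, 1.0 - tokens_used / limit)
```

With a constraint override of 0 and a configured limit of 200, 100 tokens used still produces a valid score instead of a crash.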
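
The fail-fast check for required Params could be sketched like this (simplified, with hypothetical names; the real logic sits in `_init_function_params`):

```python
_NO_DEFAULT = object()  # sentinel standing in for the library's own marker

class Param:
    """Minimal stand-in for a function parameter descriptor."""
    def __init__(self, default=_NO_DEFAULT):
        self.default = default

def init_function_params(descriptors, config):
    """Raise at init time for required Params (no default) not supplied,
    instead of deferring the error silently to call time."""
    missing = [name for name, p in descriptors.items()
               if p.default is _NO_DEFAULT and name not in config]
    if missing:
        raise TypeError(f"missing required parameter(s): {', '.join(missing)}")
```

A descriptor with no default now fails loudly at construction; one with a default stays optional.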

Generator (generate-builtin-evaluators.sh)

  • Handle `return prompt` patterns (an f-string assigned to a variable and then returned) that previously produced an empty Source in the catalog
  • Decode Python escape sequences (\\n → newline) in captured conditional f-string values before inlining into prompt templates — fixes literal \\nTask: and \\nContext: in generated catalog entries
  • Skip inlining conditional sections whose value depends solely on self.* config params; removes the spurious empty Context: line from safety/tone templates
  • Create output directory for LLM_JUDGE_BASE_FILE before writing to prevent failures when --output points to a non-default location
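
The escape-sequence decoding in the second bullet could be done with the standard `unicode_escape` codec (an illustrative sketch, not the script's exact code):

```python
import codecs

# A value captured by regex from Python source carries its escapes as
# literal characters: a real backslash followed by "n", not a newline.
captured = "\\nTask: {task}\\nContext: {context}"

# Decode the escapes into real control characters before inlining the
# value into a prompt template.
decoded = codecs.decode(captured, "unicode_escape")
```

Note that `unicode_escape` assumes Latin-1 for non-ASCII bytes, which is harmless for these ASCII template fragments.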

Generated

  • Regenerate builtin_evaluators.go to reflect all template fixes

Test plan

  • python -m pytest in libs/amp-evaluation — 696 passed
  • go build ./... in agent-manager-service — clean
  • Verify instruction_following catalog entry shows Success criteria: {(...)} expression
  • Verify path_efficiency and safety/tone catalog entries no longer contain \\nTask: or \\nContext:

Summary by CodeRabbit

  • Bug Fixes

    • Fixed token and iteration efficiency evaluators to use configured defaults when constraint values are zero or negative.
    • Corrected prompt formatting for newlines and success criteria display in LLM-judge evaluators.
  • Documentation

    • Clarified that only Python expressions (not loop statements) are supported in prompt templates.
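
As a hypothetical illustration of that rule, combined with the `chr(10)` workaround from the summary (backslashes are not allowed inside f-string braces before Python 3.12):

```python
criteria = ["cite sources", "answer in JSON"]

# Allowed inside {}: any expression, including a comprehension.
# chr(10) stands in for "\n", keeping the f-string valid on Python < 3.12.
prompt = f"Success criteria:{chr(10)}{chr(10).join('- ' + c for c in criteria)}"

# Not allowed inside {}: loop *statements*, e.g.
#   f"{for c in criteria: ...}"   # SyntaxError
```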


coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

This PR hardens evaluators against invalid constraint values by adding zero-and-negative guards for iteration/token limits, improves prompt template extraction in the code generator, enforces required parameter validation during decorator initialization, refines LLM judge prompt formatting, and expands test coverage for fallback scenarios.

Changes

Cohort / File(s): Summary

  • Builtin Evaluator Sources (agent-manager-service/catalog/builtin_evaluators.go)
    Added guards in embedded Python code for the iteration_efficiency and token_efficiency evaluators to reset non-positive constraint values to configured defaults before computing scores. Modified LLM judge prompts (completeness, helpfulness, instruction_following, etc.) to fix newline escaping and update success-criteria rendering logic.
  • Evaluator Generator Script (agent-manager-service/scripts/generate-builtin-evaluators.sh)
    Enhanced _extract_prompt_template() to track intermediate f-string assignments, extract the first (rather than the last) template match, decode escape sequences in conditional blocks, skip self-referential conditionals, and create the output directory before file generation.
  • Evaluation Library, Core & Base (libs/amp-evaluation/src/amp_evaluation/evaluators/base.py, libs/amp-evaluation/src/amp_evaluation/codegen/evaluator_schema.py)
    Added required-parameter validation in _FunctionParamsMixin._init_function_params() to raise TypeError for missing required Param descriptors. Updated guide text to clarify that only Python expressions, not loop statements, are supported in curly-brace expressions.
  • Evaluation Library, Builtin Implementations (libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py, libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/llm_judge.py)
    Added guards in TokenEfficiencyEvaluator and IterationCountEvaluator.evaluate() to reset zero or negative constraint values to configured limits. Inlined success-criteria formatting logic in InstructionFollowingEvaluator.build_prompt() with newline-delimited bullet formatting and fallback text.
  • Test Coverage (libs/amp-evaluation/tests/test_evaluators_builtin_core.py, libs/amp-evaluation/tests/test_llm_as_judge.py)
    Added tests for zero-constraint fallback behavior in the token and iteration efficiency evaluators, plus tests verifying that missing required Param descriptors raise TypeError and that supplied required parameters are properly stored in config.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related issues

  • Fix generated builtin evaluator issues flagged in PR #531 #547: This PR directly addresses the issue by implementing guard clauses for non-positive max_iterations/max_tokens, fixing prompt template extraction and newline escaping, updating success-criteria formatting, and enforcing required parameter validation in the evaluation library.

Poem

🐰 A rabbit's ode to safer bounds,
Where zero guards prevent confounds,
Each template now extracted true,
Required params checked right through,
Prompts escape-free, scores rebound! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 68.75%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check: ❓ Inconclusive. The description provides a clear summary of the key fixes across SDK, generator, and generated code, plus a test plan, but it does not follow the full required template (it is missing Purpose, Goals, Approach, and other sections). Resolution: consider expanding the description with the standard sections (Purpose with issue links, Goals, Approach) for consistency with repository conventions.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed. The title clearly and specifically references fixing issues flagged in PR #531, which directly relates to the core objectives of the changeset, including evaluator guards, error handling, and template fixes.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
agent-manager-service/scripts/generate-builtin-evaluators.sh (1)

170-206: Prefer the returned AST node over matches[0].

You now collect the actual returned JoinedStr in fstrings, but Line 205 still assumes the first triple-quoted regex match is the prompt. A future helper like header = f"""...""" before return prompt will extract the wrong template even though the AST already knows which f-string is returned.
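
One way to act on this suggestion could be sketched as follows (assuming `ast.unparse` output is an acceptable template source; the variable names are hypothetical, not the script's actual code):

```python
import ast
import textwrap

# A function with a decoy f-string assigned before the one it returns.
src = textwrap.dedent('''
    def build_prompt(task):
        header = f"""HEADER"""
        prompt = f"""Evaluate: {task}"""
        return prompt
''')

func = ast.parse(src).body[0]
assigned = {}   # variable name -> JoinedStr assigned to it
returned = None
for node in ast.walk(func):
    if isinstance(node, ast.Assign) and isinstance(node.value, ast.JoinedStr):
        for target in node.targets:
            if isinstance(target, ast.Name):
                assigned[target.id] = node.value
    elif isinstance(node, ast.Return):
        if isinstance(node.value, ast.JoinedStr):
            returned = node.value                    # return f"""..."""
        elif isinstance(node.value, ast.Name):
            returned = assigned.get(node.value.id)   # return prompt

# Prefer the AST-identified f-string; fall back to regex only when None.
template = ast.unparse(returned) if returned is not None else None
```

Here the AST pins the returned `prompt` f-string, so the decoy `header` assignment can never be extracted by mistake.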

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent-manager-service/scripts/generate-builtin-evaluators.sh` around lines
170 - 206, The code currently finds returned f-strings in the AST (fstrings /
assigned_fstrings) but then ignores them and extracts the template by regex from
the raw source (fstring_pattern / matches), which can pick the wrong
triple-quoted string; instead, locate the actual JoinedStr node from fstrings
(the returned AST node from func) and reconstruct the template from that node
(e.g., by mapping its Constant/FormattedValue parts back to source slices or
joining their .s/.value representations) rather than using matches[0]; update
the logic that sets template to prefer the AST-derived fstrings[0] content and
only fall back to regex matches when no suitable AST node exists.
libs/amp-evaluation/tests/test_evaluators_builtin_core.py (1)

483-497: Consider covering -1 as well as 0.

These regressions only exercise the zero case, but the implementation guards <= 0. Parameterizing the tests with 0 and -1 would pin the negative-constraint path too.

Also applies to: 535-551
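
The parametrization the reviewer suggests could look like this (hypothetical test and helper names; the fallback helper stands in for the evaluator under test):

```python
import pytest

def effective_max_tokens(override, configured):
    # mirrors the <= 0 guard under test
    return configured if override is None or override <= 0 else override

@pytest.mark.parametrize("max_tokens_override", [0, -1])
def test_non_positive_constraint_falls_back_to_config(max_tokens_override):
    # both the zero and the negative path fall back to the configured limit
    assert effective_max_tokens(max_tokens_override, 200) == 200
```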

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/amp-evaluation/tests/test_evaluators_builtin_core.py` around lines 483 -
497, Parameterize the failing-edge test to cover both 0 and -1: update
test_zero_task_constraint_falls_back_to_config to use pytest.mark.parametrize
with max_tokens_override values (0, -1), construct
Constraints.model_construct(max_tokens=max_tokens_override) and assert the
evaluator (TokenEfficiencyEvaluator.max_tokens=200) still yields passed True and
score 1.0; apply the same parametric change to the other analogous zero-case
test in the same test module that currently only checks 0 so the
negative-constraint path (<=0) is exercised as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 00361a0b-05aa-475f-8a2e-fdef9333f845

📥 Commits

Reviewing files that changed from the base of the PR and between bff3514 and c1056d0.

📒 Files selected for processing (8)
  • agent-manager-service/catalog/builtin_evaluators.go
  • agent-manager-service/scripts/generate-builtin-evaluators.sh
  • libs/amp-evaluation/src/amp_evaluation/codegen/evaluator_schema.py
  • libs/amp-evaluation/src/amp_evaluation/evaluators/base.py
  • libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/llm_judge.py
  • libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py
  • libs/amp-evaluation/tests/test_evaluators_builtin_core.py
  • libs/amp-evaluation/tests/test_llm_as_judge.py

Comment on lines +742 to +751
        # Raise immediately for required Params (no default) that were not provided
        missing_required = [
            name
            for name, p in self._func_param_descriptors.items()
            if p.default is _NO_DEFAULT and name not in self._func_config
        ]
        if missing_required:
            raise TypeError(
                f"Evaluator function '{func.__name__}' missing required parameter(s): {', '.join(missing_required)}"
            )

⚠️ Potential issue | 🟠 Major

Preserve required function Params when cloning with with_config().

Line 742 now raises before _FunctionParamsMixin.with_config() rehydrates the existing _func_config, so any FunctionEvaluator/FunctionLLMJudge with a required function Param becomes unclonable. Even judge.with_config(criteria="x") still fails because criteria is only merged after construction.

💡 Suggested fix
-        new_instance = type(self)(self.func, name=self.name, **merged_class)
+        new_instance = type(self)(
+            self.func,
+            name=self.name,
+            **merged_class,
+            **self._func_config,
+            **func_overrides,
+        )
         new_instance.description = self.description
         new_instance.tags = list(self.tags)
         new_instance.version = self.version
         if self._aggregations:
             new_instance._aggregations = list(self._aggregations)
-        new_instance._func_config = {**self._func_config, **func_overrides}
         return new_instance
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/amp-evaluation/src/amp_evaluation/evaluators/base.py` around lines 742 -
751, The missing_required check runs before a clone via
_FunctionParamsMixin.with_config() can rehydrate/merge the existing
_func_config, causing required Params to appear missing; update the logic so
that _func_config is fully merged/rehydrated (via
_FunctionParamsMixin.with_config() or the same merge routine) before computing
missing_required from _func_param_descriptors and _func_config (affecting
FunctionEvaluator/FunctionLLMJudge cloning and the error raised referencing
func.__name__); in practice move or defer the missing_required
calculation/TypeError raise until after the config-merge step (or explicitly
perform the merge first) so judge.with_config(...) and similar clones no longer
fail.
