Fix builtin evaluator issues flagged in PR #531 #598
📝 Walkthrough

This PR hardens evaluators against invalid constraint values by adding zero-and-negative guards for iteration/token limits, improves prompt template extraction in the code generator, enforces required parameter validation during decorator initialization, refines LLM judge prompt formatting, and expands test coverage for fallback scenarios.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Actionable comments posted: 1
🧹 Nitpick comments (2)
agent-manager-service/scripts/generate-builtin-evaluators.sh (1)
170-206: Prefer the returned AST node over `matches[0]`.

You now collect the actual returned `JoinedStr` f-strings, but line 205 still assumes the first triple-quoted regex match is the prompt. A future helper like `header = f"""..."""` before `return prompt` would extract the wrong template even though the AST already knows which f-string is returned.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent-manager-service/scripts/generate-builtin-evaluators.sh` around lines 170-206: the code currently finds returned f-strings in the AST (`fstrings` / `assigned_fstrings`) but then ignores them and extracts the template by regex from the raw source (`fstring_pattern` / `matches`), which can pick the wrong triple-quoted string. Instead, locate the actual `JoinedStr` node from `fstrings` (the returned AST node from `func`) and reconstruct the template from that node (e.g., by mapping its `Constant`/`FormattedValue` parts back to source slices, or by joining their `.s`/`.value` representations) rather than using `matches[0]`. Update the logic that sets `template` to prefer the AST-derived `fstrings[0]` content and only fall back to regex matches when no suitable AST node exists.

libs/amp-evaluation/tests/test_evaluators_builtin_core.py (1)

483-497: Consider covering `-1` as well as `0`.

These regressions only exercise the zero case, but the implementation guards `<= 0`. Parameterizing the tests with `0` and `-1` would pin the negative-constraint path too.

Also applies to: 535-551

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `libs/amp-evaluation/tests/test_evaluators_builtin_core.py` around lines 483-497: parameterize the failing-edge test to cover both `0` and `-1`. Update `test_zero_task_constraint_falls_back_to_config` to use `pytest.mark.parametrize` with `max_tokens_override` values `(0, -1)`, construct `Constraints.model_construct(max_tokens=max_tokens_override)`, and assert the evaluator (`TokenEfficiencyEvaluator.max_tokens=200`) still yields `passed` True and `score` 1.0. Apply the same parametric change to the other analogous zero-case test in the same module that currently only checks `0`, so the negative-constraint path (`<= 0`) is exercised as well.
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 00361a0b-05aa-475f-8a2e-fdef9333f845
📒 Files selected for processing (8)

- agent-manager-service/catalog/builtin_evaluators.go
- agent-manager-service/scripts/generate-builtin-evaluators.sh
- libs/amp-evaluation/src/amp_evaluation/codegen/evaluator_schema.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/base.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/llm_judge.py
- libs/amp-evaluation/src/amp_evaluation/evaluators/builtin/standard.py
- libs/amp-evaluation/tests/test_evaluators_builtin_core.py
- libs/amp-evaluation/tests/test_llm_as_judge.py
```python
# Raise immediately for required Params (no default) that were not provided
missing_required = [
    name
    for name, p in self._func_param_descriptors.items()
    if p.default is _NO_DEFAULT and name not in self._func_config
]
if missing_required:
    raise TypeError(
        f"Evaluator function '{func.__name__}' missing required parameter(s): {', '.join(missing_required)}"
    )
```
Preserve required function Params when cloning with `with_config()`.

Line 742 now raises before `_FunctionParamsMixin.with_config()` rehydrates the existing `_func_config`, so any `FunctionEvaluator`/`FunctionLLMJudge` with a required function `Param` becomes unclonable. Even `judge.with_config(criteria="x")` still fails, because `criteria` is only merged after construction.
💡 Suggested fix

```diff
-        new_instance = type(self)(self.func, name=self.name, **merged_class)
+        new_instance = type(self)(
+            self.func,
+            name=self.name,
+            **merged_class,
+            **self._func_config,
+            **func_overrides,
+        )
         new_instance.description = self.description
         new_instance.tags = list(self.tags)
         new_instance.version = self.version
         if self._aggregations:
             new_instance._aggregations = list(self._aggregations)
-        new_instance._func_config = {**self._func_config, **func_overrides}
         return new_instance
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/amp-evaluation/src/amp_evaluation/evaluators/base.py` around lines 742 -
751, The missing_required check runs before a clone via
_FunctionParamsMixin.with_config() can rehydrate/merge the existing
_func_config, causing required Params to appear missing; update the logic so
that _func_config is fully merged/rehydrated (via
_FunctionParamsMixin.with_config() or the same merge routine) before computing
missing_required from _func_param_descriptors and _func_config (affecting
FunctionEvaluator/FunctionLLMJudge cloning and the error raised referencing
func.__name__); in practice move or defer the missing_required
calculation/TypeError raise until after the config-merge step (or explicitly
perform the merge first) so judge.with_config(...) and similar clones no longer
fail.
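The ordering bug behind this comment generalizes: required-parameter validation must run only after every config source has been merged. A minimal standalone sketch of the fixed pattern follows; the class and names are invented for illustration and are not the library's API.

```python
class FunctionWrapper:
    """Toy stand-in for a function evaluator with required params."""

    def __init__(self, func, required, **config):
        self.func = func
        self.required = tuple(required)
        self.config = dict(config)
        # Validate only once all provided config is in place.
        missing = [name for name in self.required if name not in self.config]
        if missing:
            raise TypeError(
                f"'{func.__name__}' missing required parameter(s): "
                f"{', '.join(missing)}"
            )

    def with_config(self, **overrides):
        # Pass the merged config through the constructor, so cloning an
        # instance that already satisfies its required params cannot fail.
        # This is the shape the suggested fix adopts: merge first, then
        # let __init__ validate.
        return type(self)(self.func, self.required, **{**self.config, **overrides})


def judge(criteria):
    return f"judged by {criteria}"


j = FunctionWrapper(judge, ["criteria"], criteria="accuracy")
j2 = j.with_config(criteria="tone")  # clone keeps required params satisfied
```

Had `with_config` constructed the clone without the merged config (validating first, merging after), the clone would raise even though the original instance was valid.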
Summary
Fixes all issues tracked in #547.
SDK (amp-evaluation)

- Guard zero/negative `task.constraints` overrides before division in `TokenEfficiencyEvaluator` and `IterationCountEvaluator`; prevents `ZeroDivisionError` when task constraints supply an invalid value
- Raise `TypeError` at init time in `_init_function_params` when a required `Param` (no default) is not provided, instead of deferring the error silently to call time
- Replace `self._format_success_criteria(task)` in `InstructionFollowingEvaluator.build_prompt` with an equivalent inline expression using `chr(10)`, to stay compatible with Python < 3.12
- Clarify `evaluator_schema.py` docs: only comprehension expressions are supported inside `{}`, not loop statements

Generator (generate-builtin-evaluators.sh)

- Handle `return prompt` (variable assigned an f-string, then returned) patterns that previously produced an empty `Source` in the catalog
- Unescape escape sequences (`\n` → newline) in captured conditional f-string values before inlining into prompt templates; fixes literal `\nTask:` and `\nContext:` in generated catalog entries
- Handle `self.*` config params; removes the spurious empty `Context:` line from safety/tone templates
- Create the directory for `LLM_JUDGE_BASE_FILE` before writing, to prevent failures when `--output` points to a non-default location

Generated

- Regenerated `builtin_evaluators.go` to reflect all template fixes

Test plan

- `python -m pytest` in `libs/amp-evaluation`: 696 passed
- `go build ./...` in `agent-manager-service`: clean
- `instruction_following` catalog entry shows the `Success criteria: {(...)}` expression
- `path_efficiency` and safety/tone catalog entries no longer contain `\nTask:` or `\nContext:`
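On the `chr(10)` point: before Python 3.12 (PEP 701), an f-string replacement field may not contain a backslash, so writing `'\n'.join(...)` inside the braces is a SyntaxError on older interpreters, while `chr(10)` works everywhere. A small illustration (the variable names are invented; this is not the evaluator's actual template):

```python
criteria = ["covers all steps", "cites sources"]

# Python >= 3.12 would also accept a backslash inside the braces:
#   f"Success criteria:\n{'\n'.join(criteria)}"
# On older versions that is a SyntaxError, so chr(10) stands in for "\n".
# (The \n outside the braces is fine on every version.)
prompt = f"Success criteria:\n{chr(10).join(criteria)}"
```

This keeps a single inline expression in the template while remaining valid syntax on every supported interpreter.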