stop rollouts on incomplete responses (no content or tools)#948
Conversation
…ns list Co-authored-by: will brown <willccbb@users.noreply.github.com>
mikasenghaas
left a comment
Why do we handle this as a stop condition and not a `vf.Error`? (This is what we currently do in `openai_chat_completions_client.py`.)
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
judge_response = await judge(prompt, completion, answer, state)
cleaned_completion = [
    {x["role"]: x["content"].split("</think>")[-1] for x in completion}
]
```
Dict comprehension creates wrong message structure for judge
High Severity
The dict comprehension {x["role"]: x["content"].split("</think>")[-1] for x in completion} creates a single dictionary with role names (e.g., "assistant", "tool") as keys and cleaned content as values, wrapped in a list. This produces a structure like [{"assistant": "...", "tool": "..."}] instead of the expected list of message dicts with "role" and "content" keys. When the judge's parse_answer tries to find assistant messages, it looks for a "role" key in each element — which doesn't exist in this dict — so it always returns None, making the judge evaluate against a None response. The brackets likely need to be moved so the list comprehension wraps each message individually.
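A minimal sketch of the difference (the sample message here is invented for illustration, not taken from the PR):

```python
# One assistant message in standard chat format (illustrative example).
completion = [
    {"role": "assistant", "content": "<think>reasoning</think>final answer"},
]

# Buggy version: a single dict comprehension keyed by role, wrapped in a
# list, so the judge sees [{"assistant": "..."}] with no "role" key.
buggy = [{x["role"]: x["content"].split("</think>")[-1] for x in completion}]

# Likely intended version: a list comprehension that cleans each message
# individually while preserving the {"role": ..., "content": ...} shape.
fixed = [
    {"role": x["role"], "content": x["content"].split("</think>")[-1]}
    for x in completion
]

print(buggy)  # [{'assistant': 'final answer'}]
print(fixed)  # [{'role': 'assistant', 'content': 'final answer'}]
```

With the buggy shape, any code that looks up `msg["role"]` on each element fails to find assistant messages; the fixed shape keeps the chat-message contract intact.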
```python
async def judge_reward_func(judge, prompt, completion, answer, state) -> float:
    judge_response = await judge(prompt, completion, answer, state)
    cleaned_completion = [
        {x["role"]: x["content"].split("</think>")[-1] for x in completion}
```
Split on None content causes AttributeError
Medium Severity
AssistantMessage.content is typed as MessageContent | None and defaults to None for tool-call-only messages. The expression x["content"].split("</think>") will raise an AttributeError when content is None. In a multi-turn tool-use environment like wiki_search, assistant messages with only tool_calls and no content are common. The error is silently caught by _call_individual_reward_func, returning a reward of 0.0, which silently corrupts training signal.
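One possible guard, sketched with an invented message list (the `or ""` fallback is an assumption about how the PR might fix this; skipping tool-call-only messages entirely would be another option):

```python
# Illustrative completion including a tool-call-only assistant turn,
# whose "content" is None.
completion = [
    {"role": "assistant", "content": None},              # tool call only
    {"role": "tool", "content": "search results"},
    {"role": "assistant", "content": "<think>hmm</think>done"},
]

# Coalesce None to "" before splitting so .split never runs on None.
cleaned_completion = [
    {"role": x["role"], "content": (x["content"] or "").split("</think>")[-1]}
    for x in completion
]

print(cleaned_completion)
# [{'role': 'assistant', 'content': ''},
#  {'role': 'tool', 'content': 'search results'},
#  {'role': 'assistant', 'content': 'done'}]
```

Without such a guard, the `AttributeError` is swallowed by `_call_individual_reward_func` and surfaces only as a silent 0.0 reward.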


Description
Type of Change
Testing
Ran `uv run pytest` locally.
Checklist
Additional Notes
Note
Medium Risk
Changes core rollout termination/truncation behavior in `MultiTurnEnv`, which can alter evaluation/training outcomes when providers emit empty responses.

Overview
Multi-turn rollouts now terminate when the model returns an "incomplete" response (no message content and no tool calls). This is implemented as a new `@vf.stop` condition (`has_incomplete_response`) and by marking such trajectory steps as truncated in `MultiTurnEnv.add_model_response`.

Docs are updated to mention incomplete-response detection as a default stop condition, the `wiki-search` environment strips `<think>` content before LLM judging, and dataset builder fields in `Environment` are explicitly `cast()` for type safety; `.gitignore` also ignores `packages/tasksets` and `packages/harnesses`.

Written by Cursor Bugbot for commit 92a8da8. This will update automatically on new commits.