Feat/bad syntax cancels eval#289

Open
ibraheem-abe wants to merge 21 commits into main from
feat/bad-syntax-cancels-eval

@ibraheem-abe

  • Adds two early termination checks during screener evaluations:
    • Threshold impossible — if the agent can't mathematically pass the threshold even if it passes every remaining run, skip the rest
    • Syntax penalty — if the agent produces a patch with invalid syntax (error code 1040), cancel all remaining runs and fail the agent out of screening
  • Platform sends pass_threshold to the validator in the request-evaluation response
  • Validator tracks results as runs complete using asyncio.wait(FIRST_COMPLETED) instead of asyncio.gather
  • Cancelled runs are marked as skipped (new terminal status) via a new skip-evaluation-run endpoint
  • evaluations_hydrated SQL view updated to treat skipped as terminal and detect syntax penalties
  • handle_evaluation_if_finished transitions agents to failed_screening on syntax penalty instead of re-queuing
  • Sandbox cleanup added to tear down Docker resources
  • Only affects screeners; validators are unaffected (pass_threshold is None for validators)
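
As an illustration of the "threshold impossible" check described above, here is a minimal sketch. It assumes pass_threshold is the fraction of runs that must pass (the description does not state its units), and the function name and parameters are hypothetical rather than taken from the PR's code.

```python
from typing import Optional


def threshold_impossible(
    passed: int, completed: int, total: int, pass_threshold: Optional[float]
) -> bool:
    """Return True when even a perfect finish cannot reach pass_threshold.

    Assumption: pass_threshold is a fraction in [0, 1] of runs that must pass.
    """
    if pass_threshold is None:
        # Validators receive no threshold, so they never terminate early.
        return False
    remaining = total - completed
    # Best case: every run that has not finished yet passes.
    best_possible = (passed + remaining) / total
    return best_possible < pass_threshold
```

For example, with 10 runs and a threshold of 0.8, three failures out of the first three completed runs cap the best possible score at 0.7, so the remaining runs could be skipped.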

Files changed

  • validator/main.py: orchestrator updated, RunOutcome dataclass, cancellation logic
  • api/endpoints/validator.py: pass_threshold population, skip endpoint, penalty handling
  • api/endpoints/validator_models.py: new skip request/response models, pass_threshold field
  • api/src/backend/postgres_schema.sql: updated view + skipped enum migration
  • models/evaluation_run.py: skipped status added
  • evaluator/sandbox/sandbox_manager.py: cleanup_sandbox() public method
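
The orchestrator change in validator/main.py could look roughly like the sketch below. It is not the PR's implementation: the RunOutcome fields, the `evaluate` function, and the shape of its arguments are all assumptions made for illustration. Only asyncio.wait with FIRST_COMPLETED, error code 1040, and the screener-only behaviour (pass_threshold is None for validators) come from the description.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Dict, List, Optional, Tuple

SYNTAX_ERROR_CODE = 1040  # invalid-patch-syntax code named in the description


@dataclass
class RunOutcome:
    # Hypothetical fields; the real dataclass may differ.
    run_id: str
    passed: bool
    error_code: Optional[int] = None


async def evaluate(
    runs: Dict[str, Awaitable[RunOutcome]], pass_threshold: Optional[float]
) -> Tuple[List[RunOutcome], List[str]]:
    """Run all evaluations, stopping early for screeners when warranted.

    Returns the completed outcomes and the ids of the runs that were cancelled
    (the ones a caller would report as skipped).
    """
    total = len(runs)
    tasks = {asyncio.ensure_future(coro): run_id for run_id, coro in runs.items()}
    pending = set(tasks)
    outcomes: List[RunOutcome] = []
    stop = False
    while pending and not stop:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        # A real orchestrator would also handle exceptions raised by a run.
        outcomes.extend(task.result() for task in done)
        if pass_threshold is None:
            continue  # validators never terminate early
        if any(o.error_code == SYNTAX_ERROR_CODE for o in outcomes):
            stop = True  # syntax penalty
        else:
            passed = sum(o.passed for o in outcomes)
            # Threshold impossible even if every pending run passes.
            stop = (passed + len(pending)) / total < pass_threshold
    skipped = [tasks[task] for task in pending]
    for task in pending:
        task.cancel()
    # Wait for the cancellations to settle before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return outcomes, skipped
```

Unlike asyncio.gather, which only returns once every run has finished, this loop inspects results as each run completes, which is what makes cancelling the remaining runs possible.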
