feat: two-stage evaluation pipeline with prefilter (#14) #21
Open
don-petry wants to merge 4 commits into Joaolfelicio:main from
Conversation
Adds a lightweight Stage 1 pre-filter to BaseEvaluator that classifies
interactions as rule-bearing or not before running full extraction.
Non-rule interactions with confidence > 0.8 skip Stage 2, targeting
50%+ reduction in LLM calls.
Key design decisions:
- Prefilter is implemented in BaseEvaluator, so all evaluators get it
- _parse_bool() fixes the bool("false")==True bug from PR Joaolfelicio#20
- Fail-open: prefilter errors pass through to full eval
- --skip-prefilter CLI flag to disable Stage 1
- Dashboard shows prefilter skip rate metrics
Closes Joaolfelicio#14
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Implements a two-stage evaluation pipeline by adding a lightweight prefilter step in BaseEvaluator to decide whether to skip full rule extraction, plus CLI/dashboard plumbing to expose prefilter behavior.
Changes:
- Added PrefilterResult/PrefilterMetrics models and integrated Stage 1 prefiltering into BaseEvaluator.evaluate_interaction().
- Added a --skip-prefilter flag and surfaced prefilter skipped/total stats in the dashboard footer.
- Added a new prefilter prompt template and a new test suite covering prefilter logic and _parse_bool().
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| context_scribe/evaluator/base_evaluator.py | Adds prefilter stage, metrics tracking, and _parse_bool() for LLM JSON booleans. |
| context_scribe/models/evaluator_models.py | Introduces prefilter dataclasses and metrics helpers. |
| context_scribe/evaluator/prefilter_template.md | Provides the Stage 1 prompt template for rule-bearing classification. |
| context_scribe/evaluator/__init__.py | Updates get_evaluator() to accept and forward **kwargs. |
| context_scribe/main.py | Wires --skip-prefilter through to evaluator creation and displays prefilter stats in the dashboard. |
| tests/test_prefilter.py | Adds unit + integration-style tests for prefilter and boolean parsing. |
All evaluator subclasses now accept and forward **kwargs to BaseEvaluator.__init__(), allowing skip_prefilter and future params to propagate correctly via get_evaluator(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
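The forwarding pattern this commit describes might look like the following sketch. `BaseEvaluator`, `get_evaluator()`, and the `skip_prefilter` parameter come from the PR; the `ClaudeEvaluator` subclass and the factory's registry dict are hypothetical stand-ins.

```python
class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs):
        self.skip_prefilter = skip_prefilter

class ClaudeEvaluator(BaseEvaluator):  # hypothetical subclass
    def __init__(self, model: str = "default", **kwargs):
        super().__init__(**kwargs)  # forward skip_prefilter and future params
        self.model = model

def get_evaluator(name: str, **kwargs) -> BaseEvaluator:
    # The factory forwards **kwargs unchanged, so adding a new
    # BaseEvaluator parameter requires no changes here.
    evaluators = {"claude": ClaudeEvaluator}
    return evaluators[name](**kwargs)
```

The payoff is that CLI plumbing like --skip-prefilter only has to touch the call site and BaseEvaluator, not every subclass signature.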
Collaborator (Author)
@Joaolfelicio - What do you think about this enhancement idea?
- test_daemons.py: patch get_evaluator instead of removed direct imports
- test_gemini_cli_llm.py: use skip_prefilter=True in CLI flag test (prefilter adds a second subprocess.run call)
- main.py: guard prefilter metrics sync with isinstance check to prevent TypeError when evaluator is mocked

Found via full test suite run with Python 3.12 + mcp installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
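The isinstance guard mentioned in this commit could be sketched as below. `PrefilterMetrics` is named in the PR; its fields and the `sync_prefilter_stats` helper are assumptions for illustration.

```python
class PrefilterMetrics:
    def __init__(self, skipped: int = 0, total: int = 0):
        self.skipped = skipped
        self.total = total

def sync_prefilter_stats(evaluator, dashboard: dict) -> None:
    metrics = getattr(evaluator, "prefilter_metrics", None)
    # A mocked evaluator (MagicMock) returns a MagicMock for any
    # attribute, which breaks downstream arithmetic/formatting; only
    # sync when we hold a real metrics object.
    if isinstance(metrics, PrefilterMetrics):
        dashboard["prefilter_skipped"] = metrics.skipped
        dashboard["prefilter_total"] = metrics.total
```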
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Address two Copilot review comments:
1. _parse_bool() now returns None for unrecognised/null values instead of defaulting to False. _pre_evaluate() treats None as pass-through to full evaluation, ensuring malformed LLM output doesn't silently skip rule extraction (fail-open behaviour).
2. Template loading in BaseEvaluator.__init__() now uses importlib.resources.files() instead of __file__-relative Path reads, so templates are accessible in packaged installs (wheel/zip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
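The packaged-install-safe template loading from point 2 could look like this sketch. The package and file names come from the PR's file list; the exact call site is an assumption. `importlib.resources.files()` requires Python 3.9+.

```python
from importlib import resources

def load_prefilter_template() -> str:
    # files() resolves resources inside wheel/zip installs, where
    # __file__-relative paths may not exist on the filesystem.
    return (
        resources.files("context_scribe.evaluator")
        .joinpath("prefilter_template.md")
        .read_text(encoding="utf-8")
    )
```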
Why
Every user interaction is currently sent through full LLM evaluation, even routine code-assistance messages that contain no persistent preferences or rules. This wastes LLM calls and adds unnecessary latency and cost. A lightweight classification step can filter out the majority of non-rule interactions cheaply, targeting a 50%+ reduction in full evaluation calls.
Summary
- Adds a lightweight Stage 1 prefilter to BaseEvaluator that classifies interactions as rule-bearing or not before running full extraction
- Adds a --skip-prefilter CLI flag to disable Stage 1
- Fixes the bool("false") == True bug from PR #20 (feat: two-stage evaluation pipeline with prefilter) with a proper _parse_bool() function
- All evaluator subclasses accept and forward **kwargs

Replaces PR #20 (rebuilt cleanly on upstream/main using the BaseEvaluator pattern).
Closes #14
Testing evidence (live, Python 3.12 + mcp, Claude CLI)
Issues found and fixed during testing
- test_daemons.py: patched evaluator classes no longer in main.py → fixed to patch get_evaluator
- test_gemini_cli_llm.py: prefilter adds a 2nd subprocess call → fixed with skip_prefilter=True
- generate_layout: MagicMock metrics caused TypeError → fixed with isinstance guard

🤖 Generated with Claude Code