
feat: two-stage evaluation pipeline with prefilter (#14)#21

Open
don-petry wants to merge 4 commits into Joaolfelicio:main from don-petry:feat/prefilter-pipeline

Conversation

@don-petry
Collaborator

@don-petry don-petry commented Apr 4, 2026

Why

Every user interaction is currently sent through full LLM evaluation, even routine code-assistance messages that contain no persistent preferences or rules. This wastes LLM calls and adds unnecessary latency and cost. A lightweight classification step can filter out the majority of non-rule interactions cheaply, targeting a 50%+ reduction in full evaluation calls.

Summary

  • Adds a lightweight Stage 1 pre-filter in BaseEvaluator that classifies interactions as rule-bearing or not before running full extraction
  • Non-rule interactions with confidence > 0.8 skip Stage 2 entirely
  • --skip-prefilter CLI flag to disable Stage 1
  • Dashboard shows prefilter skip rate metrics
  • Fixes the bool("false") == True bug from PR #20 with a proper _parse_bool() function
  • All evaluator subclasses updated to accept and forward **kwargs
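
The two-stage flow above can be sketched as follows. Method and class names (`evaluate_interaction`, `_pre_evaluate`, `PrefilterResult`) follow the PR description, but the bodies are illustrative stand-ins, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PrefilterResult:
    is_rule_bearing: Optional[bool]  # None => classifier could not decide
    confidence: float

SKIP_THRESHOLD = 0.8  # matches the "confidence > 0.8" rule in the summary

def _pre_evaluate(interaction: str) -> PrefilterResult:
    # Stand-in for the Stage 1 LLM classifier call (one cheap LLM call).
    rule_words = ("always", "never", "use", "prefer")
    hit = any(w in interaction.lower() for w in rule_words)
    return PrefilterResult(is_rule_bearing=hit, confidence=0.9)

def _full_evaluate(interaction: str) -> dict:
    # Stand-in for the Stage 2 full rule extraction call.
    return {"rule": interaction}

def evaluate_interaction(interaction: str, skip_prefilter: bool = False):
    if not skip_prefilter:
        result = _pre_evaluate(interaction)
        # Only skip when the classifier is confidently negative.
        if result.is_rule_bearing is False and result.confidence > SKIP_THRESHOLD:
            return None  # non-rule interaction: Stage 2 never runs
    return _full_evaluate(interaction)  # Stage 2: full extraction
```

Note the skip condition checks `is False`, not falsiness, so an undecided prefilter (`None`) still falls through to full evaluation.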

Replaces PR #20 (rebuilt cleanly on upstream/main using the BaseEvaluator pattern).

Closes #14

Testing evidence (live, Python 3.12 + mcp, Claude CLI)

| Check | Result |
| --- | --- |
| Full test suite: 84/84 passed | ✅ PASS |
| Live: non-rule interaction skipped by prefilter (5.7s, 1 LLM call) | ✅ PASS |
| Live: rule interaction passed through, extracted correctly (13.2s, 2 LLM calls) | ✅ PASS |
| Live: skip_prefilter=True bypasses Stage 1 (metrics untouched) | ✅ PASS |
| Live: prefilter skip rate = 50% (1 skipped / 2 total) | ✅ PASS |
| Live: extracted rule scope=GLOBAL, content includes snake_case directive | ✅ PASS |

Live test details

Test 1 (non-rule "help me debug"): prefilter skipped=1, result=None, 5.7s
Test 2 (rule "use snake_case"):    prefilter passed=1, scope=GLOBAL, 13.2s
Test 3 (skip_prefilter=True):      metrics 0/0, full eval only, 12.4s

Issues found and fixed during testing

  • test_daemons.py: patched evaluator classes no longer in main.py → fixed to patch get_evaluator
  • test_gemini_cli_llm.py: prefilter adds 2nd subprocess call → fixed with skip_prefilter=True
  • Dashboard generate_layout: MagicMock metrics caused TypeError → fixed with isinstance guard

🤖 Generated with Claude Code

Adds a lightweight Stage 1 pre-filter to BaseEvaluator that classifies
interactions as rule-bearing or not before running full extraction.
Non-rule interactions with confidence > 0.8 skip Stage 2, targeting
50%+ reduction in LLM calls.

Key design decisions:
- Prefilter is implemented in BaseEvaluator, so all evaluators get it
- _parse_bool() fixes the bool("false")==True bug from PR Joaolfelicio#20
- Fail-open: prefilter errors pass through to full eval
- --skip-prefilter CLI flag to disable Stage 1
- Dashboard shows prefilter skip rate metrics

Closes Joaolfelicio#14

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:41

Copilot AI left a comment


Pull request overview

Implements a two-stage evaluation pipeline by adding a lightweight prefilter step in BaseEvaluator to decide whether to skip full rule extraction, plus CLI/dashboard plumbing to expose prefilter behavior.

Changes:

  • Added PrefilterResult / PrefilterMetrics models and integrated Stage 1 prefiltering into BaseEvaluator.evaluate_interaction().
  • Added --skip-prefilter flag and surfaced prefilter skipped/total stats in the dashboard footer.
  • Added a new prefilter prompt template and a new test suite covering prefilter logic and _parse_bool().

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| context_scribe/evaluator/base_evaluator.py | Adds prefilter stage, metrics tracking, and _parse_bool() for LLM JSON booleans. |
| context_scribe/models/evaluator_models.py | Introduces prefilter dataclasses and metrics helpers. |
| context_scribe/evaluator/prefilter_template.md | Provides the Stage 1 prompt template for rule-bearing classification. |
| context_scribe/evaluator/__init__.py | Updates get_evaluator() to accept and forward **kwargs. |
| context_scribe/main.py | Wires --skip-prefilter through to evaluator creation and displays prefilter stats in the dashboard. |
| tests/test_prefilter.py | Adds unit + integration-style tests for prefilter and boolean parsing. |


All evaluator subclasses now accept and forward **kwargs to
BaseEvaluator.__init__(), allowing skip_prefilter and future
params to propagate correctly via get_evaluator().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
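
The forwarding pattern this commit describes can be sketched as follows. The subclass name `ClaudeCliEvaluator` and the registry contents are hypothetical, chosen for illustration; only `BaseEvaluator`, `get_evaluator()`, and `skip_prefilter` come from the PR itself:

```python
class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs):
        self.skip_prefilter = skip_prefilter

class ClaudeCliEvaluator(BaseEvaluator):  # hypothetical subclass
    def __init__(self, model: str = "default", **kwargs):
        # Forward unconsumed kwargs so skip_prefilter (and any future
        # BaseEvaluator params) propagate without touching each subclass.
        super().__init__(**kwargs)
        self.model = model

def get_evaluator(name: str, **kwargs) -> BaseEvaluator:
    # Factory: kwargs pass through unchanged to the chosen subclass.
    registry = {"claude-cli": ClaudeCliEvaluator}
    return registry[name](**kwargs)

ev = get_evaluator("claude-cli", skip_prefilter=True)
```

The point of `**kwargs` here is that adding a new BaseEvaluator parameter later requires no change to any subclass signature.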
@don-petry
Collaborator Author

@Joaolfelicio - What do you think about this enhancement idea?

- test_daemons.py: patch get_evaluator instead of removed direct imports
- test_gemini_cli_llm.py: use skip_prefilter=True in CLI flag test
  (prefilter adds a second subprocess.run call)
- main.py: guard prefilter metrics sync with isinstance check to
  prevent TypeError when evaluator is mocked

Found via full test suite run with Python 3.12 + mcp installed.
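
The isinstance guard mentioned above can be sketched like this. The helper name `sync_prefilter_stats` and the `PrefilterMetrics` fields are assumptions based on the PR description:

```python
from unittest.mock import MagicMock

class PrefilterMetrics:
    def __init__(self, skipped: int = 0, total: int = 0):
        self.skipped = skipped
        self.total = total

def sync_prefilter_stats(evaluator):
    """Read prefilter metrics only when they are the real metrics object.

    Under test, evaluator may be a MagicMock whose .prefilter_metrics
    attribute is itself a MagicMock; doing arithmetic/formatting on it
    raises TypeError, so we guard with isinstance and skip the update.
    """
    metrics = getattr(evaluator, "prefilter_metrics", None)
    if not isinstance(metrics, PrefilterMetrics):
        return None  # mocked or absent: leave the dashboard untouched
    rate = metrics.skipped / metrics.total if metrics.total else 0.0
    return f"prefilter skip rate: {rate:.0%} ({metrics.skipped}/{metrics.total})"
```

With the 1-skipped / 2-total numbers from the live tests, this renders "prefilter skip rate: 50% (1/2)"; a MagicMock evaluator short-circuits to None.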

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 17:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Address two Copilot review comments:

1. _parse_bool() now returns None for unrecognised/null values instead
   of defaulting to False. _pre_evaluate() treats None as pass-through
   to full evaluation, ensuring malformed LLM output doesn't silently
   skip rule extraction (fail-open behaviour).

2. Template loading in BaseEvaluator.__init__() now uses
   importlib.resources.files() instead of __file__-relative Path reads,
   so templates are accessible in packaged installs (wheel/zip).
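
A sketch of the packaging-safe template load described in point 2. The package and filename below follow the reviewed file list (`context_scribe/evaluator/prefilter_template.md`) and are assumptions, not verified paths:

```python
from importlib import resources

def load_prefilter_template() -> str:
    """Load the Stage 1 prompt template in a packaging-safe way.

    importlib.resources.files() resolves package data even when the
    package is installed as a wheel/zip, unlike Path(__file__).parent
    reads, which assume an on-disk source tree.
    """
    return (
        resources.files("context_scribe.evaluator")  # assumed package path
        .joinpath("prefilter_template.md")
        .read_text(encoding="utf-8")
    )
```

The returned object from `files()` is a Traversable, so the same `joinpath(...).read_text(...)` chain works whether the data lives on disk or inside an archive.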

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
