feat: two-stage evaluation pipeline with prefilter (#14) #21
don-petry wants to merge 4 commits into Joaolfelicio:main
Conversation
Adds a lightweight Stage 1 pre-filter to BaseEvaluator that classifies
interactions as rule-bearing or not before running full extraction.
Non-rule interactions with confidence > 0.8 skip Stage 2, targeting a
50%+ reduction in LLM calls.
Key design decisions:
- Prefilter is implemented in BaseEvaluator, so all evaluators get it
- _parse_bool() fixes the bool("false")==True bug from PR Joaolfelicio#20
- Fail-open: prefilter errors pass through to full eval
- --skip-prefilter CLI flag to disable Stage 1
- Dashboard shows prefilter skip rate metrics
Closes Joaolfelicio#14
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
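For orientation, a minimal sketch of how the two stages are intended to compose, based only on the description above. Apart from the names `evaluate_interaction()`, `_prefilter_interaction()`, `PrefilterResult`, and `should_skip_full_eval` mentioned in this PR, everything here (the stub bodies, the `_run_full_evaluation()` name, the empty-list return, the omitted metrics bookkeeping) is illustrative rather than the PR's actual code:

```python
from dataclasses import dataclass


@dataclass
class PrefilterResult:
    is_rule_bearing: bool
    confidence: float

    @property
    def should_skip_full_eval(self) -> bool:
        # Skip Stage 2 only when the interaction is confidently non-rule-bearing.
        return (not self.is_rule_bearing) and self.confidence > 0.8


class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs):
        self.skip_prefilter = skip_prefilter

    def _prefilter_interaction(self, interaction) -> PrefilterResult:
        raise NotImplementedError  # Stage 1: one cheap LLM classification call

    def _run_full_evaluation(self, interaction) -> list:
        raise NotImplementedError  # Stage 2: full rule extraction

    def evaluate_interaction(self, interaction) -> list:
        if not self.skip_prefilter:
            try:
                result = self._prefilter_interaction(interaction)
            except Exception:
                result = None  # fail-open: prefilter errors fall through to Stage 2
            if result is not None and result.should_skip_full_eval:
                return []  # confidently non-rule-bearing, skip full extraction
        return self._run_full_evaluation(interaction)
```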
Pull request overview
Implements a two-stage evaluation pipeline by adding a lightweight prefilter step in BaseEvaluator to decide whether to skip full rule extraction, plus CLI/dashboard plumbing to expose prefilter behavior.
Changes:
- Added `PrefilterResult`/`PrefilterMetrics` models and integrated Stage 1 prefiltering into `BaseEvaluator.evaluate_interaction()`.
- Added a `--skip-prefilter` flag and surfaced prefilter skipped/total stats in the dashboard footer.
- Added a new prefilter prompt template and a new test suite covering prefilter logic and `_parse_bool()`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `context_scribe/evaluator/base_evaluator.py` | Adds prefilter stage, metrics tracking, and `_parse_bool()` for LLM JSON booleans. |
| `context_scribe/models/evaluator_models.py` | Introduces prefilter dataclasses and metrics helpers. |
| `context_scribe/evaluator/prefilter_template.md` | Provides the Stage 1 prompt template for rule-bearing classification. |
| `context_scribe/evaluator/__init__.py` | Updates `get_evaluator()` to accept and forward `**kwargs`. |
| `context_scribe/main.py` | Wires `--skip-prefilter` through to evaluator creation and displays prefilter stats in the dashboard. |
| `tests/test_prefilter.py` | Adds unit and integration-style tests for prefilter and boolean parsing. |
All evaluator subclasses now accept and forward **kwargs to BaseEvaluator.__init__(), allowing skip_prefilter and future params to propagate correctly via get_evaluator().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Joaolfelicio - What do you think about this enhancement idea?
- `test_daemons.py`: patch `get_evaluator` instead of removed direct imports
- `test_gemini_cli_llm.py`: use `skip_prefilter=True` in CLI flag test (prefilter adds a second `subprocess.run` call)
- `main.py`: guard prefilter metrics sync with `isinstance` check to prevent `TypeError` when evaluator is mocked

Found via full test suite run with Python 3.12 + mcp installed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Address two Copilot review comments:

1. `_parse_bool()` now returns None for unrecognised/null values instead of defaulting to False. `_pre_evaluate()` treats None as pass-through to full evaluation, ensuring malformed LLM output doesn't silently skip rule extraction (fail-open behaviour).
2. Template loading in `BaseEvaluator.__init__()` now uses `importlib.resources.files()` instead of `__file__`-relative Path reads, so templates are accessible in packaged installs (wheel/zip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
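A minimal sketch of the packaging-safe template load described in point 2, assuming the template ships as package data under `context_scribe.evaluator`; the constructor signature and attribute names here are illustrative, not the PR's actual code:

```python
from importlib.resources import files


class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs) -> None:
        self.skip_prefilter = skip_prefilter
        # Resolve the template relative to the installed package rather than
        # __file__, so it also loads from a wheel or zip install.
        template_ref = files("context_scribe.evaluator") / "prefilter_template.md"
        self.prefilter_template = template_ref.read_text(encoding="utf-8")
```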
don-petry left a comment
Automated review — NEEDS HUMAN REVIEW
Risk: MEDIUM
Reviewed commit: 987bb4693058055ba6e91a2a8e5f75b9b2c9b10a
Council vote: security=MEDIUM · correctness=MEDIUM · maintainability=MEDIUM
Summary
The two-stage prefilter pipeline is well-structured, following existing BaseEvaluator patterns, and CI is fully green. The security lens found no new vulnerabilities; format-string and **kwargs patterns are pre-existing. The correctness lens confirms the implementation matches issue #14's acceptance criteria with all prior Copilot findings resolved. The maintainability lens flags four minor issues — missing docstrings, a magic confidence threshold, test setup duplication, and a test/production inconsistency in template loading — plus an unanswered question from collaborator don-petry that gates merge under the repo's review policy.
Linked issue analysis
Issue #14 (two-stage evaluation pipeline with prefilter): The diff implements _prefilter_interaction() in BaseEvaluator with fail-open semantics, PrefilterResult/PrefilterMetrics models, a --skip-prefilter CLI flag, dashboard prefilter stats, a prefilter prompt template, and a 244-line test suite covering boundary conditions, error paths, and the skip flag. All Copilot first-cycle findings (constructor kwarg propagation, _parse_bool fail-open, importlib.resources template loading) were addressed in follow-up commits.
Findings
Minor
- [minor] correctness + maintainability · `tests/test_prefilter.py:120-121` — Tests load templates via `Path(__file__).parent.parent` rather than `importlib.resources.files()`, inconsistent with the production path changed in this PR. The `importlib` code path is not exercised by any test; a missing package-data entry would only surface at install time. (Both correctness and maintainability lenses flagged this.)
- [minor] maintainability · `tests/test_prefilter.py` — ~15 lines of setup code (patching `__init__`, reading template files) are duplicated across 4 test functions. A pytest fixture would reduce coupling and clarify intent for future maintainers.
- [minor] correctness · `context_scribe/evaluator/base_evaluator.py` — The prefilter JSON extraction regex uses `[^}]*`, which silently fails if the LLM wraps output in a parent JSON object with nested braces (see the sketch after this list). Fail-open behavior makes this non-breaking in practice.
- [minor] maintainability · `context_scribe/models/evaluator_models.py:27` — Confidence threshold `0.8` in `PrefilterResult.should_skip_full_eval` is a magic number with no named constant or config surface for per-evaluator or per-deployment tuning.
- [minor] maintainability · `context_scribe/models/evaluator_models.py` — `CONTRIBUTING.md` requires Google-style docstrings for public classes/methods. `PrefilterResult.should_skip_full_eval` (property) and `PrefilterMetrics.record_result` are missing docstrings needed for auto-generated docs and for authors extending `BaseEvaluator`.
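To make the regex finding concrete, a small illustration of why a `[^}]*`-style pattern truncates nested JSON, and one more robust alternative using the standard-library JSON decoder; the sample payload and field names are hypothetical, not the PR's actual prompt output:

```python
import json
import re

# Hypothetical LLM reply that wraps the prefilter verdict in a parent object.
reply = 'Sure! {"result": {"is_rule_bearing": false, "confidence": 0.9}}'

# A "first { up to the first }" pattern stops at the innermost closing brace,
# so the captured text is unbalanced and json.loads() on it would fail.
flat = re.search(r"\{[^}]*\}", reply).group(0)
print(flat)  # {"result": {"is_rule_bearing": false, "confidence": 0.9}  <- truncated

# More robust: start at the first "{" and let the JSON parser find the matching end.
start = reply.index("{")
obj, _end = json.JSONDecoder().raw_decode(reply, start)
print(obj["result"]["confidence"])  # 0.9
```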
Info
- [info] security · `context_scribe/evaluator/base_evaluator.py:56` — Prefilter template uses `str.format()` with user-controlled `interaction.content`; curly-brace patterns in content could cause `KeyError` (DoS of that evaluation) or leak `internal_signature` (see the sketch after this list). Pre-existing pattern from the full-evaluation path; not newly introduced.
- [info] security · `context_scribe/evaluator/base_evaluator.py:18` — Prefilter is correctly fail-open: timeouts, parse errors, and unrecognized `_parse_bool` values all pass through to full evaluation. Appropriate since the prefilter is a cost optimisation, not a security gate.
- [info] security · `context_scribe/evaluator/__init__.py:16` — Open `**kwargs` in `get_evaluator()` allows arbitrary kwargs to propagate to `BaseEvaluator.__init__()`. No injection risk at present (all call sites are internal CLI via Click).
- [info] correctness · `context_scribe/models/evaluator_models.py:45` — `PrefilterMetrics.record_result()` increments both `prefilter_errors` and `prefilter_passed` when the result is None, conflating genuine passes with error fall-throughs in the dashboard display. Skip-rate calculation remains correct.
- [info] correctness + maintainability — Collaborator don-petry left an unanswered question for @Joaolfelicio (2026-04-05) about an enhancement idea. Not a change request, but the repo's review policy gates on all reviewer questions being resolved. (Both correctness and maintainability lenses flagged this.)
- [info] maintainability · `context_scribe/main.py:263` — `PrefilterMetrics.prefilter_errors` is tracked but never surfaced in the dashboard or logged at INFO level; silent failures would be invisible in production.
- [info] maintainability · `context_scribe/main.py:265` — The guard `isinstance(getattr(metrics, 'prefilter_passed', None), int)` is unusual; `isinstance(metrics, PrefilterMetrics)` would be more idiomatic and self-documenting.
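To make the format-string note concrete, a contrived illustration of the failure mode the reviewers describe; the template text and variable names are made up for the example and are not taken from `base_evaluator.py`:

```python
INTERNAL_SIGNATURE = "sig-12345"
template = "Signature: {internal_signature}\nInteraction:\n{content}"

hostile_content = "what does {internal_signature} or {undefined_field} mean?"

# Safe: a single format() pass where the content is passed as a *value*;
# braces inside the value are not re-interpreted as format fields.
safe = template.format(internal_signature=INTERNAL_SIGNATURE, content=hostile_content)

# Risky: formatting a string that already embeds the user content re-parses its braces.
embedded = template.replace("{content}", hostile_content)
try:
    embedded.format(internal_signature=INTERNAL_SIGNATURE)
except KeyError as exc:
    print("prefilter call dies with KeyError:", exc)  # {undefined_field} has no value
# Without the unknown field, {internal_signature} inside the content would expand,
# leaking the signature into the rendered prompt.
```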
CI status
All CI checks are green (pip-audit, full test suite). No failing or pending checks.
Reviewed automatically by the don-petry PR-review council (security: opus 4.6 · correctness: sonnet 4.6 · maintainability: sonnet 4.6 · synthesis: sonnet 4.6). The marker on line 1 lets the agent detect new commits and re-review. Reply with @don-petry if you need a human.
Why
Every user interaction is currently sent through full LLM evaluation, even routine code-assistance messages that contain no persistent preferences or rules. This wastes LLM calls and adds unnecessary latency and cost. A lightweight classification step can filter out the majority of non-rule interactions cheaply, targeting a 50%+ reduction in full evaluation calls.
Summary
- Lightweight Stage 1 pre-filter in `BaseEvaluator` that classifies interactions as rule-bearing or not before running full extraction
- `--skip-prefilter` CLI flag to disable Stage 1
- Fixes the `bool("false") == True` bug from PR #20 with a proper `_parse_bool()` function (sketched below)
- Evaluator creation forwards `**kwargs` (e.g. `skip_prefilter`) via `get_evaluator()`

Replaces PR #20 (rebuilt cleanly on upstream/main using the BaseEvaluator pattern).
Closes #14
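The `bool("false") == True` bug arises because `bool()` on any non-empty string is truthy. A minimal sketch of the parser behaviour described above; the accepted token sets are illustrative, and only the None-for-unrecognised, fail-open contract comes from this PR:

```python
from typing import Optional


def _parse_bool(value) -> Optional[bool]:
    """Parse a boolean out of LLM JSON output without the bool("false") pitfall.

    Returns None for unrecognised or null values so the caller can fall back
    to full evaluation (fail-open) instead of silently skipping extraction.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in {"true", "yes", "1"}:
            return True
        if lowered in {"false", "no", "0"}:
            return False
    return None  # unknown/None -> caller treats as "run full evaluation"


assert bool("false") is True          # the original bug: any non-empty string is truthy
assert _parse_bool("false") is False  # the fix
assert _parse_bool(None) is None      # fail-open pass-through
```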
Testing evidence (live, Python 3.12 + mcp, Claude CLI)
Live test details
Issues found and fixed during testing
- `test_daemons.py`: patched evaluator classes no longer in `main.py` → fixed to patch `get_evaluator`
- `test_gemini_cli_llm.py`: prefilter adds a 2nd subprocess call → fixed with `skip_prefilter=True`
- `generate_layout`: MagicMock metrics caused TypeError → fixed with `isinstance` guard (sketched below)

🤖 Generated with Claude Code
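A rough sketch of the isinstance guard mentioned in the last item and in the review's `main.py:265` note; the metric field names and the `sync_prefilter_stats()` helper are guesses based on the review comments, not the PR's actual dashboard code:

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class PrefilterMetrics:
    prefilter_skipped: int = 0
    prefilter_passed: int = 0
    prefilter_errors: int = 0


def sync_prefilter_stats(metrics, dashboard: dict) -> None:
    # Guard with an isinstance check so a MagicMock evaluator (as in the test
    # suite) doesn't blow up the arithmetic below with a TypeError.
    if not isinstance(metrics, PrefilterMetrics):
        return
    dashboard["prefilter_skipped"] = metrics.prefilter_skipped
    dashboard["prefilter_total"] = (
        metrics.prefilter_skipped + metrics.prefilter_passed + metrics.prefilter_errors
    )


stats: dict = {}
sync_prefilter_stats(MagicMock(), stats)  # silently skipped, no TypeError
sync_prefilter_stats(PrefilterMetrics(prefilter_skipped=5, prefilter_passed=3), stats)
print(stats)  # {'prefilter_skipped': 5, 'prefilter_total': 8}
```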