feat: two-stage evaluation pipeline with prefilter (#14) #29
Conversation
Adds a lightweight Stage 1 pre-filter to BaseEvaluator that classifies
interactions as rule-bearing or not before running full extraction.
Non-rule interactions with confidence > 0.8 skip Stage 2, targeting
50%+ reduction in LLM calls.
Key design decisions:
- Prefilter is implemented in BaseEvaluator, so all evaluators get it
- _parse_bool() fixes the bool("false")==True bug from PR #20
- Fail-open: prefilter errors pass through to full eval
- --skip-prefilter CLI flag to disable Stage 1
- Dashboard shows prefilter skip rate metrics
Closes #14
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
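For orientation, here is a minimal sketch of how the two-stage dispatch could look. It is not the merged code: the `PrefilterResult` fields and the `_evaluate_full`, `_call_prefilter_llm`, and `_parse_prefilter_response` hooks are illustrative assumptions; only `evaluate_interaction`, `_pre_evaluate`, `skip_prefilter`, and the 0.8 threshold come from this PR's description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefilterResult:
    """Assumed shape; the real dataclass lives in evaluator_models.py."""
    is_rule: Optional[bool]      # None means "could not decide" (fail-open)
    confidence: float = 0.0


class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False):
        self.skip_prefilter = skip_prefilter

    def evaluate_interaction(self, content: str):
        """Stage 1 classification, then Stage 2 full extraction if warranted."""
        if not self.skip_prefilter:
            result = self._pre_evaluate(content)
            # Skip Stage 2 only on a confident "not rule-bearing" verdict;
            # None (errors or unparseable output) falls through to full eval.
            if result.is_rule is False and result.confidence > 0.8:
                return None
        return self._evaluate_full(content)

    def _pre_evaluate(self, content: str) -> PrefilterResult:
        try:
            raw = self._call_prefilter_llm(content)        # hypothetical hook
            return self._parse_prefilter_response(raw)     # hypothetical hook
        except Exception:
            # Fail-open: any prefilter error passes through to full evaluation.
            return PrefilterResult(is_rule=None)

    # Stand-ins for the real LLM plumbing implemented by subclasses.
    def _call_prefilter_llm(self, content: str) -> str:
        raise NotImplementedError

    def _parse_prefilter_response(self, raw: str) -> PrefilterResult:
        raise NotImplementedError

    def _evaluate_full(self, content: str):
        raise NotImplementedError
```

The design point is that only a confident negative classification short-circuits Stage 2; ambiguous or failed prefilter runs fall through, which is what the fail-open bullet above refers to.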
All evaluator subclasses now accept and forward **kwargs to BaseEvaluator.__init__(), allowing skip_prefilter and future params to propagate correctly via get_evaluator(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- test_daemons.py: patch get_evaluator instead of removed direct imports
- test_gemini_cli_llm.py: use skip_prefilter=True in CLI flag test (prefilter adds a second subprocess.run call)
- main.py: guard prefilter metrics sync with isinstance check to prevent TypeError when evaluator is mocked

Found via full test suite run with Python 3.12 + mcp installed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address two Copilot review comments:
1. _parse_bool() now returns None for unrecognised/null values instead of defaulting to False. _pre_evaluate() treats None as pass-through to full evaluation, ensuring malformed LLM output doesn't silently skip rule extraction (fail-open behaviour).
2. Template loading in BaseEvaluator.__init__() now uses importlib.resources.files() instead of __file__-relative Path reads, so templates are accessible in packaged installs (wheel/zip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
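A minimal sketch of the fail-open parsing this commit describes; the accepted token sets below are assumptions for illustration, not necessarily the exact values the merged `_parse_bool()` recognises.

```python
from typing import Optional


def _parse_bool(value) -> Optional[bool]:
    """Parse a truthy/falsy token from LLM output.

    Returns None for anything unrecognised so the caller falls back to
    full evaluation instead of silently treating the answer as False.
    """
    if isinstance(value, bool):
        return value
    if value is None:
        return None
    token = str(value).strip().lower()
    if token in {"true", "yes", "1"}:
        return True
    if token in {"false", "no", "0"}:
        return False
    return None  # unrecognised -> pass through to full evaluation
```

This also avoids the `bool("false") == True` trap mentioned above, since strings are never coerced directly.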
Pull request overview
Introduces a two-stage evaluation pipeline to reduce unnecessary full LLM rule-extraction calls by adding a lightweight prefilter step, with CLI control and dashboard metrics.
Changes:
- Adds Stage 1 prefilter classification (with fail-open behavior) ahead of full rule extraction in `BaseEvaluator`.
- Adds prefilter metrics (passed/skipped/errors) and exposes skip-rate on the dashboard; adds `--skip-prefilter` CLI flag.
- Updates evaluator factory/constructors and adds tests covering prefilter behavior and `_parse_bool()` parsing.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_prefilter.py | New unit/integration tests for prefilter result/metrics, _parse_bool, and Stage 1/2 behavior. |
| tests/test_gemini_cli_llm.py | Adjusts CLI invocation test to bypass prefilter so call-count assertions remain valid. |
| context_scribe/models/evaluator_models.py | Adds PrefilterResult and PrefilterMetrics dataclasses used by evaluators and dashboard. |
| context_scribe/main.py | Wires --skip-prefilter, forwards flag into evaluator construction, and displays prefilter stats in UI. |
| context_scribe/evaluator/prefilter_template.md | Adds the Stage 1 prompt template for rule/non-rule classification. |
| context_scribe/evaluator/gemini_cli_llm.py | Updates evaluator constructor to accept/forward **kwargs into BaseEvaluator. |
| context_scribe/evaluator/copilot_llm.py | Updates evaluator constructor to accept/forward **kwargs into BaseEvaluator. |
| context_scribe/evaluator/claude_llm.py | Updates evaluator constructor to accept/forward **kwargs into BaseEvaluator. |
| context_scribe/evaluator/base_evaluator.py | Implements prefilter stage, _parse_bool, metrics recording, and resource-based template loading. |
| context_scribe/evaluator/__init__.py | Changes get_evaluator to accept **kwargs and forward them to evaluator constructors. |
```diff
 def get_evaluator(name: str, **kwargs) -> BaseEvaluator:
     """Return an evaluator instance by name. Raises ValueError for unknown names."""
     cls = EVALUATOR_REGISTRY.get(name)
     if cls is None:
         raise ValueError(
             f"Unknown evaluator '{name}'. "
             f"Available: {', '.join(sorted(EVALUATOR_REGISTRY))}"
         )
-    return cls()
+    return cls(**kwargs)
```
get_evaluator(..., **kwargs) now forwards keyword args to the evaluator class constructor, but at least AnthropicEvaluator still defines __init__(model=...) and will raise TypeError: got an unexpected keyword argument 'skip_prefilter'. Either update all registered evaluators to accept/forward **kwargs (and pass skip_prefilter into BaseEvaluator.__init__), or make get_evaluator filter/only pass supported kwargs per evaluator.
Resolved — AnthropicEvaluator.__init__ now accepts **kwargs and forwards to BaseEvaluator, matching the other evaluators. See ca40c1e.
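For illustration, the fix pattern this resolution describes looks roughly like the following; the `model` parameter and its default are placeholders, and the import path follows the file layout shown in the review table above.

```python
from typing import Optional

from context_scribe.evaluator.base_evaluator import BaseEvaluator


class AnthropicEvaluator(BaseEvaluator):
    def __init__(self, model: Optional[str] = None, **kwargs):
        # Forward skip_prefilter (and any future params) to BaseEvaluator so
        # get_evaluator("anthropic", skip_prefilter=...) no longer raises.
        super().__init__(**kwargs)
        self.model = model
```

With every registered evaluator shaped this way, `get_evaluator(name, skip_prefilter=...)` can pass the flag through unconditionally.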
```diff
 if evaluator_name == "auto":
     evaluator_name = _detect_evaluator(tool)
-evaluator = get_evaluator(evaluator_name)
+evaluator = get_evaluator(evaluator_name, skip_prefilter=skip_prefilter)
```
get_evaluator(evaluator_name, skip_prefilter=skip_prefilter) passes skip_prefilter even when it is False, which will crash for any evaluator that hasn’t been updated to accept this kwarg (e.g., the Anthropic SDK evaluator). Consider only passing the kwarg when skip_prefilter is True, and/or ensure all evaluators accept/forward **kwargs into BaseEvaluator.
Suggested change:
```diff
-evaluator = get_evaluator(evaluator_name, skip_prefilter=skip_prefilter)
+if skip_prefilter:
+    evaluator = get_evaluator(evaluator_name, skip_prefilter=True)
+else:
+    evaluator = get_evaluator(evaluator_name)
```
Resolved — all evaluators now accept **kwargs, so passing skip_prefilter=False no longer crashes. See ca40c1e.
```diff
+def _load_package_template(filename: str) -> str:
+    """Load a template file from this package using importlib.resources."""
+    return (
+        importlib.resources.files("context_scribe.evaluator")
+        .joinpath(filename)
+        .read_text(encoding="utf-8")
+    )
+
+
 class BaseEvaluator(ABC):
-    def __init__(self):
-        # Load the prompt template
-        template_path = Path(__file__).parent / "prompt_template.md"
-        with open(template_path, "r", encoding="utf-8") as f:
-            self.prompt_template = f.read()
+    def __init__(self, skip_prefilter: bool = False):
+        self.skip_prefilter = skip_prefilter
+        self.metrics = PrefilterMetrics()
+        # Load the prompt templates via importlib.resources (works in
+        # packaged installs such as wheels / zip imports).
+        self.prompt_template = _load_package_template("prompt_template.md")
+        self._prefilter_template = _load_package_template("prefilter_template.md")
```
importlib.resources will only find prompt_template.md / prefilter_template.md if those files are included as package data in the built wheel/sdist. The repo’s pyproject.toml doesn’t currently declare package data, so a packaged install may raise FileNotFoundError here. Add setuptools package-data configuration (e.g., include *.md under context_scribe/evaluator) to ensure these templates ship with the distribution.
Resolved — added [tool.setuptools.package-data] for *.md files under context_scribe.evaluator in pyproject.toml. See ca40c1e.
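As a sketch of the kind of stanza this resolution refers to — the exact patterns depend on the project layout and are an assumption here, not a quote from the merged pyproject.toml:

```toml
# Illustrative setuptools package-data stanza for shipping the .md templates.
[tool.setuptools.package-data]
"context_scribe.evaluator" = ["*.md"]
```

This tells setuptools to bundle the Markdown templates into wheels/sdists so `importlib.resources.files()` can find them in pip-installed environments.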
```python
total_processed = self.prefilter_passed + self.prefilter_skipped
skip_rate = (self.prefilter_skipped / total_processed * 100) if total_processed > 0 else 0
stats.add_row(
    Text(f" System: Active", style="green"),
    Text(f"Prefilter: {self.prefilter_skipped} skipped / {total_processed} total ({skip_rate:.0f}%)", style="dim"),
    Text(f"Total Rules Extracted: {self.update_count} ", style="bold green")
```
When total_processed == 0, skip_rate is set to 0 (an int), but later formatted with {skip_rate:.0f}. Python’s int formatting doesn’t support the f specifier, so this will raise a ValueError the first time the dashboard renders before any interactions are processed. Use 0.0 (float) for the empty case to keep formatting consistent.
Resolved — changed else 0 to else 0.0 so the :.0f format specifier works correctly. See ca40c1e.
don-petry left a comment
Automated review — NEEDS HUMAN REVIEW
Risk: MEDIUM
Reviewed commit: d4ec61460a32c01a2a1480cf5a31641617708da4
Council vote: security=MEDIUM · correctness=MEDIUM · maintainability=MEDIUM
Summary
The PR introduces a well-structured two-stage prefilter pipeline that meaningfully addresses issue #14, with good test coverage for the new prefilter logic and a sensible fail-open design. However, two blocking issues prevent merging: AnthropicEvaluator was not updated with **kwargs support (causing TypeError for --evaluator anthropic users), and the branch has unresolved merge conflicts with main. Additionally, pyproject.toml is missing package-data declarations needed for importlib.resources to work in wheel installs, and a type mismatch in skip_rate causes a crash on startup under Python 3.10.
Linked issue analysis
The diff substantively addresses issue #14's goal of reducing unnecessary full LLM rule-extraction calls: the prefilter stage correctly short-circuits evaluation for non-rule content, the --skip-prefilter flag provides user control, and dashboard metrics expose the skip rate. Core acceptance criteria appear met pending resolution of the packaging and evaluator-update issues above.
Findings
Major
- [major] [correctness] `context_scribe/evaluator/anthropic_llm.py:18` — `AnthropicEvaluator.__init__` lacks `**kwargs` and does not forward `skip_prefilter` to `BaseEvaluator`. Any call to `get_evaluator('anthropic', skip_prefilter=...)` raises `TypeError`. Three other evaluators (Claude, Copilot, Gemini) were correctly updated; Anthropic was overlooked.
- [major] [maintainability] — PR has unresolved merge conflicts with `main` (mergeStateStatus: DIRTY, mergeable: CONFLICTING). The branch must be rebased before this PR can land.
Minor
- [minor] [correctness] `context_scribe/main.py:111` — `skip_rate` is assigned the integer literal `0` when `total_processed == 0`. Formatting with `{skip_rate:.0f}` raises `ValueError` on Python 3.10 (project minimum). Fix: use `else 0.0`.
- [minor] [correctness] `context_scribe/evaluator/base_evaluator.py:51` — `_load_package_template` uses `importlib.resources.files('context_scribe.evaluator')`, but `pyproject.toml` has no `[tool.setuptools.package-data]` entry for `*.md` files. Built wheels/sdists will fail with `FileNotFoundError` for pip-installed packages — the PR description explicitly claims packaged-install compatibility.
- [minor] [maintainability] `tests/test_prefilter.py` — Integration test setup (bypassing `__init__`, manually loading templates via `Path`) is copy-pasted verbatim across 5 test functions. A shared pytest fixture would reduce duplication.
- [minor] [maintainability] `tests/test_prefilter.py` — Tests load templates via `Path(__file__)` — the old approach this PR replaces with `importlib.resources`. If templates move within the package, tests will break differently from production code.
- [minor] [maintainability] `context_scribe/models/evaluator_models.py` — `PrefilterMetrics.record_result(None)` increments both `prefilter_errors` and `prefilter_passed`, conflating error pass-throughs with genuine passes. `skip_rate` silently includes error events in its denominator (a sketch follows the findings below).
- [minor] [maintainability] `context_scribe/main.py` — `DashboardData` holds `prefilter_passed` and `prefilter_skipped` as plain integers manually synced from `evaluator.metrics`. A direct reference to `PrefilterMetrics` would eliminate the two-source-of-truth pattern.
Info
- [info] [security] `context_scribe/evaluator/prefilter_template.md:17` — User content is embedded directly into the prefilter template via `{content}` with only triple-quote delimiters, creating a prompt-injection surface. Impact is limited to optimization bypass (not a security bypass) due to the fail-open design.
- [info] [security] — No SAST tools (CodeQL, Semgrep, Bandit, gitleaks) configured in CI; only `pip-audit` is present. Consider adding static analysis for Python code.
- [info] [correctness] `tests/test_prefilter.py` — Tests use `Path(__file__)` rather than `_load_package_template()`, so the `importlib.resources` loading path is not exercised by any test.
- [info] [correctness] `context_scribe/main.py:265` — `PrefilterMetrics.prefilter_errors` is tracked but never surfaced in the dashboard. Silent errors may go unnoticed in production.
- [info] [maintainability] `context_scribe/evaluator/base_evaluator.py` — New public method `evaluate_interaction` has no docstring; `_pre_evaluate` and `_parse_bool` use plain docstrings. `CONTRIBUTING.md` requires Google-style docstrings for public classes and methods.
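To make the `record_result(None)` and `skip_rate` concerns above concrete, here is one hedged sketch of metrics that keep error pass-throughs separate from genuine passes. The counter names follow the review; the `record_result` signature and the `skip_rate` property are assumptions, not the code under review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefilterMetrics:
    prefilter_passed: int = 0    # classified rule-bearing, sent to Stage 2
    prefilter_skipped: int = 0   # confidently non-rule, Stage 2 skipped
    prefilter_errors: int = 0    # prefilter failed, passed through fail-open

    def record_result(self, is_rule: Optional[bool]) -> None:
        # Count error pass-throughs separately instead of folding them
        # into prefilter_passed.
        if is_rule is None:
            self.prefilter_errors += 1
        elif is_rule:
            self.prefilter_passed += 1
        else:
            self.prefilter_skipped += 1

    @property
    def skip_rate(self) -> float:
        # Only genuine classifications appear in the denominator.
        classified = self.prefilter_passed + self.prefilter_skipped
        return (self.prefilter_skipped / classified * 100) if classified else 0.0
```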
CI status
All CI checks are green (pip-audit passes, tests pass). The branch cannot be merged due to unresolved merge conflicts (mergeStateStatus: DIRTY).
Reviewed automatically by the don-petry PR-review council (security: opus 4.6 · correctness: sonnet 4.6 · maintainability: sonnet 4.6 · synthesis: sonnet 4.6). The marker on line 1 lets the agent detect new commits and re-review. Reply with @don-petry if you need a human.
@copilot fix the problems mentioned in the comment above
- Resolve merge conflicts between prefilter and multi-tool features
- Add **kwargs to AnthropicEvaluator to forward skip_prefilter
- Fix skip_rate int/float type mismatch causing ValueError on Python 3.10
- Add package-data config for .md templates in wheel/sdist builds
- Fix PrefilterMetrics.record_result(None) double-counting errors as passes
- Surface prefilter errors count on dashboard footer
- Extract shared pytest fixture in test_prefilter.py
- Update fake_run_daemon signature in test_multi_tool.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All review findings have been addressed in ca40c1e. CI is green. Ready for re-review.
Automated review — APPROVED
Risk: MEDIUM
Summary
All seven blocking and minor issues flagged in the prior review (d4ec614) are addressed in ca40c1e: AnthropicEvaluator `**kwargs` forwarding is fixed, merge conflicts are resolved (mergeStateStatus: CLEAN), `skip_rate` uses `0.0`, `pyproject.toml` has package-data, `PrefilterMetrics.record_result(None)` no longer double-counts, prefilter errors are surfaced in the dashboard footer, and the shared pytest fixture is extracted. CI is fully green and the branch is mergeable. Remaining items are all info/minor: prompt injection via `.format()` is mitigated by fail-open exception handling, and the two-source-of-truth dashboard pattern is cosmetic.
Findings
Info
CI status
CI is fully green and the branch is mergeable (mergeStateStatus: CLEAN).
Reviewed by the don-petry PR-review cascade (triage: haiku 4.5 → deep: sonnet 4.6 → audit: opus 4.6). Reply with @don-petry if you need a human.
Why
Every user interaction is currently sent through full LLM evaluation, even routine code-assistance messages that contain no persistent preferences or rules. This wastes LLM calls and adds unnecessary latency and cost. A lightweight classification step can filter out the majority of non-rule interactions cheaply, targeting a 50%+ reduction in full evaluation calls.
Summary
- Lightweight Stage 1 prefilter in `BaseEvaluator` that classifies interactions as rule-bearing or not before running full extraction
- `--skip-prefilter` CLI flag to disable Stage 1
- `_parse_bool()` returns `Optional[bool]` for fail-open behavior (unrecognized values pass through to full eval)
- Templates loaded via `importlib.resources` for packaged install compatibility
- All evaluator subclasses accept and forward `**kwargs`

Replaces #21 (fork archived, branch rebased on current main including PR #19).
Closes #14
Testing evidence
- `_parse_bool` fail-open: unknown values return None (pass-through)
- Template loading via `importlib.resources`

Review history
All 5 Copilot review comments from #21 were addressed and resolved. This PR carries the same code plus the merge resolution.
🤖 Generated with Claude Code