feat: two-stage evaluation pipeline with prefilter (#14) #21
don-petry wants to merge 4 commits into Joaolfelicio:main
Conversation
Adds a lightweight Stage 1 pre-filter to BaseEvaluator that classifies
interactions as rule-bearing or not before running full extraction.
Non-rule interactions with confidence > 0.8 skip Stage 2, targeting a
50%+ reduction in LLM calls.
Key design decisions:
- Prefilter is implemented in BaseEvaluator, so all evaluators get it
- _parse_bool() fixes the bool("false")==True bug from PR Joaolfelicio#20
- Fail-open: prefilter errors pass through to full eval
- --skip-prefilter CLI flag to disable Stage 1
- Dashboard shows prefilter skip rate metrics
Closes Joaolfelicio#14
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
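For orientation, a minimal sketch of how the two stages are intended to compose, based only on the description above. Apart from the names `evaluate_interaction()`, `_prefilter_interaction()`, `PrefilterResult`, and `should_skip_full_eval` mentioned in this PR, everything here (the stub bodies, the `_run_full_evaluation()` name, the empty-list return, the omitted metrics bookkeeping) is illustrative rather than the PR's actual code:

```python
from dataclasses import dataclass


@dataclass
class PrefilterResult:
    is_rule_bearing: bool
    confidence: float

    @property
    def should_skip_full_eval(self) -> bool:
        # Skip Stage 2 only when the interaction is confidently non-rule-bearing.
        return (not self.is_rule_bearing) and self.confidence > 0.8


class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs):
        self.skip_prefilter = skip_prefilter

    def _prefilter_interaction(self, interaction) -> PrefilterResult:
        raise NotImplementedError  # Stage 1: one cheap LLM classification call

    def _run_full_evaluation(self, interaction) -> list:
        raise NotImplementedError  # Stage 2: full rule extraction

    def evaluate_interaction(self, interaction) -> list:
        if not self.skip_prefilter:
            try:
                result = self._prefilter_interaction(interaction)
            except Exception:
                result = None  # fail-open: prefilter errors fall through to Stage 2
            if result is not None and result.should_skip_full_eval:
                return []  # confidently non-rule-bearing, skip full extraction
        return self._run_full_evaluation(interaction)
```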
Pull request overview
Implements a two-stage evaluation pipeline by adding a lightweight prefilter step in BaseEvaluator to decide whether to skip full rule extraction, plus CLI/dashboard plumbing to expose prefilter behavior.
Changes:
- Added `PrefilterResult`/`PrefilterMetrics` models and integrated Stage 1 prefiltering into `BaseEvaluator.evaluate_interaction()`.
- Added a `--skip-prefilter` flag and surfaced prefilter skipped/total stats in the dashboard footer.
- Added a new prefilter prompt template and a new test suite covering prefilter logic and `_parse_bool()`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `context_scribe/evaluator/base_evaluator.py` | Adds prefilter stage, metrics tracking, and `_parse_bool()` for LLM JSON booleans. |
| `context_scribe/models/evaluator_models.py` | Introduces prefilter dataclasses and metrics helpers. |
| `context_scribe/evaluator/prefilter_template.md` | Provides the Stage 1 prompt template for rule-bearing classification. |
| `context_scribe/evaluator/__init__.py` | Updates `get_evaluator()` to accept and forward `**kwargs`. |
| `context_scribe/main.py` | Wires `--skip-prefilter` through to evaluator creation and displays prefilter stats in the dashboard. |
| `tests/test_prefilter.py` | Adds unit and integration-style tests for prefilter and boolean parsing. |
All evaluator subclasses now accept and forward **kwargs to BaseEvaluator.__init__(), allowing skip_prefilter and future params to propagate correctly via get_evaluator().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Joaolfelicio - What do you think about this enhancement idea?
- `test_daemons.py`: patch `get_evaluator` instead of removed direct imports
- `test_gemini_cli_llm.py`: use `skip_prefilter=True` in CLI flag test (prefilter adds a second `subprocess.run` call)
- `main.py`: guard prefilter metrics sync with `isinstance` check to prevent `TypeError` when evaluator is mocked

Found via full test suite run with Python 3.12 + mcp installed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Address two Copilot review comments:

1. `_parse_bool()` now returns None for unrecognised/null values instead of defaulting to False. `_pre_evaluate()` treats None as pass-through to full evaluation, ensuring malformed LLM output doesn't silently skip rule extraction (fail-open behaviour).
2. Template loading in `BaseEvaluator.__init__()` now uses `importlib.resources.files()` instead of `__file__`-relative Path reads, so templates are accessible in packaged installs (wheel/zip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
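A minimal sketch of the packaging-safe template load described in point 2, assuming the template ships as package data under `context_scribe.evaluator`; the constructor signature and attribute names here are illustrative, not the PR's actual code:

```python
from importlib.resources import files


class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs) -> None:
        self.skip_prefilter = skip_prefilter
        # Resolve the template relative to the installed package rather than
        # __file__, so it also loads from a wheel or zip install.
        template_ref = files("context_scribe.evaluator") / "prefilter_template.md"
        self.prefilter_template = template_ref.read_text(encoding="utf-8")
```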
don-petry left a comment
Automated review — NEEDS HUMAN REVIEW
Risk: MEDIUM
Reviewed commit: 987bb4693058055ba6e91a2a8e5f75b9b2c9b10a
Council vote: security=MEDIUM · correctness=MEDIUM · maintainability=MEDIUM
Summary
The two-stage prefilter pipeline is well-structured, following existing BaseEvaluator patterns, and CI is fully green. The security lens found no new vulnerabilities; format-string and **kwargs patterns are pre-existing. The correctness lens confirms the implementation matches issue #14's acceptance criteria with all prior Copilot findings resolved. The maintainability lens flags four minor issues — missing docstrings, a magic confidence threshold, test setup duplication, and a test/production inconsistency in template loading — plus an unanswered question from collaborator don-petry that gates merge under the repo's review policy.
Linked issue analysis
Issue #14 (two-stage evaluation pipeline with prefilter): The diff implements _prefilter_interaction() in BaseEvaluator with fail-open semantics, PrefilterResult/PrefilterMetrics models, a --skip-prefilter CLI flag, dashboard prefilter stats, a prefilter prompt template, and a 244-line test suite covering boundary conditions, error paths, and the skip flag. All Copilot first-cycle findings (constructor kwarg propagation, _parse_bool fail-open, importlib.resources template loading) were addressed in follow-up commits.
Findings
Minor
- [minor] correctness + maintainability · `tests/test_prefilter.py:120-121` — Tests load templates via `Path(__file__).parent.parent` rather than `importlib.resources.files()`, inconsistent with the production path changed in this PR. The `importlib` code path is not exercised by any test; a missing package-data entry would only surface at install time. (Both correctness and maintainability lenses flagged this.)
- [minor] maintainability · `tests/test_prefilter.py` — ~15 lines of setup code (patching `__init__`, reading template files) are duplicated across 4 test functions. A pytest fixture would reduce coupling and clarify intent for future maintainers.
- [minor] correctness · `context_scribe/evaluator/base_evaluator.py` — The prefilter JSON extraction regex uses `[^}]*`, which silently fails if the LLM wraps output in a parent JSON object with nested braces (see the sketch after this list). Fail-open behavior makes this non-breaking in practice.
- [minor] maintainability · `context_scribe/models/evaluator_models.py:27` — Confidence threshold `0.8` in `PrefilterResult.should_skip_full_eval` is a magic number with no named constant or config surface for per-evaluator or per-deployment tuning.
- [minor] maintainability · `context_scribe/models/evaluator_models.py` — `CONTRIBUTING.md` requires Google-style docstrings for public classes/methods. `PrefilterResult.should_skip_full_eval` (property) and `PrefilterMetrics.record_result` are missing docstrings needed for auto-generated docs and for authors extending `BaseEvaluator`.
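To make the regex finding concrete, a small illustration of why a `[^}]*`-style pattern truncates nested JSON, and one more robust alternative using the standard-library JSON decoder; the sample payload and field names are hypothetical, not the PR's actual prompt output:

```python
import json
import re

# Hypothetical LLM reply that wraps the prefilter verdict in a parent object.
reply = 'Sure! {"result": {"is_rule_bearing": false, "confidence": 0.9}}'

# A "first { up to the first }" pattern stops at the innermost closing brace,
# so the captured text is unbalanced and json.loads() on it would fail.
flat = re.search(r"\{[^}]*\}", reply).group(0)
print(flat)  # {"result": {"is_rule_bearing": false, "confidence": 0.9}  <- truncated

# More robust: start at the first "{" and let the JSON parser find the matching end.
start = reply.index("{")
obj, _end = json.JSONDecoder().raw_decode(reply, start)
print(obj["result"]["confidence"])  # 0.9
```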
Info
- [info] security · `context_scribe/evaluator/base_evaluator.py:56` — Prefilter template uses `str.format()` with user-controlled `interaction.content`; curly-brace patterns in content could cause `KeyError` (DoS of that evaluation) or leak `internal_signature` (see the sketch after this list). Pre-existing pattern from the full-evaluation path; not newly introduced.
- [info] security · `context_scribe/evaluator/base_evaluator.py:18` — Prefilter is correctly fail-open: timeouts, parse errors, and unrecognized `_parse_bool` values all pass through to full evaluation. Appropriate since the prefilter is a cost optimisation, not a security gate.
- [info] security · `context_scribe/evaluator/__init__.py:16` — Open `**kwargs` in `get_evaluator()` allows arbitrary kwargs to propagate to `BaseEvaluator.__init__()`. No injection risk at present (all call sites are internal CLI via Click).
- [info] correctness · `context_scribe/models/evaluator_models.py:45` — `PrefilterMetrics.record_result()` increments both `prefilter_errors` and `prefilter_passed` when the result is None, conflating genuine passes with error fall-throughs in the dashboard display. Skip-rate calculation remains correct.
- [info] correctness + maintainability — Collaborator don-petry left an unanswered question for @Joaolfelicio (2026-04-05) about an enhancement idea. Not a change request, but the repo's review policy gates on all reviewer questions being resolved. (Both correctness and maintainability lenses flagged this.)
- [info] maintainability · `context_scribe/main.py:263` — `PrefilterMetrics.prefilter_errors` is tracked but never surfaced in the dashboard or logged at INFO level; silent failures would be invisible in production.
- [info] maintainability · `context_scribe/main.py:265` — The guard `isinstance(getattr(metrics, 'prefilter_passed', None), int)` is unusual; `isinstance(metrics, PrefilterMetrics)` would be more idiomatic and self-documenting.
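To make the format-string note concrete, a contrived illustration of the failure mode the reviewers describe; the template text and variable names are made up for the example and are not taken from `base_evaluator.py`:

```python
INTERNAL_SIGNATURE = "sig-12345"
template = "Signature: {internal_signature}\nInteraction:\n{content}"

hostile_content = "what does {internal_signature} or {undefined_field} mean?"

# Safe: a single format() pass where the content is passed as a *value*;
# braces inside the value are not re-interpreted as format fields.
safe = template.format(internal_signature=INTERNAL_SIGNATURE, content=hostile_content)

# Risky: formatting a string that already embeds the user content re-parses its braces.
embedded = template.replace("{content}", hostile_content)
try:
    embedded.format(internal_signature=INTERNAL_SIGNATURE)
except KeyError as exc:
    print("prefilter call dies with KeyError:", exc)  # {undefined_field} has no value
# Without the unknown field, {internal_signature} inside the content would expand,
# leaking the signature into the rendered prompt.
```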
CI status
All CI checks are green (pip-audit, full test suite). No failing or pending checks.
Reviewed automatically by the don-petry PR-review council (security: opus 4.6 · correctness: sonnet 4.6 · maintainability: sonnet 4.6 · synthesis: sonnet 4.6). The marker on line 1 lets the agent detect new commits and re-review. Reply with @don-petry if you need a human.
Why
Every user interaction is currently sent through full LLM evaluation, even routine code-assistance messages that contain no persistent preferences or rules. This wastes LLM calls and adds unnecessary latency and cost. A lightweight classification step can filter out the majority of non-rule interactions cheaply, targeting a 50%+ reduction in full evaluation calls.
Summary
- Lightweight Stage 1 pre-filter in `BaseEvaluator` that classifies interactions as rule-bearing or not before running full extraction
- `--skip-prefilter` CLI flag to disable Stage 1
- Fixes the `bool("false") == True` bug from PR #20 with a proper `_parse_bool()` function (sketched below)
- Evaluator creation forwards `**kwargs` (e.g. `skip_prefilter`) via `get_evaluator()`

Replaces PR #20 (rebuilt cleanly on upstream/main using the BaseEvaluator pattern).
Closes #14
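The `bool("false") == True` bug arises because `bool()` on any non-empty string is truthy. A minimal sketch of the parser behaviour described above; the accepted token sets are illustrative, and only the None-for-unrecognised, fail-open contract comes from this PR:

```python
from typing import Optional


def _parse_bool(value) -> Optional[bool]:
    """Parse a boolean out of LLM JSON output without the bool("false") pitfall.

    Returns None for unrecognised or null values so the caller can fall back
    to full evaluation (fail-open) instead of silently skipping extraction.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in {"true", "yes", "1"}:
            return True
        if lowered in {"false", "no", "0"}:
            return False
    return None  # unknown/None -> caller treats as "run full evaluation"


assert bool("false") is True          # the original bug: any non-empty string is truthy
assert _parse_bool("false") is False  # the fix
assert _parse_bool(None) is None      # fail-open pass-through
```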
Testing evidence (live, Python 3.12 + mcp, Claude CLI)
Live test details
Issues found and fixed during testing
- `test_daemons.py`: patched evaluator classes no longer in `main.py` → fixed to patch `get_evaluator`
- `test_gemini_cli_llm.py`: prefilter adds a 2nd subprocess call → fixed with `skip_prefilter=True`
- `generate_layout`: MagicMock metrics caused TypeError → fixed with `isinstance` guard (sketched below)

🤖 Generated with Claude Code
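A rough sketch of the isinstance guard mentioned in the last item and in the review's `main.py:265` note; the metric field names and the `sync_prefilter_stats()` helper are guesses based on the review comments, not the PR's actual dashboard code:

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class PrefilterMetrics:
    prefilter_skipped: int = 0
    prefilter_passed: int = 0
    prefilter_errors: int = 0


def sync_prefilter_stats(metrics, dashboard: dict) -> None:
    # Guard with an isinstance check so a MagicMock evaluator (as in the test
    # suite) doesn't blow up the arithmetic below with a TypeError.
    if not isinstance(metrics, PrefilterMetrics):
        return
    dashboard["prefilter_skipped"] = metrics.prefilter_skipped
    dashboard["prefilter_total"] = (
        metrics.prefilter_skipped + metrics.prefilter_passed + metrics.prefilter_errors
    )


stats: dict = {}
sync_prefilter_stats(MagicMock(), stats)  # silently skipped, no TypeError
sync_prefilter_stats(PrefilterMetrics(prefilter_skipped=5, prefilter_passed=3), stats)
print(stats)  # {'prefilter_skipped': 5, 'prefilter_total': 8}
```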