
feat: two-stage evaluation pipeline with prefilter (#14) #21

Closed
don-petry wants to merge 4 commits into Joaolfelicio:main from don-petry:feat/prefilter-pipeline

Conversation

@don-petry
Collaborator

@don-petry don-petry commented Apr 4, 2026

Why

Every user interaction is currently sent through full LLM evaluation, even routine code-assistance messages that contain no persistent preferences or rules. This wastes LLM calls and adds unnecessary latency and cost. A lightweight classification step can filter out the majority of non-rule interactions cheaply, targeting a 50%+ reduction in full evaluation calls.

Summary

  • Adds a lightweight Stage 1 pre-filter in BaseEvaluator that classifies interactions as rule-bearing or not before running full extraction (see the sketch after this list)
  • Non-rule interactions with confidence > 0.8 skip Stage 2 entirely
  • --skip-prefilter CLI flag to disable Stage 1
  • Dashboard shows prefilter skip rate metrics
  • Fixes the bool("false") == True bug from PR #20 with a proper _parse_bool() function
  • All evaluator subclasses updated to accept and forward **kwargs
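
A minimal sketch of the Stage 1 gate, assuming illustrative names (`_prefilter_interaction`, `_full_evaluation`, `PREFILTER_SKIP_CONFIDENCE`) rather than the exact identifiers in the diff:

```python
# Sketch only: shapes follow the summary above, not the actual source.
from dataclasses import dataclass
from typing import Optional

PREFILTER_SKIP_CONFIDENCE = 0.8  # confident non-rule results skip Stage 2


@dataclass
class PrefilterResult:
    is_rule_bearing: Optional[bool]  # None = unparseable output, fail open
    confidence: float = 0.0

    @property
    def should_skip_full_eval(self) -> bool:
        # Only a confident "not rule-bearing" verdict skips Stage 2.
        return (
            self.is_rule_bearing is False
            and self.confidence > PREFILTER_SKIP_CONFIDENCE
        )


class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs):
        self.skip_prefilter = skip_prefilter

    def _prefilter_interaction(self, interaction) -> PrefilterResult:
        raise NotImplementedError  # Stage 1: one cheap classification call

    def _full_evaluation(self, interaction):
        raise NotImplementedError  # Stage 2: full rule extraction

    def evaluate_interaction(self, interaction):
        if not self.skip_prefilter:
            try:
                if self._prefilter_interaction(interaction).should_skip_full_eval:
                    return None  # Stage 2 skipped entirely
            except Exception:
                pass  # fail open: prefilter errors fall through to full eval
        return self._full_evaluation(interaction)
```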

Replaces PR #20 (rebuilt cleanly on upstream/main using the BaseEvaluator pattern).

Closes #14

Testing evidence (live, Python 3.12 + mcp, Claude CLI)

| Check | Result |
| --- | --- |
| Full test suite: 84/84 passed | ✅ PASS |
| Live: non-rule interaction skipped by prefilter (5.7s, 1 LLM call) | ✅ PASS |
| Live: rule interaction passed through, extracted correctly (13.2s, 2 LLM calls) | ✅ PASS |
| Live: skip_prefilter=True bypasses Stage 1 (metrics untouched) | ✅ PASS |
| Live: prefilter skip rate = 50% (1 skipped / 2 total) | ✅ PASS |
| Live: extracted rule scope=GLOBAL, content includes snake_case directive | ✅ PASS |

Live test details

Test 1 (non-rule "help me debug"): prefilter skipped=1, result=None, 5.7s
Test 2 (rule "use snake_case"):    prefilter passed=1, scope=GLOBAL, 13.2s
Test 3 (skip_prefilter=True):      metrics 0/0, full eval only, 12.4s

Issues found and fixed during testing

  • test_daemons.py: the patched evaluator classes are no longer imported in main.py → fixed by patching get_evaluator instead (see the sketch after this list)
  • test_gemini_cli_llm.py: the prefilter adds a second subprocess call → fixed with skip_prefilter=True
  • Dashboard generate_layout: MagicMock metrics caused a TypeError → fixed with an isinstance guard
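
A hedged sketch of the first fix; the module path and test body are assumptions, but the pattern is the one described above: patch `get_evaluator` where it is looked up, not the removed class imports.

```python
# Illustrative only: "context_scribe.main" as the lookup site is an assumption.
from unittest.mock import MagicMock, patch


def test_daemon_uses_evaluator_factory():
    fake_evaluator = MagicMock()
    fake_evaluator.evaluate_interaction.return_value = None
    # Patch the factory in the module that calls it, not where it is defined,
    # so the daemon under test receives the mock.
    with patch("context_scribe.main.get_evaluator", return_value=fake_evaluator):
        ...  # drive the daemon, then assert on fake_evaluator.evaluate_interaction
```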

🤖 Generated with Claude Code

Adds a lightweight Stage 1 pre-filter to BaseEvaluator that classifies
interactions as rule-bearing or not before running full extraction.
Non-rule interactions with confidence > 0.8 skip Stage 2, targeting
50%+ reduction in LLM calls.

Key design decisions:
- Prefilter is implemented in BaseEvaluator, so all evaluators get it
- _parse_bool() fixes the bool("false")==True bug from PR Joaolfelicio#20
- Fail-open: prefilter errors pass through to full eval
- --skip-prefilter CLI flag to disable Stage 1
- Dashboard shows prefilter skip rate metrics

Closes Joaolfelicio#14

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:41

Copilot AI left a comment


Pull request overview

Implements a two-stage evaluation pipeline by adding a lightweight prefilter step in BaseEvaluator to decide whether to skip full rule extraction, plus CLI/dashboard plumbing to expose prefilter behavior.

Changes:

  • Added PrefilterResult / PrefilterMetrics models and integrated Stage 1 prefiltering into BaseEvaluator.evaluate_interaction().
  • Added --skip-prefilter flag and surfaced prefilter skipped/total stats in the dashboard footer.
  • Added a new prefilter prompt template and a new test suite covering prefilter logic and _parse_bool().

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| context_scribe/evaluator/base_evaluator.py | Adds prefilter stage, metrics tracking, and _parse_bool() for LLM JSON booleans. |
| context_scribe/models/evaluator_models.py | Introduces prefilter dataclasses and metrics helpers. |
| context_scribe/evaluator/prefilter_template.md | Provides the Stage 1 prompt template for rule-bearing classification. |
| context_scribe/evaluator/__init__.py | Updates get_evaluator() to accept and forward **kwargs. |
| context_scribe/main.py | Wires --skip-prefilter through to evaluator creation and displays prefilter stats in the dashboard. |
| tests/test_prefilter.py | Adds unit and integration-style tests for prefilter and boolean parsing. |


All evaluator subclasses now accept and forward **kwargs to
BaseEvaluator.__init__(), allowing skip_prefilter and future
params to propagate correctly via get_evaluator().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
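
A minimal sketch of the forwarding pattern this commit applies; the subclass and its `model` parameter are invented for illustration, and only `skip_prefilter` and `get_evaluator()` come from the PR itself.

```python
# Sketch of **kwargs propagation through get_evaluator(); names are illustrative.
class BaseEvaluator:
    def __init__(self, skip_prefilter: bool = False, **kwargs):
        self.skip_prefilter = skip_prefilter


class CliEvaluator(BaseEvaluator):
    def __init__(self, model: str = "default", **kwargs):
        super().__init__(**kwargs)  # skip_prefilter (and future params) propagate
        self.model = model


def get_evaluator(name: str, **kwargs) -> BaseEvaluator:
    registry = {"cli": CliEvaluator}
    return registry[name](**kwargs)


evaluator = get_evaluator("cli", skip_prefilter=True)
assert evaluator.skip_prefilter  # reached BaseEvaluator via the subclass
```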
@don-petry
Collaborator Author

@Joaolfelicio - What do you think about this enhancement idea?

- test_daemons.py: patch get_evaluator instead of removed direct imports
- test_gemini_cli_llm.py: use skip_prefilter=True in CLI flag test
  (prefilter adds a second subprocess.run call)
- main.py: guard prefilter metrics sync with isinstance check to
  prevent TypeError when evaluator is mocked

Found via full test suite run with Python 3.12 + mcp installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 17:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Address two Copilot review comments:

1. _parse_bool() now returns None for unrecognised/null values instead
   of defaulting to False. _pre_evaluate() treats None as pass-through
   to full evaluation, ensuring malformed LLM output doesn't silently
   skip rule extraction (fail-open behaviour).

2. Template loading in BaseEvaluator.__init__() now uses
   importlib.resources.files() instead of __file__-relative Path reads,
   so templates are accessible in packaged installs (wheel/zip).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
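
A sketch of both fixes under stated assumptions: the accepted token lists in `_parse_bool()` are illustrative, while the package and template file names come from this PR's changed-files list.

```python
from importlib import resources
from typing import Optional


def _parse_bool(value) -> Optional[bool]:
    """Parse a boolean from LLM JSON output.

    Returns None for null/unrecognised values so the caller can fail open,
    avoiding the bool("false") == True trap (any non-empty string is truthy).
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "yes", "1"):
            return True
        if lowered in ("false", "no", "0"):
            return False
    return None  # unrecognised: pass through to full evaluation


# Package-safe template loading: works from a wheel/zip, where
# __file__-relative Path reads do not.
template = (
    resources.files("context_scribe.evaluator")
    .joinpath("prefilter_template.md")
    .read_text(encoding="utf-8")
)
```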
Collaborator Author

@don-petry don-petry left a comment


Automated review — NEEDS HUMAN REVIEW

Risk: MEDIUM
Reviewed commit: 987bb4693058055ba6e91a2a8e5f75b9b2c9b10a
Council vote: security=MEDIUM · correctness=MEDIUM · maintainability=MEDIUM

Summary

The two-stage prefilter pipeline is well-structured, following existing BaseEvaluator patterns, and CI is fully green. The security lens found no new vulnerabilities; format-string and **kwargs patterns are pre-existing. The correctness lens confirms the implementation matches issue #14's acceptance criteria with all prior Copilot findings resolved. The maintainability lens flags four minor issues — missing docstrings, a magic confidence threshold, test setup duplication, and a test/production inconsistency in template loading — plus an unanswered question from collaborator don-petry that gates merge under the repo's review policy.

Linked issue analysis

Issue #14 (two-stage evaluation pipeline with prefilter): The diff implements _prefilter_interaction() in BaseEvaluator with fail-open semantics, PrefilterResult/PrefilterMetrics models, a --skip-prefilter CLI flag, dashboard prefilter stats, a prefilter prompt template, and a 244-line test suite covering boundary conditions, error paths, and the skip flag. All Copilot first-cycle findings (constructor kwarg propagation, _parse_bool fail-open, importlib.resources template loading) were addressed in follow-up commits.

Findings

Minor

  • [minor] correctness + maintainability · tests/test_prefilter.py:120-121 — Tests load templates via Path(__file__).parent.parent rather than importlib.resources.files(), inconsistent with the production path changed in this PR. The importlib code path is not exercised by any test; a missing package-data entry would only surface at install time. (Both correctness and maintainability lenses flagged this.)
  • [minor] maintainability · tests/test_prefilter.py — ~15 lines of setup code (patching __init__, reading template files) are duplicated across 4 test functions. A pytest fixture would reduce coupling and clarify intent for future maintainers.
  • [minor] correctness · context_scribe/evaluator/base_evaluator.py — The prefilter JSON extraction regex uses [^}]* which silently fails if the LLM wraps output in a parent JSON object with nested braces. Fail-open behavior makes this non-breaking in practice.
  • [minor] maintainability · context_scribe/models/evaluator_models.py:27 — Confidence threshold 0.8 in PrefilterResult.should_skip_full_eval is a magic number with no named constant or config surface for per-evaluator or per-deployment tuning.
  • [minor] maintainability · context_scribe/models/evaluator_models.py — CONTRIBUTING.md requires Google-style docstrings for public classes/methods. PrefilterResult.should_skip_full_eval (property) and PrefilterMetrics.record_result are missing docstrings needed for auto-generated docs and for authors extending BaseEvaluator.
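
For the JSON-extraction finding above, a hypothetical fix (the helper name is invented) that decodes with `json.JSONDecoder.raw_decode` instead of a brace regex, so objects with nested braces parse correctly:

```python
import json


def extract_first_json_object(text: str) -> dict | None:
    """Decode the first JSON object found in LLM output.

    Unlike a `[^}]*`-style regex, raw_decode consumes a complete value,
    so a parent object with nested braces is handled correctly.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None  # let the caller fail open to full evaluation
```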

Info

  • [info] security · context_scribe/evaluator/base_evaluator.py:56 — Prefilter template uses str.format() with user-controlled interaction.content; curly-brace patterns in content could cause KeyError (DoS of that evaluation) or leak internal_signature. Pre-existing pattern from the full-evaluation path; not newly introduced.
  • [info] security · context_scribe/evaluator/base_evaluator.py:18 — Prefilter is correctly fail-open: timeouts, parse errors, and unrecognized _parse_bool values all pass through to full evaluation. Appropriate since the prefilter is a cost optimisation, not a security gate.
  • [info] security · context_scribe/evaluator/__init__.py:16 — Open **kwargs in get_evaluator() allows arbitrary kwargs to propagate to BaseEvaluator.__init__(). No injection risk at present (all call sites are internal CLI via Click).
  • [info] correctness · context_scribe/models/evaluator_models.py:45 — PrefilterMetrics.record_result() increments both prefilter_errors and prefilter_passed when result is None, conflating genuine passes with error fall-throughs in the dashboard display. Skip-rate calculation remains correct.
  • [info] correctness + maintainability — Collaborator don-petry left an unanswered question for @Joaolfelicio (2026-04-05) about an enhancement idea. Not a change request, but the repo's review policy gates on all reviewer questions being resolved. (Both correctness and maintainability lenses flagged this.)
  • [info] maintainability · context_scribe/main.py:263 — PrefilterMetrics.prefilter_errors is tracked but never surfaced in the dashboard or logged at INFO level; silent failures would be invisible in production.
  • [info] maintainability · context_scribe/main.py:265 — The guard isinstance(getattr(metrics, 'prefilter_passed', None), int) is unusual; isinstance(metrics, PrefilterMetrics) would be more idiomatic and self-documenting.
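
And for the last finding, a tiny self-contained sketch of the more idiomatic guard; the `prefilter_metrics` attribute name and the stand-in dataclass are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PrefilterMetrics:  # stand-in for the real model in evaluator_models.py
    prefilter_passed: int = 0
    prefilter_skipped: int = 0
    prefilter_errors: int = 0


def get_dashboard_metrics(evaluator):
    metrics = getattr(evaluator, "prefilter_metrics", None)  # attribute name assumed
    # A type check is self-documenting and also shields against MagicMock objects.
    return metrics if isinstance(metrics, PrefilterMetrics) else None
```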

CI status

All CI checks are green (pip-audit, full test suite). No failing or pending checks.


Reviewed automatically by the don-petry PR-review council (security: opus 4.6 · correctness: sonnet 4.6 · maintainability: sonnet 4.6 · synthesis: sonnet 4.6). The marker on line 1 lets the agent detect new commits and re-review. Reply with @don-petry if you need a human.

@don-petry don-petry added the needs-human-review label (Flagged by automated PR review agent) Apr 10, 2026
@don-petry
Collaborator Author

Replaced by #29 — the fork (don-petry/context-scribe) is archived and read-only, so this branch can't be updated. #29 contains the same changes rebased on current main (including PR #19's merge) with all review comments addressed.

@don-petry don-petry closed this Apr 10, 2026