
Improve probe/score semantics#171

Merged
wkusnierczyk merged 2 commits into main from dev/168-probe-score
Feb 23, 2026

@wkusnierczyk
Owner

Summary

Changes

Matching improvements (src/tester.rs):

  • Replace 11-rule suffix stripper with Snowball English stemmer for better inflection handling
  • Add 15 synonym groups for common skill-domain terms (expand query tokens only)
  • Graduate trigger score from binary (0/1) to fraction of query tokens matched
  • Graduate name score from binary (0/1) to fraction of query tokens matched
  • Use shared TRIGGER_PHRASES from linter with contains (was starts_with on 2 phrases)
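
For readers who want to see what "fraction of query tokens matched" means in practice, here is a minimal sketch. The function name `fraction_matched` and the plain `&str` token representation are assumptions for illustration; the code in src/tester.rs may be organized differently, and stemming is left out here.

```rust
use std::collections::HashSet;

/// Fraction of distinct query tokens that also appear in `field_tokens`.
/// Hypothetical helper; returns 0.0 for an empty query to avoid dividing by zero.
fn fraction_matched(query_tokens: &[&str], field_tokens: &[&str]) -> f64 {
    // Deduplicate so a repeated query word is only counted once.
    let query: HashSet<&str> = query_tokens.iter().copied().collect();
    if query.is_empty() {
        return 0.0;
    }
    let field: HashSet<&str> = field_tokens.iter().copied().collect();
    let hits = query.iter().filter(|t| field.contains(*t)).count();
    hits as f64 / query.len() as f64
}
```

Under the old binary scheme a query that matched one of two tokens would score either 0 or 1; with the graduated scheme sketched here it scores 0.5.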

Scoring improvements (src/scorer.rs):

  • Proportional structural scoring: 6 checks × 10 points each (was all-or-nothing 0 or 60)
  • Add STRUCTURAL_POINTS_PER_CHECK = 10 constant
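
A rough sketch of the proportional structural score, contrasted with the previous all-or-nothing behavior. Only the constant name and its value of 10 come from this PR; the function names and the `&[bool]` representation of check results are assumptions made for the example.

```rust
/// Points awarded per passing structural check (value taken from the PR summary).
const STRUCTURAL_POINTS_PER_CHECK: u32 = 10;

/// Proportional scoring: every passing check earns the same number of points.
fn structural_score(checks: &[bool]) -> u32 {
    let passed = checks.iter().filter(|&&ok| ok).count() as u32;
    passed * STRUCTURAL_POINTS_PER_CHECK
}

/// Maximum derived from the number of checks, so it cannot drift from the list.
fn structural_max(checks: &[bool]) -> u32 {
    checks.len() as u32 * STRUCTURAL_POINTS_PER_CHECK
}

/// Previous behavior, shown for comparison: full marks only if every check passes.
fn structural_score_all_or_nothing(checks: &[bool]) -> u32 {
    if checks.iter().all(|&ok| ok) {
        structural_max(checks)
    } else {
        0
    }
}
```

With six checks and one failure, the proportional variant yields 50 where the all-or-nothing variant yields 0.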

Shared constants (src/linter.rs, src/cli/upgrade.rs):

  • Make TRIGGER_PHRASES pub (was private) for cross-module use
  • Upgrade module now uses shared TRIGGER_PHRASES instead of hardcoded "use when"/"use this when"
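
The difference between the old `starts_with` check and the shared `contains` check can be illustrated as follows. This is a sketch under assumptions: the phrase list below holds only the two phrases named in this PR (the real TRIGGER_PHRASES in src/linter.rs may hold more), the helper names are hypothetical, and case-insensitive matching is assumed.

```rust
/// Illustrative list; only these two phrases are named in the PR text.
pub const TRIGGER_PHRASES: &[&str] = &["use when", "use this when"];

/// True if any trigger phrase occurs anywhere in the description.
/// Lowercasing is an assumption of this sketch.
pub fn has_trigger_phrase(description: &str) -> bool {
    let lower = description.to_lowercase();
    TRIGGER_PHRASES.iter().any(|p| lower.contains(*p))
}

/// Stricter prefix-only variant, shown for contrast with the old behavior.
pub fn starts_with_trigger_phrase(description: &str) -> bool {
    let lower = description.to_lowercase();
    TRIGGER_PHRASES.iter().any(|p| lower.starts_with(*p))
}
```

A description such as "Parses PDFs. Use when extracting text." is detected by the `contains` variant but missed by the prefix-only one.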

Documentation (docs/cli.md):

  • Add limitations notes for probe and score commands
  • Update structural scoring description from all-or-nothing to proportional
  • Update example scores to reflect new scoring

Breaking changes: Probe scores and score totals change for existing skills.

Test plan

  • cargo test — 527 lib + 144 CLI + 27 plugin + 1 doc = 699 passing
  • cargo clippy -- -D warnings — clean
  • cargo fmt --check — clean
  • cargo test --test anthropics_skills -- --ignored — 64 integration tests pass

🤖 Generated with Claude Code

Replace hand-rolled stemmer with Snowball (rust-stemmers), add synonym
expansion for query tokens, graduate trigger/name scores from binary to
fractional, soften structural scoring from all-or-nothing to proportional
(10 per check), and unify trigger phrase definitions across linter,
tester, and upgrade modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 23, 2026 16:42
@wkusnierczyk added the task (Task / chore) and medium (Medium priority) labels on Feb 23, 2026
@wkusnierczyk self-assigned this on Feb 23, 2026

Copilot AI left a comment

Pull request overview

This PR improves the semantics and documentation of the probe/score heuristics by upgrading token normalization (Snowball stemming), adding query-side synonym expansion, graduating trigger/name scoring from binary to proportional, softening structural scoring to be proportional per check, and centralizing trigger phrase detection across modules.

Changes:

  • Replace the hand-rolled stemmer with rust-stemmers (Snowball) and add query-side synonym expansion for probe matching.
  • Make trigger/name scoring proportional (fraction matched) and unify trigger phrase detection via shared TRIGGER_PHRASES.
  • Change structural scoring from all-or-nothing 60 points to proportional per-check points; update CLI docs/examples and changelog accordingly.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/cli.rs | Updates a CLI test fixture description to include a trigger phrase line. |
| src/tester.rs | Adds Snowball stemming, synonym expansion, and graduated trigger/name scoring; reuses shared trigger phrases. |
| src/scorer.rs | Implements proportional structural scoring (10 points per passing structural check). |
| src/linter.rs | Exposes TRIGGER_PHRASES publicly for shared use across modules. |
| src/cli/upgrade.rs | Uses shared TRIGGER_PHRASES for trigger phrase detection. |
| docs/cli.md | Documents new probe/score limitations and updates scoring descriptions/examples. |
| Cargo.toml | Adds rust-stemmers dependency. |
| Cargo.lock | Locks rust-stemmers and transitive deps. |
| CHANGES.md | Notes breaking changes due to updated probe/score semantics. |

…ural max

- Fix synonym expansion inflating denominator: use original query_set.len()
  instead of expanded_query.len() so synonyms can only help, never hurt
- Cache Snowball stemmer in LazyLock instead of creating per call
- Derive structural max from checks.len() * STRUCTURAL_POINTS_PER_CHECK
  at runtime; remove STRUCTURAL_MAX from production code
- Fix W002 double-counting: "No unknown fields" now checks W001 specifically
  instead of any warning
- Add dedicated synonym expansion tests (group expansion, unmatched tokens,
  Snowball stem verification, score non-regression)
- Restore strict Strong assertion in test_skill_returns_result_for_valid_skill

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
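
To make the denominator fix concrete, here is a small sketch of both variants. The synonym groups, function names, and the exact way hits are counted are assumptions for illustration and are not taken from src/tester.rs; only the choice of denominator (original unique query tokens versus expanded tokens) reflects the commit message.

```rust
use std::collections::HashSet;

/// Hypothetical synonym groups; the PR's 15 groups are not listed on this page.
const SYNONYM_GROUPS: &[&[&str]] = &[
    &["create", "generate", "build"],
    &["test", "verify", "check"],
];

/// A token plus every member of any synonym group that contains it.
fn expand(token: &str) -> HashSet<&str> {
    let mut out = HashSet::new();
    out.insert(token);
    for group in SYNONYM_GROUPS {
        if group.iter().any(|s| *s == token) {
            out.extend(group.iter().copied());
        }
    }
    out
}

/// Fixed variant: an original token counts as matched if it or any synonym
/// appears in the field; the denominator is the original unique token count.
fn match_score(query_tokens: &[&str], field_tokens: &[&str]) -> f64 {
    let query_set: HashSet<&str> = query_tokens.iter().copied().collect();
    if query_set.is_empty() {
        return 0.0;
    }
    let field: HashSet<&str> = field_tokens.iter().copied().collect();
    let hits = query_set
        .iter()
        .filter(|&&t| expand(t).iter().any(|alt| field.contains(alt)))
        .count();
    hits as f64 / query_set.len() as f64
}

/// Problematic variant: dividing by the expanded set size penalizes queries
/// whose tokens happen to have synonyms.
fn naive_expanded_score(query_tokens: &[&str], field_tokens: &[&str]) -> f64 {
    let expanded: HashSet<&str> = query_tokens.iter().flat_map(|&t| expand(t)).collect();
    if expanded.is_empty() {
        return 0.0;
    }
    let field: HashSet<&str> = field_tokens.iter().copied().collect();
    let hits = expanded.iter().filter(|t| field.contains(*t)).count();
    hits as f64 / expanded.len() as f64
}
```

For the query "build pdf" against a field containing exactly "build" and "pdf", the naive variant scores 2/4 because "create" and "generate" inflate the denominator, while the fixed variant scores 2/2.
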
@wkusnierczyk
Owner Author

wkusnierczyk commented Feb 23, 2026

Review response (b6220bd)

| # | Finding | Severity | Resolution |
| --- | --- | --- | --- |
| 1 | Synonym expansion inflates denominator, can reduce scores | Medium | Fixed: denominator now uses query_set.len() (original unique tokens) instead of expanded_query.len(). Synonyms can only help, never hurt. |
| 2 | Stemmer::create() called per word | Low | Fixed: cached in static LazyLock<Stemmer>. |
| 3 | STRUCTURAL_MAX can drift from checks count | Low | Fixed: removed from production code; max derived at runtime from checks.len() * STRUCTURAL_POINTS_PER_CHECK. |
| 4 | Probe test assertion relaxed (Strong \| Weak) | Low | Fixed: restored strict Strong assertion now that denominator fix resolves the score regression. |
| 5 | Synonym groups lack dedicated tests | Low | Fixed: added 4 tests: group expansion, unmatched tokens, Snowball stem verification, score non-regression. |
| 6 | Deprecated assert_cmd API | Low | Out of scope: pre-existing across all test files (tracked separately). |
| 7 | W002 double-counting in "No unknown fields" check | Info | Fixed: now checks d.code == "W001" specifically instead of d.is_warning(). |
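
The caching pattern behind finding 2 can be sketched with `std::sync::LazyLock` (stable since Rust 1.80). To keep the example dependency-free, `ToyStemmer` below is a hypothetical stand-in and not the rust-stemmers `Stemmer` type; the counter exists only to make the one-time construction observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts constructor calls so the one-time initialization can be observed.
static CREATED: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an expensive-to-build stemmer (hypothetical type).
struct ToyStemmer;

impl ToyStemmer {
    fn create() -> Self {
        CREATED.fetch_add(1, Ordering::SeqCst);
        ToyStemmer
    }

    /// Toy rule: strip one trailing "s". Real Snowball stemming is far richer.
    fn stem(&self, word: &str) -> String {
        word.strip_suffix('s').unwrap_or(word).to_string()
    }
}

/// Built lazily on first use, then shared by every later call.
static STEMMER: LazyLock<ToyStemmer> = LazyLock::new(ToyStemmer::create);

/// Stems a word using the cached instance instead of constructing one per call.
fn stem(word: &str) -> String {
    STEMMER.stem(word)
}
```

However many words are stemmed, the constructor runs once, which is the property the review finding asked for.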

532 lib + 144 CLI + 27 plugin + 64 integration tests passing. Clippy + fmt clean.

@wkusnierczyk wkusnierczyk merged commit d451130 into main Feb 23, 2026
4 checks passed

Labels

medium (Medium priority), task (Task / chore)

Development

Successfully merging this pull request may close these issues.

[TASK] Improve probe/score matching semantics

2 participants