Skip to content

Improve probe/score semantics#170

Closed
wkusnierczyk wants to merge 1 commit intomainfrom
improve-probe-score-semantics
Closed

Improve probe/score semantics#170
wkusnierczyk wants to merge 1 commit intomainfrom
improve-probe-score-semantics

Conversation

@wkusnierczyk
Copy link
Owner

Summary

Changes

Matching improvements (src/tester.rs):

  • Replace 11-rule suffix stripper with Snowball English stemmer for better inflection handling
  • Add 15 synonym groups for common skill-domain terms (expand query tokens only)
  • Graduate trigger score from binary (0/1) to fraction of query tokens matched
  • Graduate name score from binary (0/1) to fraction of query tokens matched
  • Use shared TRIGGER_PHRASES from linter with contains (was starts_with on 2 phrases)

Scoring improvements (src/scorer.rs):

  • Proportional structural scoring: 6 checks × 10 points each (was all-or-nothing 0 or 60)
  • Add STRUCTURAL_POINTS_PER_CHECK = 10 constant

Shared constants (src/linter.rs, src/cli/upgrade.rs):

  • Make TRIGGER_PHRASES pub (was private) for cross-module use
  • Upgrade module now uses shared TRIGGER_PHRASES instead of hardcoded "use when"/"use this when"

Documentation (docs/cli.md):

  • Add limitations notes for probe and score commands
  • Update structural scoring description from all-or-nothing to proportional
  • Update example scores to reflect new scoring

Breaking changes: Probe scores and score totals change for existing skills.

Test plan

  • cargo test — 527 lib + 144 CLI + 27 plugin + 1 doc = 699 passing
  • cargo clippy -- -D warnings — clean
  • cargo fmt --check — clean
  • cargo test --test anthropics_skills -- --ignored — 64 integration tests pass

🤖 Generated with Claude Code

Replace hand-rolled stemmer with Snowball (rust-stemmers), add synonym
expansion for query tokens, graduate trigger/name scores from binary to
fractional, soften structural scoring from all-or-nothing to proportional
(10 per check), and unify trigger phrase definitions across linter,
tester, and upgrade modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 23, 2026 16:36
@wkusnierczyk wkusnierczyk added task Task / chore medium Medium priority labels Feb 23, 2026
@wkusnierczyk wkusnierczyk self-assigned this Feb 23, 2026
@wkusnierczyk wkusnierczyk deleted the improve-probe-score-semantics branch February 23, 2026 16:41
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request improves the semantic matching and scoring capabilities of the probe and score commands, addressing issue #168. The changes replace a simple 11-rule suffix stripper with the Snowball English stemmer, add synonym expansion for common skill-domain terms, graduate binary trigger/name scores to proportional scoring, and make structural scoring proportional rather than all-or-nothing.

Changes:

  • Replace hand-rolled stemmer with Snowball stemmer and add synonym expansion to improve matching recall
  • Graduate trigger and name scores from binary (0/1) to proportional (fraction of query tokens matched)
  • Change structural scoring from all-or-nothing (0 or 60 points) to proportional (10 points per passing check)

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/tester.rs Replaces suffix stripper with Snowball stemmer, adds 15 synonym groups, implements graduated trigger/name scoring, and uses shared TRIGGER_PHRASES
src/scorer.rs Implements proportional structural scoring (10 points per check instead of all-or-nothing 60)
src/linter.rs Makes TRIGGER_PHRASES public for cross-module sharing
src/cli/upgrade.rs Uses shared TRIGGER_PHRASES instead of hardcoded phrases
tests/cli.rs Updates test to include trigger phrase in description for compatibility with new scoring
docs/cli.md Adds limitations sections for probe and score, updates scoring descriptions, adjusts example scores
Cargo.toml Adds rust-stemmers dependency
Cargo.lock Locks rust-stemmers at version 1.2.0
CHANGES.md Documents breaking changes to probe scores and score totals

}
expanded
}

Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No tests for the new stem function or expand_synonyms function. These are significant new features that should have unit tests to verify:

  1. That common words stem correctly (e.g., "validating" → "valid", "parsed" → "pars")
  2. That synonym expansion works as expected (e.g., "validate" expands to include "check", "verify", "lint")
  3. That synonym expansion only applies to query tokens, not description tokens
  4. Edge cases like empty input, tokens with no synonyms, multiple synonyms matching

Without tests, it's difficult to verify that the synonym groups contain the correct stemmed forms, or that the synonym expansion logic works correctly.

Suggested change
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn stem_common_words() {
// Verify that common words are stemmed as expected.
assert_eq!(stem("validating"), "valid");
assert_eq!(stem("parsed"), "pars");
}
#[test]
fn expand_synonyms_on_single_token_in_group() {
// Use the first synonym group to avoid hard-coding specific values.
let group = SYNONYM_GROUPS[0];
let token = group[0].to_string();
let expanded = expand_synonyms(&[token]);
// All members of the group should be present in the expanded set.
for &syn in group {
assert!(
expanded.contains(syn),
"expected expanded set to contain synonym '{}'",
syn
);
}
}
#[test]
fn expand_synonyms_empty_input_yields_empty_set() {
let expanded = expand_synonyms(&[]);
assert!(expanded.is_empty());
}
#[test]
fn expand_synonyms_token_with_no_synonyms_is_preserved_only() {
// Choose a token that is very unlikely to appear in any synonym group.
let token = "uniquetokenwithnosynonyms".to_string();
let expanded = expand_synonyms(&[token.clone()]);
assert_eq!(expanded.len(), 1);
assert!(expanded.contains(token.as_str()));
}
#[test]
fn expand_synonyms_multiple_tokens_in_same_group_do_not_duplicate() {
let group = SYNONYM_GROUPS[0];
assert!(
group.len() >= 2,
"expected at least two entries in the first synonym group"
);
let tokens: Vec<String> =
group.iter().take(2).map(|s| s.to_string()).collect();
let expanded = expand_synonyms(&tokens);
// All members of the group are present, but there are no duplicates.
for &syn in group {
assert!(
expanded.contains(syn),
"expected expanded set to contain synonym '{}'",
syn
);
}
}
#[test]
fn tokenize_does_not_apply_synonym_expansion() {
// Tokenization should only stem and filter stopwords, not expand synonyms.
// Using "validate"/"validating" as an example to ensure only the stem appears.
let tokens = tokenize("validating");
// We expect exactly one stemmed token and no additional synonyms.
assert_eq!(tokens.len(), 1);
assert_eq!(tokens[0], stem("validating"));
}
}

Copilot uses AI. Check for mistakes.
Comment on lines 517 to 520
- **Quality (40 points)** — Five semantic lint checks worth 8 points each:
third-person description, trigger phrase (`"Use when..."`), gerund name form
(`converting-pdfs` not `pdf-converter`), specific (non-generic) name, and
description length (≥ 20 words).
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation inaccuracy: The description says "description length (≥ 20 words)" but the actual lint check in src/linter.rs:144 is description.chars().count() < 20 || word_count < 4, which means the description must have both ≥ 20 characters AND ≥ 4 words.

Update the documentation to accurately describe the check as requiring both a minimum character count (20) and minimum word count (4), not just "≥ 20 words".

Copilot uses AI. Check for mistakes.
Comment on lines +257 to +273
const SYNONYM_GROUPS: &[&[&str]] = &[
&["valid", "check", "verifi", "lint"],
&["pars", "extract", "read"],
&["format", "style", "clean"],
&["build", "assembl", "compil"],
&["test", "probe", "evalu"],
&["creat", "generat", "new"],
&["fix", "repair", "correct"],
&["upgrad", "improv", "enhanc"],
&["document", "describ", "explain"],
&["delet", "remov", "strip"],
&["instal", "setup", "configur"],
&["deploy", "publish", "releas"],
&["search", "find", "discov", "locat"],
&["transform", "convert", "process"],
&["analyz", "inspect", "review"],
];
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synonym groups contain manually created "stemmed" forms that may not match the Snowball English stemmer output. This could cause synonym expansion to fail if a query token's actual Snowball stem doesn't match the manually created forms in SYNONYM_GROUPS.

For example, if "validate" is stemmed by the Snowball algorithm but doesn't match the "valid" in the synonym group, the expansion won't work. To ensure correctness, either:

  1. Pre-stem all words in SYNONYM_GROUPS using the actual Snowball stemmer, or
  2. Add tests to verify that common word forms correctly match their expected synonyms

Without verification, synonym expansion may silently fail to expand queries as intended.

Copilot uses AI. Check for mistakes.
Comment on lines +357 to 361
let desc_overlap = if expanded_query.is_empty() {
0.0
} else {
intersection as f64 / query_set.len() as f64
intersection as f64 / expanded_query.len() as f64
};
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistency in denominator for description overlap calculation. The description overlap divides by expanded_query.len() (which includes synonyms), while trigger and name scores divide by the original query_tokens.len(). This creates asymmetric scoring where synonym expansion deflates the description overlap score relative to trigger/name scores.

For example, if a 2-token query expands to 6 tokens via synonyms, and all 6 match the description, the description overlap would be 1.0 (6/6). But if only the original 2 tokens match the trigger, the trigger score would be 1.0 (2/2). Both are "perfect" matches, but the weighted formula would work differently.

Consider using query_tokens.len() consistently across all three components for more intuitive weighting, or document why the expanded set should be used for description but not for trigger/name.

Copilot uses AI. Check for mistakes.
Comment on lines 276 to 279
fn stem(word: &str) -> String {
let w = word.to_lowercase();
// Order matters: check longer suffixes first.
for suffix in &[
"ting", "sing", "zing", "ning", "ring", "ses", "ies", "ing", "ed", "es", "s",
] {
if w.len() > suffix.len() + 2 {
if let Some(root) = w.strip_suffix(suffix) {
return root.to_string();
}
}
}
w
let stemmer = Stemmer::create(Algorithm::English);
stemmer.stem(&word.to_lowercase()).into_owned()
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance issue: A new Stemmer instance is created for every word that needs stemming. The stem function is called inside tokenize, which processes every token in queries, descriptions, and trigger phrases. For a description with 50 words, this creates 50 separate Stemmer instances.

The Stemmer should be created once and reused. Consider using a LazyLock static variable (similar to how PERSON_RE is defined in src/linter.rs) or passing a reference to a shared Stemmer instance. This would significantly reduce overhead, especially when processing multiple skills or running batch operations.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

medium Medium priority task Task / chore

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TASK] Improve probe/score matching semantics

2 participants