Replace hand-rolled stemmer with Snowball (rust-stemmers), add synonym expansion for query tokens, graduate trigger/name scores from binary to fractional, soften structural scoring from all-or-nothing to proportional (10 per check), and unify trigger phrase definitions across linter, tester, and upgrade modules. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This pull request improves the semantic matching and scoring capabilities of the probe and score commands, addressing issue #168. The changes replace a simple 11-rule suffix stripper with the Snowball English stemmer, add synonym expansion for common skill-domain terms, graduate binary trigger/name scores to proportional scoring, and make structural scoring proportional rather than all-or-nothing.
Changes:
- Replace hand-rolled stemmer with Snowball stemmer and add synonym expansion to improve matching recall
- Graduate trigger and name scores from binary (0/1) to proportional (fraction of query tokens matched)
- Change structural scoring from all-or-nothing (0 or 60 points) to proportional (10 points per passing check)
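
A minimal sketch of the proportional structural scoring described above, assuming six structural checks (60 points at 10 points each); `structural_score` is an illustrative name, not necessarily the crate's actual API:

```rust
// Each passing structural check contributes a fixed number of points,
// replacing the previous all-or-nothing 0-or-60 score.
const STRUCTURAL_POINTS_PER_CHECK: u32 = 10;

fn structural_score(checks_passed: &[bool]) -> u32 {
    checks_passed.iter().filter(|&&p| p).count() as u32 * STRUCTURAL_POINTS_PER_CHECK
}

fn main() {
    // 4 of 6 checks pass: 40 points instead of 0 under the old scheme.
    assert_eq!(structural_score(&[true, true, false, true, true, false]), 40);
    // All checks pass: same 60-point maximum as before.
    assert_eq!(structural_score(&[true; 6]), 60);
}
```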
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/tester.rs` | Replaces suffix stripper with Snowball stemmer, adds 15 synonym groups, implements graduated trigger/name scoring, and uses shared `TRIGGER_PHRASES` |
| `src/scorer.rs` | Implements proportional structural scoring (10 points per check instead of all-or-nothing 60) |
| `src/linter.rs` | Makes `TRIGGER_PHRASES` public for cross-module sharing |
| `src/cli/upgrade.rs` | Uses shared `TRIGGER_PHRASES` instead of hardcoded phrases |
| `tests/cli.rs` | Updates test to include trigger phrase in description for compatibility with new scoring |
| `docs/cli.md` | Adds limitations sections for probe and score, updates scoring descriptions, adjusts example scores |
| `Cargo.toml` | Adds `rust-stemmers` dependency |
| `Cargo.lock` | Locks `rust-stemmers` at version 1.2.0 |
| `CHANGES.md` | Documents breaking changes to probe scores and score totals |
No tests for the new stem function or expand_synonyms function. These are significant new features that should have unit tests to verify:
- That common words stem correctly (e.g., "validating" → "valid", "parsed" → "pars")
- That synonym expansion works as expected (e.g., "validate" expands to include "check", "verify", "lint")
- That synonym expansion only applies to query tokens, not description tokens
- Edge cases like empty input, tokens with no synonyms, multiple synonyms matching
Without tests, it's difficult to verify that the synonym groups contain the correct stemmed forms, or that the synonym expansion logic works correctly.
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn stem_common_words() {
        // Verify that common words are stemmed as expected.
        assert_eq!(stem("validating"), "valid");
        assert_eq!(stem("parsed"), "pars");
    }

    #[test]
    fn expand_synonyms_on_single_token_in_group() {
        // Use the first synonym group to avoid hard-coding specific values.
        let group = SYNONYM_GROUPS[0];
        let token = group[0].to_string();
        let expanded = expand_synonyms(&[token]);
        // All members of the group should be present in the expanded set.
        for &syn in group {
            assert!(
                expanded.contains(syn),
                "expected expanded set to contain synonym '{}'",
                syn
            );
        }
    }

    #[test]
    fn expand_synonyms_empty_input_yields_empty_set() {
        let expanded = expand_synonyms(&[]);
        assert!(expanded.is_empty());
    }

    #[test]
    fn expand_synonyms_token_with_no_synonyms_is_preserved_only() {
        // Choose a token that is very unlikely to appear in any synonym group.
        let token = "uniquetokenwithnosynonyms".to_string();
        let expanded = expand_synonyms(&[token.clone()]);
        assert_eq!(expanded.len(), 1);
        assert!(expanded.contains(token.as_str()));
    }

    #[test]
    fn expand_synonyms_multiple_tokens_in_same_group_do_not_duplicate() {
        let group = SYNONYM_GROUPS[0];
        assert!(
            group.len() >= 2,
            "expected at least two entries in the first synonym group"
        );
        let tokens: Vec<String> = group.iter().take(2).map(|s| s.to_string()).collect();
        let expanded = expand_synonyms(&tokens);
        // All members of the group are present, but there are no duplicates.
        for &syn in group {
            assert!(
                expanded.contains(syn),
                "expected expanded set to contain synonym '{}'",
                syn
            );
        }
    }

    #[test]
    fn tokenize_does_not_apply_synonym_expansion() {
        // Tokenization should only stem and filter stopwords, not expand synonyms.
        // Using "validate"/"validating" as an example to ensure only the stem appears.
        let tokens = tokenize("validating");
        // We expect exactly one stemmed token and no additional synonyms.
        assert_eq!(tokens.len(), 1);
        assert_eq!(tokens[0], stem("validating"));
    }
}
```
> - **Quality (40 points)** — Five semantic lint checks worth 8 points each:
>   third-person description, trigger phrase (`"Use when..."`), gerund name form
>   (`converting-pdfs` not `pdf-converter`), specific (non-generic) name, and
>   description length (≥ 20 words).
Documentation inaccuracy: the description says "description length (≥ 20 words)" but the actual lint check in src/linter.rs:144 is `description.chars().count() < 20 || word_count < 4`, which means the description must have both ≥ 20 characters AND ≥ 4 words.
Update the documentation to accurately describe the check as requiring both a minimum character count (20) and minimum word count (4), not just "≥ 20 words".
```rust
const SYNONYM_GROUPS: &[&[&str]] = &[
    &["valid", "check", "verifi", "lint"],
    &["pars", "extract", "read"],
    &["format", "style", "clean"],
    &["build", "assembl", "compil"],
    &["test", "probe", "evalu"],
    &["creat", "generat", "new"],
    &["fix", "repair", "correct"],
    &["upgrad", "improv", "enhanc"],
    &["document", "describ", "explain"],
    &["delet", "remov", "strip"],
    &["instal", "setup", "configur"],
    &["deploy", "publish", "releas"],
    &["search", "find", "discov", "locat"],
    &["transform", "convert", "process"],
    &["analyz", "inspect", "review"],
];
```
The synonym groups contain manually created "stemmed" forms that may not match the Snowball English stemmer output. This could cause synonym expansion to fail if a query token's actual Snowball stem doesn't match the manually created forms in SYNONYM_GROUPS.
For example, if "validate" is stemmed by the Snowball algorithm but doesn't match the "valid" in the synonym group, the expansion won't work. To ensure correctness, either:
- Pre-stem all words in SYNONYM_GROUPS using the actual Snowball stemmer, or
- Add tests to verify that common word forms correctly match their expected synonyms
Without verification, synonym expansion may silently fail to expand queries as intended.
```diff
 let desc_overlap = if expanded_query.is_empty() {
     0.0
 } else {
-    intersection as f64 / query_set.len() as f64
+    intersection as f64 / expanded_query.len() as f64
 };
```
Inconsistency in denominator for description overlap calculation. The description overlap divides by expanded_query.len() (which includes synonyms), while trigger and name scores divide by the original query_tokens.len(). This creates asymmetric scoring where synonym expansion deflates the description overlap score relative to trigger/name scores.
For example, if a 2-token query expands to 6 tokens via synonyms and only the original 2 tokens appear in the description, the description overlap is 2/6 ≈ 0.33, while the same 2 tokens matching the trigger yield a trigger score of 2/2 = 1.0 — two equally "perfect" matches weighted very differently.
Consider using query_tokens.len() consistently across all three components for more intuitive weighting, or document why the expanded set should be used for description but not for trigger/name.
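
A hedged sketch of the first option: score description overlap against the original query tokens (as trigger/name scoring already does), counting a query token as matched if it or any of its synonyms appears in the description. The names `synonyms_for` and `desc_overlap`, and the single synonym group, are illustrative, not the crate's actual API:

```rust
use std::collections::HashSet;

// Illustrative lookup: return the synonym group containing a token, if any.
fn synonyms_for(token: &str) -> Vec<&'static str> {
    const GROUPS: &[&[&str]] = &[&["valid", "check", "verifi", "lint"]];
    GROUPS
        .iter()
        .find(|g| g.contains(&token))
        .map(|g| g.to_vec())
        .unwrap_or_default()
}

fn desc_overlap(query_tokens: &[&str], desc_tokens: &HashSet<&str>) -> f64 {
    if query_tokens.is_empty() {
        return 0.0;
    }
    let matched = query_tokens
        .iter()
        .filter(|&&t| {
            desc_tokens.contains(t)
                || synonyms_for(t).iter().any(|s| desc_tokens.contains(s))
        })
        .count();
    // Denominator is the ORIGINAL query length, so a query fully covered via
    // synonyms scores 1.0, matching the trigger/name convention.
    matched as f64 / query_tokens.len() as f64
}

fn main() {
    let desc: HashSet<&str> = ["check", "json"].into_iter().collect();
    // "valid" matches via its synonym "check"; "json" matches directly: 2/2.
    assert_eq!(desc_overlap(&["valid", "json"], &desc), 1.0);
}
```

With this shape, synonym expansion can only raise the description score, never dilute it, so all three components share the same intuitive 0-to-1 scale.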
```diff
 fn stem(word: &str) -> String {
-    let w = word.to_lowercase();
-    // Order matters: check longer suffixes first.
-    for suffix in &[
-        "ting", "sing", "zing", "ning", "ring", "ses", "ies", "ing", "ed", "es", "s",
-    ] {
-        if w.len() > suffix.len() + 2 {
-            if let Some(root) = w.strip_suffix(suffix) {
-                return root.to_string();
-            }
-        }
-    }
-    w
+    let stemmer = Stemmer::create(Algorithm::English);
+    stemmer.stem(&word.to_lowercase()).into_owned()
 }
```
Performance issue: A new Stemmer instance is created for every word that needs stemming. The stem function is called inside tokenize, which processes every token in queries, descriptions, and trigger phrases. For a description with 50 words, this creates 50 separate Stemmer instances.
The Stemmer should be created once and reused. Consider using a LazyLock static variable (similar to how PERSON_RE is defined in src/linter.rs) or passing a reference to a shared Stemmer instance. This would significantly reduce overhead, especially when processing multiple skills or running batch operations.
Summary
Replace hand-rolled stemmer with Snowball (`rust-stemmers`), add synonym expansion, graduate trigger/name scores, soften structural scoring, unify trigger phrases across modules.

Closes [TASK] Improve probe/score matching semantics #168
Changes
Matching improvements (`src/tester.rs`):

- Replace the hand-rolled suffix stripper with the Snowball English stemmer
- Add synonym expansion for query tokens
- Graduate trigger/name scores from binary to fractional
- Use shared `TRIGGER_PHRASES` from linter with `contains` (was `starts_with` on 2 phrases)

Scoring improvements (`src/scorer.rs`):

- Soften structural scoring from all-or-nothing to proportional via a `STRUCTURAL_POINTS_PER_CHECK = 10` constant

Shared constants (`src/linter.rs`, `src/cli/upgrade.rs`):

- Make `TRIGGER_PHRASES` `pub` (was private) for cross-module use
- Use shared `TRIGGER_PHRASES` instead of hardcoded "use when"/"use this when"

Documentation (`docs/cli.md`):

- Add limitations sections for probe and score, update scoring descriptions, adjust example scores

Breaking changes: Probe scores and score totals change for existing skills.
Test plan

- `cargo test` — 527 lib + 144 CLI + 27 plugin + 1 doc = 699 passing
- `cargo clippy -- -D warnings` — clean
- `cargo fmt --check` — clean
- `cargo test --test anthropics_skills -- --ignored` — 64 integration tests pass

🤖 Generated with Claude Code