Proposal: Automated Evals for Skills
Problem
gstack has 6 skills. Each makes a specific claim about how it changes agent behavior. None of those claims are tested.
The browse binary has integration tests. The skills themselves — the actual product — have nothing. Skill quality is validated by vibes.
What can regress silently today:
- Someone edits plan-ceo-review/SKILL.md, the diff looks reasonable, tests pass, it ships — and the skill quietly stops producing scope-expansion thinking
- /review says it runs a two-pass audit with critical/informational separation — does it? Every time? We don't know
- /ship has a 10-step non-interactive pipeline — if step ordering regresses, the only signal is a user hitting it in production
- /retro claims it computes metrics from git history and detects work sessions — but nothing verifies the output actually contains those analyses
The README itself says: "Passing tests do not mean the branch is safe." Right now, passing tests don't mean the skills work either.
Proposed Solution
Eval-driven skill testing. For each skill, define scripted scenarios, run them through Claude with the skill loaded, and verify the output exhibits the expected behaviors.
Architecture
```
eval scenario (.md)          skill under test           judge / assertions
─────────────────            ────────────────           ──────────────────
Scripted project context  →  Claude runs with skill  →  Did the output exhibit
+ a specific task prompt     loaded as system prompt    the expected behaviors?
```
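The data flowing between those stages could be modeled with a few small types. This is a sketch only; `Scenario`, `BehaviorCheck`, and `EvalResult` are hypothetical names, not existing code:

```typescript
// Hypothetical shapes for the three stages; names are illustrative.
interface Scenario {
  skill: string;   // which skill to load, e.g. "plan-ceo-review"
  context: string; // scripted project context
  task: string;    // the specific task prompt
}

type CheckKind = "structural" | "judge";

interface BehaviorCheck {
  kind: CheckKind;
  description: string; // e.g. 'output contains "NOT in scope" section'
}

interface EvalResult {
  scenario: string;
  passed: BehaviorCheck[];
  failed: BehaviorCheck[];
}

// The runner would produce one EvalResult per scenario file.
const example: EvalResult = {
  scenario: "plan-ceo-review/eval-listing-upload.md",
  passed: [{ kind: "structural", description: 'contains "NOT in scope"' }],
  failed: [],
};
```

Keeping results as plain data like this also makes the timestamped `results/` snapshots trivial to serialize.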
Three layers:
1. Scenario files
Each eval is a .md file that defines:
- A sample project context (fake codebase description, constraints, team situation)
- A specific task prompt (e.g. "add photo upload to the listing app")
- Which skill to invoke
2. Expected behaviors per skill
Each skill gets a behavior spec derived from its SKILL.md and README claims:
| Skill | Key Expected Behaviors |
|---|---|
| /plan-ceo-review | Challenges the literal request. Proposes scope expansion. Asks "what is this product actually for?" Finds the 10-star version. Produces delight opportunities. |
| /plan-eng-review | Produces architecture diagrams. Identifies failure modes. Defines system boundaries. Outputs test matrix. Respects user's chosen scope after Step 0. |
| /review | Two-pass structure (critical vs informational). Catches race conditions, trust boundaries, N+1 queries. Uses AskUserQuestion for critical findings. Does not nitpick style. |
| /ship | Non-interactive except for MINOR/MAJOR bumps. Runs tests before push. Creates PR with summary. Follows correct step ordering. |
| /browse | Starts server if needed. Uses snapshot refs. Takes screenshots for visual verification. Checks console for errors. |
| /retro | Computes metrics from git history. Detects work sessions. Produces tweetable summary. Saves JSON snapshot for trend tracking. |
3. Two evaluation methods, used together
Structural assertions — The output contains required sections, follows required ordering, includes diagrams (for eng-review), follows two-pass structure (for review). Cheap, deterministic, no API cost.
LLM-as-judge — A second Claude call evaluates whether the output genuinely exhibits the claimed behavior. Example: "Did the plan-ceo-review response actually challenge the literal request and propose a more ambitious version?" This catches the spirit, not just the letter.
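The structural layer can be a handful of pure string checks. A minimal sketch, assuming section names come from a per-skill spec (the function name and sample sections here are illustrative, not existing code):

```typescript
// Sketch of a structural assertion pass: cheap, deterministic, no API calls.
// Returns the required sections that are missing from the skill's output.
function checkSections(output: string, required: string[]): string[] {
  const haystack = output.toLowerCase();
  return required.filter((section) => !haystack.includes(section.toLowerCase()));
}

// Illustrative output from a plan-ceo-review run.
const sample = `
## Vision
## NOT in scope
## Delight opportunities
`;

const missing = checkSections(sample, [
  "NOT in scope",
  "Delight opportunities",
  "Diagram",
]);
// Only "Diagram" is absent, so it is the sole failure reported.
```

Because these checks are deterministic, they can run on every commit with zero API cost, which is what makes Phase 1 viable.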
Example: /plan-ceo-review eval
```
scenario: plan-ceo-review/eval-listing-upload.md
├── context: "You're building a Craigslist-style marketplace app..."
├── task: "Add seller photo upload to listings"
├── skill: plan-ceo-review
└── expected behaviors:
    ├── STRUCTURAL: output contains "NOT in scope" section
    ├── STRUCTURAL: output contains at least one diagram
    ├── STRUCTURAL: output contains "delight opportunities"
    ├── JUDGE: response challenges "photo upload" as the literal feature
    ├── JUDGE: response proposes capabilities beyond file picker + save
    ├── JUDGE: response considers the end-user job (listings that sell)
    └── JUDGE: response does NOT just implement the ticket as stated
```
Example: /review eval
```
scenario: review/eval-pr-with-hidden-bugs.md
├── context: "Marketplace app PR adding listing photo enrichment..."
├── task: "Review this branch before merging"
├── skill: review
└── expected behaviors:
    ├── STRUCTURAL: output contains CRITICAL and INFORMATIONAL sections (two-pass)
    ├── STRUCTURAL: uses AskUserQuestion for each critical finding
    ├── STRUCTURAL: does NOT flag style-only issues as critical
    ├── JUDGE: identifies the N+1 query in photo rendering
    ├── JUDGE: catches the trust boundary issue (web data into prompt)
    ├── JUDGE: flags the race condition on cover photo selection
    └── JUDGE: does NOT devolve into nitpicking variable names or formatting
```
Example: /retro eval
```
scenario: retro/eval-weekly-analysis.md
├── context: "Repo with 2 weeks of commit history across multiple contributors..."
├── task: "Run a weekly retrospective"
├── skill: retro
└── expected behaviors:
    ├── STRUCTURAL: output contains metrics table (commits, LOC, test ratio)
    ├── STRUCTURAL: output contains tweetable summary line
    ├── STRUCTURAL: output contains "wins" and "improvements" sections
    ├── JUDGE: correctly identifies work sessions from commit timestamps
    ├── JUDGE: detects hotspot files from change frequency
    ├── JUDGE: narrative reflects the actual commit data, not generic filler
    └── JUDGE: saves JSON snapshot for trend tracking
```
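One way the JUDGE lines could become a second model call: build a rubric prompt per behavior and require a strict first-line verdict. The prompt wording and PASS/FAIL parsing convention below are assumptions for illustration, not settled design:

```typescript
// Sketch: construct the judge prompt for one behavior and parse the verdict.
// The PASS/FAIL first-line convention is an assumption, not an existing spec.
function buildJudgePrompt(behavior: string, skillOutput: string): string {
  return [
    "You are grading the output of an agent skill against one expected behavior.",
    `Expected behavior: ${behavior}`,
    "Reply with exactly PASS or FAIL on the first line, then one sentence of justification.",
    "--- OUTPUT UNDER TEST ---",
    skillOutput,
  ].join("\n");
}

function parseVerdict(judgeReply: string): boolean {
  return judgeReply.trim().toUpperCase().startsWith("PASS");
}

const prompt = buildJudgePrompt(
  "challenges the literal request",
  "Instead of just photo upload, consider what makes a listing sell...",
);
// The actual model call would go through judge.ts via the Anthropic SDK.
```

Forcing a one-token verdict on the first line keeps parsing trivial and makes the pass threshold in the Open Questions section easy to compute.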
Proposed directory structure
```
evals/
├── runner.ts                    # orchestrator
├── judge.ts                     # LLM-as-judge evaluator
├── scenarios/
│   ├── plan-ceo-review/
│   │   ├── listing-upload.md   # scenario definition
│   │   └── behaviors.yaml      # expected behaviors + pass criteria
│   ├── review/
│   │   ├── pr-with-hidden-bugs.md
│   │   └── behaviors.yaml
│   ├── retro/
│   │   ├── weekly-analysis.md
│   │   └── behaviors.yaml
│   └── ...
└── results/                     # timestamped eval results
```
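A behaviors.yaml could pair each check with its evaluation method. The keys and values below are illustrative, not a settled schema:

```yaml
# Hypothetical schema for evals/scenarios/review/behaviors.yaml
skill: review
structural:
  - contains_sections: ["CRITICAL", "INFORMATIONAL"]
  - uses_tool: AskUserQuestion
judge:
  - "identifies the N+1 query in photo rendering"
  - "catches the trust boundary issue (web data into prompt)"
  - "does not devolve into nitpicking variable names or formatting"
pass_threshold: 0.8  # fraction of judge checks that must pass
```

Splitting structural and judge checks at the file level also lets `--judge-tier quick` skip the judge list entirely without parsing logic elsewhere.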
Proposed CLI
```
bun run eval                     # run all skill evals
bun run eval --skill review      # run evals for one skill
bun run eval --judge-tier quick  # structural checks only (no LLM judge)
bun run eval --judge-tier full   # structural + LLM-as-judge
```

Note: /ship already references eval infrastructure in step 4 (EVAL_JUDGE_TIER=full). This proposal builds what that step assumes exists.
Phased Rollout
Phase 1 — Structural evals. No API cost. Fast. Catches gross regressions: skill stops producing diagrams, review loses two-pass structure, retro drops metrics table.
Phase 2 — LLM-as-judge. Catches behavioral regressions: plan-ceo-review stops challenging the literal request, review degrades into a style nitpick pass. Runs in CI when SKILL.md files change.
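Phase 2's CI trigger could be path-filtered so judge costs are only incurred when skill definitions actually change. A hedged GitHub Actions sketch; the workflow layout and the `skills/**/SKILL.md` path are assumptions about this repo:

```yaml
# Sketch: run full evals only when SKILL.md files change.
on:
  pull_request:
    paths:
      - "skills/**/SKILL.md"
jobs:
  skill-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun run eval --judge-tier full
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```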
Phase 3 — Integration with /ship. Wire evals into step 4 so skill changes are gated by eval results before shipping.
Why this matters
Skills are the product. The browse binary is plumbing. Today we test the plumbing and ship the product untested.
Every prompt engineer knows the failure mode: you iterate on a prompt, it improves for your test case, and silently regresses on three others. Evals are the unit tests of prompt engineering. Without them, skill quality is a coin flip on every SKILL.md edit.
Open Questions
- Should scenarios use real codebases (snapshot repos) or synthetic project descriptions?
- What pass threshold makes sense for LLM-as-judge, given that judge verdicts are nondeterministic?
- Should eval results block merges, or just surface as warnings initially?
- Is there a preferred eval framework (Braintrust, promptfoo, custom) or should this be built from scratch with the Anthropic SDK?
Happy to implement this if the direction looks right.