Proposal: Automated Evals for Skills #24

@soviero


Problem

gstack has 6 skills. Each makes a specific claim about how it changes agent behavior. None of those claims are tested.

The browse binary has integration tests. The skills themselves — the actual product — have nothing. Skill quality is validated by vibes.

What can regress silently today:

  • Someone edits plan-ceo-review/SKILL.md, the diff looks reasonable, tests pass, it ships — and the skill quietly stops producing scope expansion thinking
  • /review says it runs a two-pass audit with critical/informational separation — does it? Every time? We don't know
  • /ship has a 10-step non-interactive pipeline — if step ordering regresses, the only signal is a user hitting it in production
  • /retro claims it computes metrics from git history and detects work sessions — but nothing verifies the output actually contains those analyses

The README itself says: "Passing tests do not mean the branch is safe." Right now, passing tests don't mean the skills work either.

Proposed Solution

Eval-driven skill testing. For each skill, define scripted scenarios, run them through Claude with the skill loaded, and verify the output exhibits the expected behaviors.

Architecture

eval scenario (.md)          skill under test          judge / assertions
─────────────────          ────────────────          ──────────────────
Scripted project context  →  Claude runs with skill  →  Did the output exhibit
+ a specific task prompt     loaded as system prompt     the expected behaviors?

Three layers:

1. Scenario files

Each eval is a .md file that defines (see the sketch after this list):

  • A sample project context (fake codebase description, constraints, team situation)
  • A specific task prompt (e.g. "add photo upload to the listing app")
  • Which skill to invoke
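A minimal sketch of one such scenario file, assuming YAML frontmatter carries the skill pointer (the field names are illustrative, not a settled format):

```md
---
skill: plan-ceo-review
---

## Context

You're building a Craigslist-style marketplace app. Two-person team,
Bun + TypeScript, listings are the core object. No photo support today.

## Task

Add seller photo upload to listings.
```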

2. Expected behaviors per skill

Each skill gets a behavior spec derived from its SKILL.md and README claims (one possible behaviors.yaml shape is sketched after the table):

| Skill | Key Expected Behaviors |
| --- | --- |
| /plan-ceo-review | Challenges the literal request. Proposes scope expansion. Asks "what is this product actually for?" Finds the 10-star version. Produces delight opportunities. |
| /plan-eng-review | Produces architecture diagrams. Identifies failure modes. Defines system boundaries. Outputs test matrix. Respects user's chosen scope after Step 0. |
| /review | Two-pass structure (critical vs informational). Catches race conditions, trust boundaries, N+1 queries. Uses AskUserQuestion for critical findings. Does not nitpick style. |
| /ship | Non-interactive except for MINOR/MAJOR bumps. Runs tests before push. Creates PR with summary. Follows correct step ordering. |
| /browse | Starts server if needed. Uses snapshot refs. Takes screenshots for visual verification. Checks console for errors. |
| /retro | Computes metrics from git history. Detects work sessions. Produces tweetable summary. Saves JSON snapshot for trend tracking. |
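To make these specs machine-checkable, each row could be transcribed into the per-skill behaviors.yaml proposed in the directory structure below. One possible shape (all field names illustrative):

```yaml
# scenarios/plan-ceo-review/behaviors.yaml (illustrative shape)
skill: plan-ceo-review
structural:
  - section: "NOT in scope"
  - section: "delight opportunities"
  - diagram: true
judge:
  - "Challenges the literal request rather than implementing it as stated"
  - "Proposes capabilities beyond file picker + save"
pass_threshold: 0.75  # fraction of judge checks that must pass
```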

3. Two evaluation methods, used together

Structural assertions — The output contains required sections, follows required ordering, includes diagrams (for eng-review), follows two-pass structure (for review). Cheap, deterministic, no API cost.
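Structural checks can be plain string/regex assertions over the transcript. A sketch (function names and the diagram heuristic are illustrative):

```ts
// evals/structural.ts — sketch of deterministic checks (names illustrative)
export interface CheckResult {
  check: string;
  passed: boolean;
}

// Required sections must appear, and in the given order.
export function checkSections(output: string, sections: string[]): CheckResult[] {
  let cursor = 0;
  return sections.map((section) => {
    const idx = output.indexOf(section, cursor);
    if (idx !== -1) cursor = idx + section.length;
    return { check: `contains "${section}" in order`, passed: idx !== -1 };
  });
}

// Crude diagram heuristic: box-drawing characters or a mermaid block.
export function containsDiagram(output: string): boolean {
  return /mermaid|[┌┐└┘├┤│]/.test(output);
}
```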

LLM-as-judge — A second Claude call evaluates whether the output genuinely exhibits the claimed behavior. Example: "Did the plan-ceo-review response actually challenge the literal request and propose a more ambitious version?" This catches the spirit, not just the letter.
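A judge is just a second, cheap model call with a yes/no rubric. A sketch against the Anthropic TypeScript SDK (the model name and prompt wording are assumptions, not settled choices):

```ts
// evals/judge.ts — LLM-as-judge sketch (model choice is an assumption)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function judge(output: string, behavior: string): Promise<boolean> {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-5", // assumption: any current mid-tier model
    max_tokens: 8,
    messages: [{
      role: "user",
      content:
        `Expected behavior: ${behavior}\n\n` +
        `Skill output:\n${output}\n\n` +
        `Does the output genuinely exhibit this behavior? Reply PASS or FAIL only.`,
    }],
  });
  const block = msg.content[0];
  return block.type === "text" && block.text.trim().toUpperCase().startsWith("PASS");
}
```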

Example: /plan-ceo-review eval

scenario: plan-ceo-review/eval-listing-upload.md
├── context: "You're building a Craigslist-style marketplace app..."
├── task: "Add seller photo upload to listings"
├── skill: plan-ceo-review
└── expected behaviors:
    ├── STRUCTURAL: output contains "NOT in scope" section
    ├── STRUCTURAL: output contains at least one diagram
    ├── STRUCTURAL: output contains "delight opportunities"
    ├── JUDGE: response challenges "photo upload" as the literal feature
    ├── JUDGE: response proposes capabilities beyond file picker + save
    ├── JUDGE: response considers the end-user job (listings that sell)
    └── JUDGE: response does NOT just implement the ticket as stated

Example: /review eval

scenario: review/eval-pr-with-hidden-bugs.md
├── context: "Marketplace app PR adding listing photo enrichment..."
├── task: "Review this branch before merging"
├── skill: review
└── expected behaviors:
    ├── STRUCTURAL: output contains CRITICAL and INFORMATIONAL sections (two-pass)
    ├── STRUCTURAL: uses AskUserQuestion for each critical finding
    ├── STRUCTURAL: does NOT flag style-only issues as critical
    ├── JUDGE: identifies the N+1 query in photo rendering
    ├── JUDGE: catches the trust boundary issue (web data into prompt)
    ├── JUDGE: flags the race condition on cover photo selection
    └── JUDGE: does NOT devolve into nitpicking variable names or formatting
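The "hidden bugs" would be seeded deliberately in the scenario branch so the judge checks have ground truth. The N+1, for instance, could look like this (types and helpers here are hypothetical fixture code):

```ts
// Fixture sketch: deliberately seeded N+1 the review eval expects to be flagged.
async function renderListings(db: Database, listings: Listing[]) {
  for (const listing of listings) {
    // N+1: one query per listing instead of a single batched fetch
    const photos = await db.all(
      "SELECT * FROM photos WHERE listing_id = ?",
      listing.id,
    );
    render(listing, photos);
  }
}
```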

Example: /retro eval

scenario: retro/eval-weekly-analysis.md
├── context: "Repo with 2 weeks of commit history across multiple contributors..."
├── task: "Run a weekly retrospective"
├── skill: retro
└── expected behaviors:
    ├── STRUCTURAL: output contains metrics table (commits, LOC, test ratio)
    ├── STRUCTURAL: output contains tweetable summary line
    ├── STRUCTURAL: output contains "wins" and "improvements" sections
    ├── JUDGE: correctly identifies work sessions from commit timestamps
    ├── JUDGE: detects hotspot files from change frequency
    ├── JUDGE: narrative reflects the actual commit data, not generic filler
    └── JUDGE: saves JSON snapshot for trend tracking

Proposed directory structure

evals/
├── runner.ts                     # orchestrator
├── judge.ts                      # LLM-as-judge evaluator
├── scenarios/
│   ├── plan-ceo-review/
│   │   ├── listing-upload.md     # scenario definition
│   │   └── behaviors.yaml        # expected behaviors + pass criteria
│   ├── review/
│   │   ├── pr-with-hidden-bugs.md
│   │   └── behaviors.yaml
│   ├── retro/
│   │   ├── weekly-analysis.md
│   │   └── behaviors.yaml
│   └── ...
└── results/                      # timestamped eval results
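runner.ts could stay small: walk scenarios/, produce the skill output once per scenario, then apply both check layers. A rough sketch (loadScenario, loadBehaviors, runSkill, structural, and judge are hypothetical helpers):

```ts
// evals/runner.ts — orchestration sketch; all "./lib" helpers are hypothetical
import { readdir } from "node:fs/promises";
import { loadScenario, loadBehaviors, runSkill, structural, judge } from "./lib";

const fullJudge = process.env.EVAL_JUDGE_TIER === "full";

for (const skill of await readdir("evals/scenarios")) {
  const behaviors = await loadBehaviors(`evals/scenarios/${skill}/behaviors.yaml`);
  for (const file of await readdir(`evals/scenarios/${skill}`)) {
    if (!file.endsWith(".md")) continue;
    const scenario = await loadScenario(`evals/scenarios/${skill}/${file}`);
    const output = await runSkill(skill, scenario); // Claude with the skill loaded
    const results = structural(output, behaviors.structural); // cheap layer first
    if (fullJudge) {
      for (const b of behaviors.judge) {
        results.push({ check: b, passed: await judge(output, b) });
      }
    }
    // TODO: write results to evals/results/<timestamp>.json and report failures
  }
}
```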

Proposed CLI

bun run eval                      # run all skill evals
bun run eval --skill review       # run evals for one skill
bun run eval --judge-tier quick   # structural checks only (no LLM judge)
bun run eval --judge-tier full    # structural + LLM-as-judge

Note: /ship already references eval infrastructure in step 4 (EVAL_JUDGE_TIER=full). This proposal builds what that step assumes exists.
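The `bun run eval` entry point would presumably be wired up as a package.json script, something like:

```json
{
  "scripts": {
    "eval": "bun evals/runner.ts"
  }
}
```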

Phased Rollout

Phase 1 — Structural evals. No API cost. Fast. Catches gross regressions: skill stops producing diagrams, review loses two-pass structure, retro drops metrics table.

Phase 2 — LLM-as-judge. Catches behavioral regressions: plan-ceo-review stops challenging the literal request, review degrades into a style nitpick pass. Runs in CI when SKILL.md files change.
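The CI trigger from Phase 2 could be a simple path filter so the job only runs when a skill definition changes; a GitHub Actions sketch (workflow and secret names are illustrative):

```yaml
# .github/workflows/skill-evals.yml (illustrative)
name: skill-evals
on:
  pull_request:
    paths:
      - "**/SKILL.md"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bun run eval --judge-tier full
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```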

Phase 3 — Integration with /ship. Wire evals into step 4 so skill changes are gated by eval results before shipping.

Why this matters

Skills are the product. The browse binary is plumbing. Today we test the plumbing and ship the product untested.

Every prompt engineer knows the failure mode: you iterate on a prompt, it improves for your test case, and silently regresses on three others. Evals are the unit tests of prompt engineering. Without them, skill quality is a coin flip on every SKILL.md edit.

Open Questions

  • Should scenarios use real codebases (snapshot repos) or synthetic project descriptions?
  • What pass threshold makes sense for LLM-as-judge, given that judge verdicts are not fully deterministic?
  • Should eval results block merges, or just surface as warnings initially?
  • Is there a preferred eval framework (Braintrust, promptfoo, custom) or should this be built from scratch with the Anthropic SDK?

Happy to implement this if the direction looks right.
