Proposal: Automated Evals for Skills #24

@soviero


Problem

gstack has 6 skills. Each makes a specific claim about how it changes agent behavior. None of those claims are tested.

The browse binary has integration tests. The skills themselves — the actual product — have nothing. Skill quality is validated by vibes.

What can regress silently today:

  • Someone edits plan-ceo-review/SKILL.md, the diff looks reasonable, tests pass, it ships — and the skill quietly stops producing scope expansion thinking
  • /review says it runs a two-pass audit with critical/informational separation — does it? Every time? We don't know
  • /ship has a 10-step non-interactive pipeline — if step ordering regresses, the only signal is a user hitting it in production
  • /retro claims it computes metrics from git history and detects work sessions — but nothing verifies the output actually contains those analyses

The README itself says: "Passing tests do not mean the branch is safe." Right now, passing tests don't mean the skills work either.

Proposed Solution

Eval-driven skill testing. For each skill, define scripted scenarios, run them through Claude with the skill loaded, and verify the output exhibits the expected behaviors.

Architecture

eval scenario (.md)          skill under test          judge / assertions
─────────────────          ────────────────          ──────────────────
Scripted project context  →  Claude runs with skill  →  Did the output exhibit
+ a specific task prompt     loaded as system prompt     the expected behaviors?

Three layers:

1. Scenario files

Each eval is a .md file that defines (see the sketch after this list):

  • A sample project context (fake codebase description, constraints, team situation)
  • A specific task prompt (e.g. "add photo upload to the listing app")
  • Which skill to invoke
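A minimal sketch of one such scenario file, assuming YAML frontmatter carries the skill pointer (the field names are illustrative, not a settled format):

```md
---
skill: plan-ceo-review
---

## Context

You're building a Craigslist-style marketplace app. Two-person team,
Bun + TypeScript, listings are the core object. No photo support today.

## Task

Add seller photo upload to listings.
```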

2. Expected behaviors per skill

Each skill gets a behavior spec derived from its SKILL.md and README claims (one possible behaviors.yaml shape is sketched after the table):

| Skill | Key Expected Behaviors |
| --- | --- |
| /plan-ceo-review | Challenges the literal request. Proposes scope expansion. Asks "what is this product actually for?" Finds the 10-star version. Produces delight opportunities. |
| /plan-eng-review | Produces architecture diagrams. Identifies failure modes. Defines system boundaries. Outputs test matrix. Respects user's chosen scope after Step 0. |
| /review | Two-pass structure (critical vs informational). Catches race conditions, trust boundaries, N+1 queries. Uses AskUserQuestion for critical findings. Does not nitpick style. |
| /ship | Non-interactive except for MINOR/MAJOR bumps. Runs tests before push. Creates PR with summary. Follows correct step ordering. |
| /browse | Starts server if needed. Uses snapshot refs. Takes screenshots for visual verification. Checks console for errors. |
| /retro | Computes metrics from git history. Detects work sessions. Produces tweetable summary. Saves JSON snapshot for trend tracking. |
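To make these specs machine-checkable, each row could be transcribed into the per-skill behaviors.yaml proposed in the directory structure below. One possible shape (all field names illustrative):

```yaml
# scenarios/plan-ceo-review/behaviors.yaml (illustrative shape)
skill: plan-ceo-review
structural:
  - section: "NOT in scope"
  - section: "delight opportunities"
  - diagram: true
judge:
  - "Challenges the literal request rather than implementing it as stated"
  - "Proposes capabilities beyond file picker + save"
pass_threshold: 0.75  # fraction of judge checks that must pass
```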

3. Two evaluation methods, used together

Structural assertions — The output contains required sections, follows required ordering, includes diagrams (for eng-review), follows two-pass structure (for review). Cheap, deterministic, no API cost.
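Structural checks can be plain string/regex assertions over the transcript. A sketch (function names and the diagram heuristic are illustrative):

```ts
// evals/structural.ts — sketch of deterministic checks (names illustrative)
export interface CheckResult {
  check: string;
  passed: boolean;
}

// Required sections must appear, and in the given order.
export function checkSections(output: string, sections: string[]): CheckResult[] {
  let cursor = 0;
  return sections.map((section) => {
    const idx = output.indexOf(section, cursor);
    if (idx !== -1) cursor = idx + section.length;
    return { check: `contains "${section}" in order`, passed: idx !== -1 };
  });
}

// Crude diagram heuristic: box-drawing characters or a mermaid block.
export function containsDiagram(output: string): boolean {
  return /mermaid|[┌┐└┘├┤│]/.test(output);
}
```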

LLM-as-judge — A second Claude call evaluates whether the output genuinely exhibits the claimed behavior. Example: "Did the plan-ceo-review response actually challenge the literal request and propose a more ambitious version?" This catches the spirit, not just the letter.
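A judge is just a second, cheap model call with a yes/no rubric. A sketch against the Anthropic TypeScript SDK (the model name and prompt wording are assumptions, not settled choices):

```ts
// evals/judge.ts — LLM-as-judge sketch (model choice is an assumption)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function judge(output: string, behavior: string): Promise<boolean> {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-5", // assumption: any current mid-tier model
    max_tokens: 8,
    messages: [{
      role: "user",
      content:
        `Expected behavior: ${behavior}\n\n` +
        `Skill output:\n${output}\n\n` +
        `Does the output genuinely exhibit this behavior? Reply PASS or FAIL only.`,
    }],
  });
  const block = msg.content[0];
  return block.type === "text" && block.text.trim().toUpperCase().startsWith("PASS");
}
```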

Example: /plan-ceo-review eval

scenario: plan-ceo-review/eval-listing-upload.md
├── context: "You're building a Craigslist-style marketplace app..."
├── task: "Add seller photo upload to listings"
├── skill: plan-ceo-review
└── expected behaviors:
    ├── STRUCTURAL: output contains "NOT in scope" section
    ├── STRUCTURAL: output contains at least one diagram
    ├── STRUCTURAL: output contains "delight opportunities"
    ├── JUDGE: response challenges "photo upload" as the literal feature
    ├── JUDGE: response proposes capabilities beyond file picker + save
    ├── JUDGE: response considers the end-user job (listings that sell)
    └── JUDGE: response does NOT just implement the ticket as stated

Example: /review eval

scenario: review/eval-pr-with-hidden-bugs.md
├── context: "Marketplace app PR adding listing photo enrichment..."
├── task: "Review this branch before merging"
├── skill: review
└── expected behaviors:
    ├── STRUCTURAL: output contains CRITICAL and INFORMATIONAL sections (two-pass)
    ├── STRUCTURAL: uses AskUserQuestion for each critical finding
    ├── STRUCTURAL: does NOT flag style-only issues as critical
    ├── JUDGE: identifies the N+1 query in photo rendering
    ├── JUDGE: catches the trust boundary issue (web data into prompt)
    ├── JUDGE: flags the race condition on cover photo selection
    └── JUDGE: does NOT devolve into nitpicking variable names or formatting
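The "hidden bugs" would be seeded deliberately in the scenario branch so the judge checks have ground truth. The N+1, for instance, could look like this (types and helpers here are hypothetical fixture code):

```ts
// Fixture sketch: deliberately seeded N+1 the review eval expects to be flagged.
async function renderListings(db: Database, listings: Listing[]) {
  for (const listing of listings) {
    // N+1: one query per listing instead of a single batched fetch
    const photos = await db.all(
      "SELECT * FROM photos WHERE listing_id = ?",
      listing.id,
    );
    render(listing, photos);
  }
}
```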

Example: /retro eval

scenario: retro/eval-weekly-analysis.md
├── context: "Repo with 2 weeks of commit history across multiple contributors..."
├── task: "Run a weekly retrospective"
├── skill: retro
└── expected behaviors:
    ├── STRUCTURAL: output contains metrics table (commits, LOC, test ratio)
    ├── STRUCTURAL: output contains tweetable summary line
    ├── STRUCTURAL: output contains "wins" and "improvements" sections
    ├── JUDGE: correctly identifies work sessions from commit timestamps
    ├── JUDGE: detects hotspot files from change frequency
    ├── JUDGE: narrative reflects the actual commit data, not generic filler
    └── JUDGE: saves JSON snapshot for trend tracking

Proposed directory structure

evals/
├── runner.ts                     # orchestrator
├── judge.ts                      # LLM-as-judge evaluator
├── scenarios/
│   ├── plan-ceo-review/
│   │   ├── listing-upload.md     # scenario definition
│   │   └── behaviors.yaml        # expected behaviors + pass criteria
│   ├── review/
│   │   ├── pr-with-hidden-bugs.md
│   │   └── behaviors.yaml
│   ├── retro/
│   │   ├── weekly-analysis.md
│   │   └── behaviors.yaml
│   └── ...
└── results/                      # timestamped eval results
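runner.ts could stay small: walk scenarios/, produce the skill output once per scenario, then apply both check layers. A rough sketch (loadScenario, loadBehaviors, runSkill, structural, and judge are hypothetical helpers):

```ts
// evals/runner.ts — orchestration sketch; all "./lib" helpers are hypothetical
import { readdir } from "node:fs/promises";
import { loadScenario, loadBehaviors, runSkill, structural, judge } from "./lib";

const fullJudge = process.env.EVAL_JUDGE_TIER === "full";

for (const skill of await readdir("evals/scenarios")) {
  const behaviors = await loadBehaviors(`evals/scenarios/${skill}/behaviors.yaml`);
  for (const file of await readdir(`evals/scenarios/${skill}`)) {
    if (!file.endsWith(".md")) continue;
    const scenario = await loadScenario(`evals/scenarios/${skill}/${file}`);
    const output = await runSkill(skill, scenario); // Claude with the skill loaded
    const results = structural(output, behaviors.structural); // cheap layer first
    if (fullJudge) {
      for (const b of behaviors.judge) {
        results.push({ check: b, passed: await judge(output, b) });
      }
    }
    // TODO: write results to evals/results/<timestamp>.json and report failures
  }
}
```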

Proposed CLI

bun run eval                      # run all skill evals
bun run eval --skill review       # run evals for one skill
bun run eval --judge-tier quick   # structural checks only (no LLM judge)
bun run eval --judge-tier full    # structural + LLM-as-judge

Note: /ship already references eval infrastructure in step 4 (EVAL_JUDGE_TIER=full). This proposal builds what that step assumes exists.
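The `bun run eval` entry point would presumably be wired up as a package.json script, something like:

```json
{
  "scripts": {
    "eval": "bun evals/runner.ts"
  }
}
```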

Phased Rollout

Phase 1 — Structural evals. No API cost. Fast. Catches gross regressions: skill stops producing diagrams, review loses two-pass structure, retro drops metrics table.

Phase 2 — LLM-as-judge. Catches behavioral regressions: plan-ceo-review stops challenging the literal request, review degrades into a style nitpick pass. Runs in CI when SKILL.md files change.
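The CI trigger from Phase 2 could be a simple path filter so the job only runs when a skill definition changes; a GitHub Actions sketch (workflow and secret names are illustrative):

```yaml
# .github/workflows/skill-evals.yml (illustrative)
name: skill-evals
on:
  pull_request:
    paths:
      - "**/SKILL.md"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bun run eval --judge-tier full
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```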

Phase 3 — Integration with /ship. Wire evals into step 4 so skill changes are gated by eval results before shipping.

Why this matters

Skills are the product. The browse binary is plumbing. Today we test the plumbing and ship the product untested.

Every prompt engineer knows the failure mode: you iterate on a prompt, it improves for your test case, and silently regresses on three others. Evals are the unit tests of prompt engineering. Without them, skill quality is a coin flip on every SKILL.md edit.

Open Questions

  • Should scenarios use real codebases (snapshot repos) or synthetic project descriptions?
  • What pass threshold makes sense for LLM-as-judge, given that judge verdicts are not fully deterministic?
  • Should eval results block merges, or just surface as warnings initially?
  • Is there a preferred eval framework (Braintrust, promptfoo, custom) or should this be built from scratch with the Anthropic SDK?

Happy to implement this if the direction looks right.
