Global Tests & Evals for whole system #25

@dannysmith

Description

There is currently no way to evaluate whether the Claude Code skill (and other plugin components, like commands) actually works well. Ideally there would be an eval-like process: generate test vaults, define expected behaviours for various scenarios, run Claude Code with the skill, and have another instance evaluate the results against expectations.

## 1. Claude Skill Effectiveness (Agent Evals)

Thoughts from Claude as a starting point...

This is the most complex challenge. The research points to a two-layer approach:

Layer 1: Reasoning evaluation - Did Claude choose the right tool/command?
Layer 2: Action evaluation - Did it accomplish the goal?
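
To keep the two layers visible in the results, each run could record both verdicts separately, something like this minimal Python sketch (the class and field names are illustrative, not an existing format):

```python
# Minimal sketch: record both evaluation layers per run.
# The class and field names are illustrative only.
from dataclasses import dataclass

@dataclass
class TrialResult:
    scenario: str          # e.g. "add-task-to-project"
    reasoning_pass: bool   # Layer 1: did Claude choose the right tool/command?
    action_pass: bool      # Layer 2: did it accomplish the goal?

    @property
    def passed(self) -> bool:
        # A run only counts as a pass when both layers succeed.
        return self.reasoning_pass and self.action_pass
```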

Key metrics from Anthropic's approach:

  • pass@k: Probability of success in at least one of k attempts (capability)
  • pass^k: Probability of success in ALL k attempts (consistency)

These can diverge significantly: a skill might score pass@5 = 95% but only pass^5 = 40%. The first tells you whether the skill can work; the second tells you whether it reliably works.
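
As a concrete illustration, here is a small Python sketch that computes both metrics from repeated runs (the result data is made up):

```python
# Sketch: compute pass@k and pass^k from repeated runs of each scenario.
# `results` maps a scenario name to the outcomes of k independent attempts;
# the structure and data are made up for illustration.
def pass_at_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where at least one attempt passed (capability)."""
    return sum(any(attempts) for attempts in results.values()) / len(results)

def pass_hat_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where every attempt passed (consistency)."""
    return sum(all(attempts) for attempts in results.values()) / len(results)

results = {
    "add-task-to-project": [True, True, False, True, True],
    "weekly-review-flow":  [True, True, True, True, True],
    "find-overdue-tasks":  [False, True, True, True, False],
}
print(f"pass@5 = {pass_at_k(results):.0%}, pass^5 = {pass_hat_k(results):.0%}")
# pass@5 = 100% but pass^5 = 33%: every scenario *can* succeed,
# but only one succeeds on every attempt.
```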

Practical approach:

Test Suite Structure:
├── scenarios/
│   ├── add-task-to-project.yaml
│   ├── weekly-review-flow.yaml
│   └── find-overdue-tasks.yaml
├── vaults/
│   ├── empty-vault/
│   ├── busy-freelancer/
│   └── minimal-setup/
└── expected/
    └── (criteria for each scenario)

Each scenario would have (a sketch follows the list):

  1. Initial vault state (reference to a test vault)
  2. User prompt (what the user asks)
  3. Success criteria (what should happen, not deterministic output)
  4. Evaluation method (LLM-as-judge with binary pass/fail)
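
A sketch of what one scenario file and its loader could look like; the field names (vault, prompt, success_criteria, evaluation) are assumptions rather than a settled format, and PyYAML is used for parsing:

```python
# Sketch of one scenario definition and a loader for it. The field names are
# assumptions, not a settled format; PyYAML (`pip install pyyaml`) parses it.
import yaml

SCENARIO = """
name: add-task-to-project
vault: vaults/busy-freelancer    # 1. initial vault state
prompt: "Add a task to the Acme redesign project: send the revised quote by Friday"   # 2. what the user asks
success_criteria:                # 3. what should happen, not a deterministic output
  - A new task exists in the Acme redesign project note
  - The task is due the coming Friday
evaluation:                      # 4. how to score it
  method: llm-judge
  scoring: binary                # pass / fail only
"""

scenario = yaml.safe_load(SCENARIO)
print(scenario["name"], "->", scenario["evaluation"]["scoring"])
```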

Running the scenarios is the hard part. Options:

  1. Manual runs + LLM evaluation: Run the scenario manually, paste the transcript, and have an evaluator LLM score it
  2. API-based simulation: Call the Anthropic API with the skill as the system prompt and simulate the conversation (sketched below)
  3. Programmatic Claude Code invocation: If/when Claude Code exposes a programmatic interface
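
Option 2 could look roughly like the sketch below, using the Anthropic Python SDK. It exercises the reasoning layer only (there is no real vault and no real tool execution); the skill path and model id are placeholders:

```python
# Sketch of option 2: drive the Anthropic API directly with the skill's
# instructions as the system prompt. Reasoning layer only; no real vault or
# tool execution. The skill path and model id below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("skills/tasks/SKILL.md") as f:  # hypothetical path to the skill file
    skill_text = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model id
    max_tokens=1024,
    system=skill_text,
    messages=[{"role": "user", "content": "Find all overdue tasks in my vault"}],
)
transcript = response.content[0].text  # feed this to the evaluator step below
```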

The research strongly recommends binary scoring (pass/fail) over complex rubrics, and asking the judge to explain its reasoning before giving a score, which improves reliability.
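
That recommendation maps onto a judge call shaped roughly like this (a sketch; the prompt wording and the PASS/FAIL parsing rule are assumptions):

```python
# Sketch of an LLM-as-judge call: reasoning first, then a binary verdict on
# the final line. Prompt wording and the parsing rule are assumptions.
import anthropic

JUDGE_PROMPT = """You are evaluating a Claude Code session transcript.

Success criteria:
{criteria}

Transcript:
{transcript}

First, explain step by step whether each criterion was met.
Then, on the final line, output exactly PASS or FAIL."""

def judge(client: anthropic.Anthropic, criteria: str, transcript: str) -> bool:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(criteria=criteria, transcript=transcript),
        }],
    )
    # Treat the final line of the judge's reply as the verdict.
    return response.content[0].text.strip().splitlines()[-1].strip() == "PASS"
```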

Starting point: Create 10-20 scenarios that represent common workflows, run them manually whenever the skill changes, and have Claude (in a separate window) evaluate the transcripts.
