Global Tests & Evals for whole system #25

@dannysmith

Description

There is currently no way to evaluate whether the Claude Code skill (and other plugin components, like commands) actually works well. Ideally there would be an eval-like process: generate test vaults, define expected behaviours for various scenarios, run Claude Code with the skill, and have another instance evaluate the results against expectations.

## 1. Claude Skill Effectiveness (Agent Evals)

Thoughts from Claude as a starting point...

This is the most complex challenge. The research points to a two-layer approach:

Layer 1: Reasoning evaluation - Did Claude choose the right tool/command?
Layer 2: Action evaluation - Did it accomplish the goal?
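
To keep the two layers visible in the results, each run could record both verdicts separately, something like this minimal Python sketch (the class and field names are illustrative, not an existing format):

```python
# Minimal sketch: record both evaluation layers per run.
# The class and field names are illustrative only.
from dataclasses import dataclass

@dataclass
class TrialResult:
    scenario: str          # e.g. "add-task-to-project"
    reasoning_pass: bool   # Layer 1: did Claude choose the right tool/command?
    action_pass: bool      # Layer 2: did it accomplish the goal?

    @property
    def passed(self) -> bool:
        # A run only counts as a pass when both layers succeed.
        return self.reasoning_pass and self.action_pass
```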

Key metrics from Anthropic's approach:

  • pass@k: Probability of success in at least one of k attempts (capability)
  • pass^k: Probability of success in ALL k attempts (consistency)

These can diverge significantly: a skill might score pass@5 = 95% but only pass^5 = 40%. The first tells you whether the skill can work; the second tells you whether it reliably works.
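
As a concrete illustration, here is a small Python sketch that computes both metrics from repeated runs (the result data is made up):

```python
# Sketch: compute pass@k and pass^k from repeated runs of each scenario.
# `results` maps a scenario name to the outcomes of k independent attempts;
# the structure and data are made up for illustration.
def pass_at_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where at least one attempt passed (capability)."""
    return sum(any(attempts) for attempts in results.values()) / len(results)

def pass_hat_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where every attempt passed (consistency)."""
    return sum(all(attempts) for attempts in results.values()) / len(results)

results = {
    "add-task-to-project": [True, True, False, True, True],
    "weekly-review-flow":  [True, True, True, True, True],
    "find-overdue-tasks":  [False, True, True, True, False],
}
print(f"pass@5 = {pass_at_k(results):.0%}, pass^5 = {pass_hat_k(results):.0%}")
# pass@5 = 100% but pass^5 = 33%: every scenario *can* succeed,
# but only one succeeds on every attempt.
```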

Practical approach:

Test Suite Structure:
├── scenarios/
│   ├── add-task-to-project.yaml
│   ├── weekly-review-flow.yaml
│   └── find-overdue-tasks.yaml
├── vaults/
│   ├── empty-vault/
│   ├── busy-freelancer/
│   └── minimal-setup/
└── expected/
    └── (criteria for each scenario)

Each scenario would have (a sketch follows the list):

  1. Initial vault state (reference to a test vault)
  2. User prompt (what the user asks)
  3. Success criteria (what should happen, not deterministic output)
  4. Evaluation method (LLM-as-judge with binary pass/fail)
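
A sketch of what one scenario file and its loader could look like; the field names (vault, prompt, success_criteria, evaluation) are assumptions rather than a settled format, and PyYAML is used for parsing:

```python
# Sketch of one scenario definition and a loader for it. The field names are
# assumptions, not a settled format; PyYAML (`pip install pyyaml`) parses it.
import yaml

SCENARIO = """
name: add-task-to-project
vault: vaults/busy-freelancer    # 1. initial vault state
prompt: "Add a task to the Acme redesign project: send the revised quote by Friday"   # 2. what the user asks
success_criteria:                # 3. what should happen, not a deterministic output
  - A new task exists in the Acme redesign project note
  - The task is due the coming Friday
evaluation:                      # 4. how to score it
  method: llm-judge
  scoring: binary                # pass / fail only
"""

scenario = yaml.safe_load(SCENARIO)
print(scenario["name"], "->", scenario["evaluation"]["scoring"])
```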

Running the scenarios is the hard part. Options:

  1. Manual runs + LLM evaluation: Run the scenario manually, paste the transcript, and have an evaluator LLM score it
  2. API-based simulation: Call the Anthropic API with the skill as the system prompt and simulate the conversation (sketched below)
  3. Programmatic Claude Code invocation: If/when Claude Code exposes a programmatic interface
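
Option 2 could look roughly like the sketch below, using the Anthropic Python SDK. It exercises the reasoning layer only (there is no real vault and no real tool execution); the skill path and model id are placeholders:

```python
# Sketch of option 2: drive the Anthropic API directly with the skill's
# instructions as the system prompt. Reasoning layer only; no real vault or
# tool execution. The skill path and model id below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("skills/tasks/SKILL.md") as f:  # hypothetical path to the skill file
    skill_text = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model id
    max_tokens=1024,
    system=skill_text,
    messages=[{"role": "user", "content": "Find all overdue tasks in my vault"}],
)
transcript = response.content[0].text  # feed this to the evaluator step below
```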

The research strongly recommends binary scoring (pass/fail) over complex rubrics, and asking the judge to explain its reasoning before giving a score, which improves reliability.
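
That recommendation maps onto a judge call shaped roughly like this (a sketch; the prompt wording and the PASS/FAIL parsing rule are assumptions):

```python
# Sketch of an LLM-as-judge call: reasoning first, then a binary verdict on
# the final line. Prompt wording and the parsing rule are assumptions.
import anthropic

JUDGE_PROMPT = """You are evaluating a Claude Code session transcript.

Success criteria:
{criteria}

Transcript:
{transcript}

First, explain step by step whether each criterion was met.
Then, on the final line, output exactly PASS or FAIL."""

def judge(client: anthropic.Anthropic, criteria: str, transcript: str) -> bool:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(criteria=criteria, transcript=transcript),
        }],
    )
    # Treat the final line of the judge's reply as the verdict.
    return response.content[0].text.strip().splitlines()[-1].strip() == "PASS"
```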

Starting point: Create 10-20 scenarios that represent common workflows, run them manually whenever the skill changes, and have Claude (in a separate window) evaluate the transcripts.
