Description
There's currently no way to evaluate whether the Claude Code skill (and other plugin components like commands) actually work well. Ideally there would be an eval-like process: generate test vaults, define expected behaviours for various scenarios, run Claude Code with the skill, and have another instance evaluate the results against those expectations.

## 1. Claude Skill Effectiveness (Agent Evals)
Thoughts from Claude as a starting point...
This is the most complex challenge. The research points to a two-layer approach:
Layer 1: Reasoning evaluation - Did Claude choose the right tool/command?
Layer 2: Action evaluation - Did it accomplish the goal?
Key metrics from Anthropic's approach:
- pass@k: Probability of success in at least one of k attempts (capability)
- pass^k: Probability of success in ALL k attempts (consistency)
These diverge significantly: a skill might score pass@5 = 95% but only pass^5 = 40%. The gap tells you whether the skill *can* work versus whether it *reliably* works.
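As a rough sketch (the trial data and function names here are made up, not part of the plan), both metrics fall out of repeated runs of the same scenario:

```python
# Sketch: estimating pass@k and pass^k from repeated runs of one scenario.
# The trial results below are invented; in practice they'd come from the judge.
from itertools import combinations

def pass_at_k(results: list[bool], k: int) -> float:
    """Fraction of k-trial subsets containing at least one pass (capability)."""
    subsets = list(combinations(results, k))
    return sum(any(s) for s in subsets) / len(subsets)

def pass_hat_k(results: list[bool], k: int) -> float:
    """Fraction of k-trial subsets where every trial passed (consistency)."""
    subsets = list(combinations(results, k))
    return sum(all(s) for s in subsets) / len(subsets)

trials = [True, True, False, True, True, False, True, True, True, False]
print(f"pass@5 ≈ {pass_at_k(trials, 5):.2f}")   # can it work at least once in 5 tries?
print(f"pass^5 ≈ {pass_hat_k(trials, 5):.2f}")  # does it work in all 5 tries?
```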
Practical approach:
Test Suite Structure:

```
├── scenarios/
│   ├── add-task-to-project.yaml
│   ├── weekly-review-flow.yaml
│   └── find-overdue-tasks.yaml
├── vaults/
│   ├── empty-vault/
│   ├── busy-freelancer/
│   └── minimal-setup/
└── expected/
    └── (criteria for each scenario)
```
Each scenario would have (a sketch follows this list):
- Initial vault state (reference to a test vault)
- User prompt (what the user asks)
- Success criteria (what should happen, not deterministic output)
- Evaluation method (LLM-as-judge with binary pass/fail)
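As an illustration of what one scenario file could map to in a harness (the field names and example values are placeholders, not a settled schema):

```python
# Sketch of a scenario record; field names and values are placeholders.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str                    # e.g. the YAML file's stem
    vault: str                   # path to the test vault used as the initial state
    prompt: str                  # what the simulated user asks
    success_criteria: list[str]  # behaviours the judge checks, not exact output

example = Scenario(
    name="add-task-to-project",
    vault="vaults/busy-freelancer",
    prompt="Add a task to the website redesign project: draft the homepage copy.",
    success_criteria=[
        "A task is appended to the website redesign project note",
        "No unrelated notes are modified",
    ],
)
```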
Running the scenarios is the hard part. Options:
- Manual runs + LLM evaluation: Run the scenario manually, paste transcript, have an evaluator LLM score it
- API-based simulation: Call the Anthropic API with the skill as the system prompt and simulate the conversation (see the sketch after this list)
- Programmatic Claude Code invocation: If/when Claude Code exposes a programmatic interface
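As one possible shape for the API-based option (the model name, skill path, and the idea of inlining SKILL.md as a system prompt are assumptions, not how Claude Code actually loads skills):

```python
# Sketch of the API-based simulation: send a scenario prompt with the skill text
# as the system prompt and save the transcript for later judging.
# Model name and paths are placeholders.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

skill_text = Path("skills/gtd/SKILL.md").read_text()   # hypothetical skill path
user_prompt = "What tasks are overdue in my vault?"     # would come from a scenario file

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder; use whichever model you target
    max_tokens=2000,
    system=skill_text,
    messages=[{"role": "user", "content": user_prompt}],
)

Path("transcripts").mkdir(exist_ok=True)
Path("transcripts/find-overdue-tasks.txt").write_text(response.content[0].text)
```

This only simulates a single turn and none of Claude Code's file tools, so it exercises the skill's instructions more than the end-to-end behaviour.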
The research strongly recommends binary scoring (pass/fail) over complex rubrics, and asking the judge to explain reasoning before scoring to improve reliability.
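A minimal judge sketch under those recommendations (the prompt wording and the PASS/FAIL last-line convention are assumptions):

```python
# Sketch of an LLM-as-judge call: reasoning first, then a binary verdict on the last line.
import anthropic

def judge(transcript: str, criteria: list[str]) -> tuple[bool, str]:
    client = anthropic.Anthropic()
    criteria_block = "\n".join(f"- {c}" for c in criteria)
    prompt = (
        "You are evaluating a Claude Code transcript against success criteria.\n\n"
        f"Success criteria:\n{criteria_block}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Explain your reasoning step by step, then on the final line write exactly PASS or FAIL."
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder model name
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    verdict = text.strip().splitlines()[-1].strip().upper()
    return verdict == "PASS", text
```

Keeping the verdict on the last line keeps scoring binary and trivially machine-parseable, while the reasoning above it gives you something to read when a scenario fails.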
Starting point: Create 10-20 scenarios that represent common workflows, run them manually whenever the skill changes, and have Claude (in a separate window) evaluate the transcripts.