Evals
Stas edited this page Mar 7, 2026 · 2 revisions
The skill includes evaluation prompts per the Claude Skills 2.0 framework.
Evals detect when formula changes or model updates cause estimate drift. They are descriptions of desired behaviors, not just test cases.
Tests: Quick path produces valid output with minimal input.
- Does not ask more than 4 questions
- Auto-assigns task type
- Produces one-line summary first
- Includes PERT expected and committed values
- Shows confidence bands
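The PERT expected value and confidence band checked above can be sketched as follows. This is a minimal illustration of the standard three-point PERT formula; the function name, field names, and the one-sigma band are assumptions, not the skill's actual output schema:

```python
def pert_estimate(optimistic: float, likely: float, pessimistic: float) -> dict:
    """Classic PERT three-point estimate (all values in hours)."""
    expected = (optimistic + 4 * likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return {
        "expected": expected,
        # One-sigma band around the expected value (illustrative choice)
        "band_low": expected - std_dev,
        "band_high": expected + std_dev,
    }

# e.g. O=2, M=4, P=10 gives expected = 28/6 ≈ 4.67 h, band ≈ [3.33, 6.0]
print(pert_estimate(2, 4, 10))
```

An eval can then assert that both the expected value and the band endpoints appear in the output and sit in plausible ranges.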
Tests: Detailed path with multi-team, confidence levels, org overhead.
- All 13 questions handled correctly
- Multi-human and multi-agent scaling applied
- Org overhead on human time only
- Cone of uncertainty spread applied
- Correct confidence multiplier (90% = 1.8×)
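The detailed-path adjustments above can be illustrated with a small sketch. Only the 90% = 1.8× multiplier is stated in the evals; the 50% multiplier, the overhead rate, and the function shape are placeholder assumptions:

```python
# Only 0.90 -> 1.8 is stated in the evals; 0.50 -> 1.0 is a placeholder.
CONFIDENCE_MULTIPLIER = {0.50: 1.0, 0.90: 1.8}

def adjusted_estimate(human_hours: float, agent_hours: float,
                      confidence: float, org_overhead: float = 0.20) -> float:
    """Apply org overhead to human time only, then the confidence multiplier."""
    human = human_hours * (1 + org_overhead)   # overhead hits human time only
    total = human + agent_hours                # agent time carries no org overhead
    return total * CONFIDENCE_MULTIPLIER[confidence]

# 90% confidence: (10 * 1.2 + 5) * 1.8 = 30.6 hours
print(adjusted_estimate(10, 5, 0.90))
```

The key behavior the detailed-path eval pins down is ordering: overhead is applied to the human share before the confidence multiplier scales the combined total.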
Tests: Batch mode with mixed types and dependencies.
- Processes 7 tasks in batch
- Auto-assigns task types (infrastructure, coding)
- Respects dependencies
- Identifies critical path
- Summary table + rollup + warnings
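Dependency handling and critical-path identification in batch mode amount to a longest-path walk over the task DAG. A sketch, with made-up task names and durations:

```python
from functools import lru_cache

def critical_path(durations: dict[str, float],
                  deps: dict[str, list[str]]) -> tuple[float, list[str]]:
    """Longest path through the dependency DAG = the batch's critical path."""
    @lru_cache(maxsize=None)
    def finish(task: str) -> float:
        # Earliest finish = own duration + latest finishing dependency
        return durations[task] + max((finish(d) for d in deps.get(task, [])),
                                     default=0.0)

    end = max(durations, key=finish)
    path = [end]
    while deps.get(path[-1]):
        path.append(max(deps[path[-1]], key=finish))
    return finish(end), list(reversed(path))

durations = {"infra": 4, "api": 6, "ui": 3, "tests": 2}
deps = {"api": ["infra"], "ui": ["api"], "tests": ["ui"]}
print(critical_path(durations, deps))  # (15.0, ['infra', 'api', 'ui', 'tests'])
```

The batch eval should report the same chain and rollup regardless of the order tasks are listed in, since the path depends only on the DAG.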
Tests: 6 known-good baselines for drift detection.
- Trivial S task
- Medium coding task
- Large data-migration task
- XL decomposition warning
- Batch consistency
- Confidence level comparison (50% vs 90%)
Run evals after any change to:
- formulas.md (lookup tables, multipliers)
- frameworks.md (round ranges, effectiveness values)
- SKILL.md (workflow changes)
- Model version updates
Paste the eval prompt into your AI coding client and compare the output against the expected behaviors listed in the eval file.
A regression is any output that:
- Falls outside the expected ranges by >50%
- Omits required output fields
- Changes the complexity assignment from baseline
- Fails to trigger expected warnings
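The regression criteria above can be expressed as a simple baseline comparison. The baseline structure and field names are hypothetical, and ">50%" is read here as more than half the range width outside the expected range:

```python
def find_regressions(baseline: dict, current: dict) -> list[str]:
    """Flag drift per the regression criteria above."""
    problems = []
    lo, hi = baseline["range"]
    slack = 0.5 * (hi - lo)  # >50% of range width outside counts as drift
    val = current.get("expected")
    if val is None:
        problems.append("missing required field: expected")
    elif val < lo - slack or val > hi + slack:
        problems.append(f"expected {val} drifted >50% outside [{lo}, {hi}]")
    if current.get("complexity") != baseline["complexity"]:
        problems.append("complexity assignment changed from baseline")
    for w in baseline.get("warnings", []):
        if w not in current.get("warnings", []):
            problems.append(f"expected warning not triggered: {w}")
    return problems

baseline = {"range": (4, 8), "complexity": "M", "warnings": ["decompose"]}
# This output trips all three checks: drift, complexity change, missing warning
print(find_regressions(baseline, {"expected": 15, "complexity": "L", "warnings": []}))
```

Running this after each formula or model change reduces the manual comparison of eval output to reading a short list of flagged problems.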
Full eval files: evals/