Stas edited this page Mar 7, 2026 · 2 revisions

The skill includes evaluation prompts per the Claude Skills 2.0 framework.

Why Evals Matter

Evals detect when formula changes or model updates cause estimate drift. They are descriptions of desired behaviors, not just test cases.

Available Evals

eval-quick.md

Tests: Quick path produces valid output with minimal input.

  • Does not ask more than 4 questions
  • Auto-assigns task type
  • Produces one-line summary first
  • Includes PERT expected and committed values
  • Shows confidence bands
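The PERT values checked above follow the standard three-point formula; a minimal sketch for reference (the skill's authoritative lookup tables and multipliers live in formulas.md):

```python
def pert_expected(optimistic: float, most_likely: float, pessimistic: float) -> float:
    """Classic PERT weighted average: the most-likely value is weighted 4x."""
    return (optimistic + 4 * most_likely + pessimistic) / 6

def pert_std_dev(optimistic: float, pessimistic: float) -> float:
    """Conventional PERT spread, usable to derive confidence bands."""
    return (pessimistic - optimistic) / 6
```

For a task estimated at 2 / 4 / 12 days, the expected value is 5.0 days with a spread of roughly 1.67 days.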

eval-hybrid.md

Tests: Detailed path with multi-team scaling, confidence levels, and org overhead.

  • All 13 questions handled correctly
  • Multi-human and multi-agent scaling applied
  • Org overhead on human time only
  • Cone of uncertainty spread applied
  • Correct confidence multiplier (90% = 1.8×)
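Two of these behaviors can be sketched in a few lines. Note that only the 90% → 1.8× pair is stated by the eval; the 50% entry and the overhead rate below are illustrative placeholders, not values from formulas.md:

```python
# Only 90 -> 1.8 comes from the eval; 50 -> 1.0 is an assumed placeholder.
CONFIDENCE_MULTIPLIERS = {50: 1.0, 90: 1.8}

def committed_estimate(expected_days: float, confidence_pct: int) -> float:
    """Scale the PERT expected value to a committed figure."""
    return expected_days * CONFIDENCE_MULTIPLIERS[confidence_pct]

def total_days(human_days: float, agent_days: float, org_overhead: float = 0.2) -> float:
    """Org overhead applies to human time only, per eval-hybrid."""
    return human_days * (1 + org_overhead) + agent_days
```

So a 5-day expected estimate at 90% confidence commits to 9 days, and agent time passes through overhead untouched.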

eval-batch.md

Tests: Batch mode with mixed types and dependencies.

  • Processes 7 tasks in batch
  • Auto-assigns task types (infrastructure, coding)
  • Respects dependencies
  • Identifies critical path
  • Summary table + rollup + warnings
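The critical-path check can be verified against a simple longest-path reference. This sketch assumes a hypothetical task table of shape {name: (duration, [dependencies])} — the batch eval's actual task format may differ:

```python
def critical_path_length(tasks: dict) -> float:
    """Longest finish time through the dependency DAG (memoized DFS)."""
    memo = {}

    def finish(name):
        if name not in memo:
            duration, deps = tasks[name]
            memo[name] = duration + max((finish(d) for d in deps), default=0)
        return memo[name]

    return max(finish(t) for t in tasks)
```

For {"a": (3, []), "b": (2, ["a"]), "c": (4, ["a"]), "d": (1, ["b", "c"])}, the critical path is a → c → d with length 8.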

eval-regression.md

Tests: 6 known-good baselines for drift detection.

  • Trivial S task
  • Medium coding task
  • Large data-migration task
  • XL decomposition warning
  • Batch consistency
  • Confidence level comparison (50% vs 90%)

When to Run

Run evals after any change to:

  • formulas.md (lookup tables, multipliers)
  • frameworks.md (round ranges, effectiveness values)
  • SKILL.md (workflow changes)
  • Model version (after an update)

How to Run

Paste the eval prompt into your AI coding client and compare the output against the expected behaviors listed in the eval file.

A regression is any output that:

  • Falls outside the expected ranges by >50%
  • Omits required output fields
  • Changes complexity assignment from baseline
  • Fails to trigger expected warnings
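The first criterion is easy to automate. This sketch reads "outside the expected ranges by >50%" as 50% of the baseline range's width — an interpretation, not a definition; the eval files state the authoritative thresholds:

```python
def outside_range_by_half(actual: float, lo: float, hi: float) -> bool:
    """Flag values that miss the baseline range by more than 50% of its width."""
    width = hi - lo
    return actual < lo - 0.5 * width or actual > hi + 0.5 * width
```

Against a baseline range of 4–6 days, an output of 6.5 passes but an output of 10 is flagged as a regression.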

Full eval files: evals/
