Summary
All calibration data (83+ validated dispatches) was collected with elevated thinking levels:
- Claude Code: high thinking
- Codex: extra high thinking
This is not captured anywhere in the estimation model. Thinking level affects:
- Cost per turn: High thinking uses significantly more tokens per turn (~2x for Opus)
- Turns needed: Higher thinking = fewer iterations (more often correct on the first attempt)
- Duration per turn: Higher thinking = longer turns (~2-3 min vs ~1-1.5 min at default)
- Quality: Consistently Q4-Q5 scores may be partly attributable to elevated thinking
Proposed Changes
1. New --thinking-level CLI parameter
```
ae estimate --thinking-level high "Add user authentication"
ae estimate --thinking-level default "Add user authentication"  # lower cost, more turns
```
Values: `default`, `high`, `extra-high`
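A minimal sketch of how the flag could be wired into the CLI, assuming a Python implementation with `argparse`; the parser structure and the `ae` internals here are hypothetical, not the shipped code:

```python
import argparse

def build_parser():
    # Hypothetical skeleton of the `ae` CLI with an `estimate` subcommand.
    parser = argparse.ArgumentParser(prog="ae")
    sub = parser.add_subparsers(dest="command")
    est = sub.add_parser("estimate")
    est.add_argument(
        "--thinking-level",
        choices=["default", "high", "extra-high"],
        default="high",  # matches the calibration baseline (see "Default behavior")
    )
    est.add_argument("task")
    return parser

args = build_parser().parse_args(
    ["estimate", "--thinking-level", "extra-high", "Add user authentication"]
)
print(args.thinking_level)  # extra-high
```

Using `choices` makes the three levels self-documenting in `--help` output, and the `default="high"` fallback encodes the calibration baseline directly in the parser.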
2. Thinking level modifier in PERT engine
| Thinking Level | Cost/Turn Modifier | Turns Modifier | Net Effect (Cost, Duration) |
|---|---|---|---|
| Default | 1.0x | 1.0x | baseline |
| High | 1.5-2.0x | 0.7-0.8x | ~1.2-1.6x cost, ~0.8x duration |
| Extra High | 2.0-2.5x | 0.6-0.7x | ~1.5-1.8x cost, ~0.7x duration |
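The table's modifiers compose multiplicatively. A sketch of how the PERT engine could apply them, using the midpoints of the proposed ranges; the constants and function names are illustrative, not the actual implementation:

```python
# (cost_per_turn, turns) multipliers: midpoints of the ranges in the table above.
MODIFIERS = {
    "default": (1.0, 1.0),
    "high": (1.75, 0.75),
    "extra-high": (2.25, 0.65),
}

def apply_thinking_level(cost_per_turn, turns, level="high"):
    """Adjust a baseline PERT estimate for the chosen thinking level."""
    cost_mod, turns_mod = MODIFIERS[level]
    adj_turns = turns * turns_mod
    return {
        "turns": adj_turns,
        "cost": cost_per_turn * cost_mod * adj_turns,  # net cost = both modifiers
    }

baseline = apply_thinking_level(0.50, 10, "default")  # $0.50/turn, 10 turns
high = apply_thinking_level(0.50, 10, "high")
# Net cost ratio: 1.75 * 0.75 = 1.3125, inside the table's ~1.2-1.6x band.
```

The net-cost column follows from this composition: higher per-turn cost is partially offset by fewer turns, which is why the net multiplier is lower than the raw cost/turn modifier.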
3. Document in calibration data
Note which thinking level each calibration entry was collected at. Current entries are all high/extra-high.
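One possible shape for the annotation, shown as Python dicts; the field names and sample values are hypothetical, chosen only to illustrate that every entry carries an explicit `thinking_level`:

```python
# Hypothetical calibration-entry schema with the thinking level recorded.
calibration_entries = [
    {"agent": "claude-code", "thinking_level": "high", "turns": 6, "quality": "Q5"},
    {"agent": "codex", "thinking_level": "extra-high", "turns": 4, "quality": "Q4"},
]

# Per the issue, all current entries were collected at elevated levels,
# so this invariant holds for the existing dataset:
all_elevated = all(
    e["thinking_level"] in ("high", "extra-high") for e in calibration_entries
)
```

Recording the level per entry also makes it possible to recalibrate later if data collected at `default` thinking is added.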
4. Default behavior
If --thinking-level is not specified, default to high (matching the calibration baseline). Flag the assumption in the output: "Estimated at [thinking level] — calibration baseline."
Context
The Iris /estimate skill has been updated to document this gap. The agents.md memory file now records default thinking levels per agent. This issue tracks the product-side implementation.
Priority
P2 — important for accuracy but existing estimates are implicitly correct for high/extra-high users (which is our primary use case). Becomes P1 if/when external users with different thinking configurations report estimate drift.