# Deep Validation

*Enreign edited this page Mar 12, 2026*
All formula parameters have been validated against 110k+ data points across 13 datasets using bootstrap confidence intervals (n=1000, seed=42) and KS goodness-of-fit tests.
53 parameters audited across 11 analyses. 22 adjusted, 7 flagged, 24 confirmed.
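The bootstrap procedure described above can be sketched in a few lines. This is an illustration of the method (a percentile bootstrap on the median, with n=1000 resamples and seed 42 matching the setup on this page), not the repository's actual validation code, and the overrun ratios are made-up toy values:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a statistic: resample with replacement,
    recompute the statistic, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy overrun ratios (actual / estimate) -- illustrative, not dataset rows.
ratios = [1.1, 0.9, 1.4, 2.0, 0.8, 1.2, 1.6, 1.0, 1.3, 0.95]
lo, hi = bootstrap_ci(ratios)
print(f"median overrun, 95% bootstrap CI: [{lo:.2f}, {hi:.2f}]")
```

A parameter is "confirmed" when its current value falls inside the interval, and "adjusted" or "flagged" when it falls outside.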
## Analyses

| # | Analysis | Data | What It Tests |
|---|---|---|---|
| 1 | Distribution Fitting | CESAW + SiP + Renzo (84k) | PERT-beta vs log-normal vs normal |
| 2 | Optimal Confidence Multipliers | 84k estimate-actual pairs | Size-dependent 50%/80%/90% multipliers |
| 3 | Review Time Regression | Project-22 (290 reviews) | Review minutes by size band |
| 4 | Task Type Multipliers | SiP (12k) + CESAW (61k) | Overrun ratios by task category |
| 5 | Human Fix Ratio Proxy | Project-22 (1156 branches) | Rework from review cycles |
| 6 | Base Rounds Calibration | 37k tasks | Implied rounds from actual effort |
| 7 | Sensitivity Analysis | Formula-only | Tornado chart per complexity |
| 8 | Agent Effectiveness | METR (24k runs) | Success rate by task duration |
| 9 | Token Consumption | Aider + Tokenomics + onprem | Tokens per case and phase distribution |
| 10 | Output Token Ratio | Aider (23) + Tokenomics (6) | Output/total token split |
| 11 | Cost Model | Aider (93 entries) | Cost per case by model tier |
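Analysis 1 compares candidate distributions with a one-sample Kolmogorov–Smirnov statistic: the maximum gap between the empirical CDF of the effort data and each fitted candidate's CDF, where a smaller gap means a better fit. A minimal stdlib-only sketch of that comparison, using made-up right-skewed effort values rather than the real datasets:

```python
import math
import statistics

def ks_stat(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical CDF and a candidate CDF (smaller = better fit)."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

# Toy right-skewed effort data (hours); illustrative, not dataset values.
hours = [1.2, 1.5, 2.0, 2.2, 2.8, 3.5, 4.1, 5.0, 7.5, 12.0]

# Fit a normal to the raw data, and a log-normal by fitting a normal
# to the log-transformed data.
norm = statistics.NormalDist(statistics.mean(hours), statistics.stdev(hours))
logs = [math.log(h) for h in hours]
log_base = statistics.NormalDist(statistics.mean(logs), statistics.stdev(logs))

d_norm = ks_stat(hours, norm.cdf)
d_lognorm = ks_stat(hours, lambda x: log_base.cdf(math.log(x)))
print(f"KS statistic: normal={d_norm:.3f}  log-normal={d_lognorm:.3f}")
```

On skewed data like this, the log-normal fit produces the smaller KS statistic, which is the pattern that drove the PERT distribution change in the table below.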
## Key Parameter Adjustments

| Parameter | Before | After | Evidence |
|---|---|---|---|
| 90% confidence multiplier | Flat 1.8× | Size-dependent 2.0–2.9× | 84k pairs — 1.8× captured 80-88%, not 90% |
| Agent effectiveness M/L | 0.7 / 0.5 | 0.5 / 0.35 | 24k METR runs — autonomous rates M=0.25, L=0.15 |
| PERT distribution | Beta (arithmetic midpoint) | Log-normal (geometric mean) | KS test — log-normal wins all 4 bands |
| Review minutes | S:30, M:60, L:120, XL:240 | S:20, M:30, L:60, XL:120 | Project-22 (n=290) — 17-20 min medians |
| Output token ratio S/M | 0.25 / 0.28 | 0.30 / 0.30 | Aider (n=23) — 0.31 median |
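The PERT change in the table above swaps an arithmetic midpoint for a geometric one. A hedged sketch of the difference, applying the classic 1-4-1 PERT weighting in linear space versus log space (whether the repo uses exactly this log-space weighting is an assumption; the three-point estimate here is hypothetical):

```python
import math

def pert_mean(o, m, p):
    """Classic PERT expected value: arithmetic (O + 4M + P) / 6."""
    return (o + 4 * m + p) / 6

def pert_geometric_mean(o, m, p):
    """Log-space variant consistent with a log-normal assumption: the same
    1-4-1 weighting applied to log-durations, then exponentiated."""
    return math.exp((math.log(o) + 4 * math.log(m) + math.log(p)) / 6)

# Hypothetical three-point estimate in hours: optimistic, most likely, pessimistic.
o, m, p = 2.0, 4.0, 16.0
print(f"arithmetic PERT: {pert_mean(o, m, p):.2f} h")
print(f"geometric PERT:  {pert_geometric_mean(o, m, p):.2f} h")
```

With a right-skewed three-point estimate the geometric mean sits below the arithmetic one, so a long pessimistic tail inflates the central estimate less.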
## Datasets

| Dataset | Size | Source |
|---|---|---|
| CESAW | 61,514 tasks | Derek Jones |
| SiP | 12,100 tasks | Derek Jones |
| Renzo Pomodoro | 10,464 tasks | Derek Jones |
| Project-22 | 616 stories + 1,441 reviews | Derek Jones |
| METR Time Horizons | 24,008 runs / 228 tasks | METR |
| Aider Leaderboard | ~50 models / 93 entries | Aider |
| Tokenomics (ChatDev) | 30 tasks / 6 phases | arXiv 2601.14470 |
| onprem.ai | 10 coding tasks | Bundled in repo |
## Running the Analyses

```bash
# Install datasets
git clone https://github.com/Derek-Jones/Software-estimation-datasets.git datasets/derek-jones
bash datasets/download_benchmarks.sh

# Run all 11 analyses
python3 tests/deep_validation.py

# Run an individual analysis, or list the available ones
python3 tests/deep_validation.py --analysis distribution
python3 tests/deep_validation.py --list
```

See docs/deep-validation-findings.md for the complete findings, with bootstrap CIs, rationale, and caveats for each parameter.