
Deep Validation

Enreign edited this page Mar 12, 2026 · 1 revision

All formula parameters have been validated against 110k+ data points across 13 datasets, using bootstrap confidence intervals (n=1000, seed=42) and Kolmogorov-Smirnov (KS) goodness-of-fit tests.
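
As a rough illustration of the bootstrap step, the sketch below computes a percentile confidence interval for the median of a sample. It assumes `n=1000` means 1,000 resamples and that `seed=42` seeds the random generator. The function name `bootstrap_ci`, the 95% level, and the synthetic ratios are all hypothetical and not taken from `tests/deep_validation.py`.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.median, n_resamples=1000, seed=42, level=0.95):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Recompute the statistic on n_resamples samples drawn with replacement.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - level) / 2.0
    lo = estimates[int(alpha * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi


# Synthetic actual/estimate ratios, for illustration only (not project data).
data_rng = random.Random(0)
ratios = [data_rng.lognormvariate(0.0, 0.5) for _ in range(500)]

low, high = bootstrap_ci(ratios)
print(f"median = {statistics.median(ratios):.3f}, CI = [{low:.3f}, {high:.3f}]")
```

Because the seed is fixed, repeated calls on the same data return the same interval, which is what makes a reported interval reproducible.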

Summary

53 parameters were audited across 11 analyses: 22 adjusted, 7 flagged, and 24 confirmed.

The 11 Analyses

| # | Analysis | Data | What It Tests |
|---|----------|------|---------------|
| 1 | Distribution Fitting | CESAW + SiP + Renzo (84k) | PERT-beta vs log-normal vs normal |
| 2 | Optimal Confidence Multipliers | 84k estimate-actual pairs | Size-dependent 50%/80%/90% multipliers |
| 3 | Review Time Regression | Project-22 (290 reviews) | Review minutes by size band |
| 4 | Task Type Multipliers | SiP (12k) + CESAW (61k) | Overrun ratios by task category |
| 5 | Human Fix Ratio Proxy | Project-22 (1156 branches) | Rework from review cycles |
| 6 | Base Rounds Calibration | 37k tasks | Implied rounds from actual effort |
| 7 | Sensitivity Analysis | Formula-only | Tornado chart per complexity |
| 8 | Agent Effectiveness | METR (24k runs) | Success rate by task duration |
| 9 | Token Consumption | Aider + Tokenomics + onprem | Tokens per case and phase distribution |
| 10 | Output Token Ratio | Aider (23) + Tokenomics (6) | Output/total token split |
| 11 | Cost Model | Aider (93 entries) | Cost per case by model tier |
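
To make the distribution-fitting comparison in analysis 1 concrete, here is a minimal sketch of a one-sample KS statistic computed against a fitted normal and a fitted log-normal. It uses only the standard library and synthetic data. The helper names `ks_statistic` and `fit_normal` are hypothetical, and the project's own script may fit and compare the distributions differently.

```python
import math
import random
import statistics


def ks_statistic(sample, cdf):
    """One-sample KS statistic: largest gap between the empirical CDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Check the gap just after and just before each step of the empirical CDF.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def fit_normal(sample):
    """Normal distribution with the sample's mean and standard deviation."""
    return statistics.NormalDist(statistics.fmean(sample), statistics.stdev(sample))


# Synthetic right-skewed actual/estimate ratios, for illustration only.
rng = random.Random(42)
ratios = [rng.lognormvariate(0.1, 0.6) for _ in range(2000)]

normal = fit_normal(ratios)
log_normal = fit_normal([math.log(r) for r in ratios])  # normal fit on the log scale

d_normal = ks_statistic(ratios, normal.cdf)
d_lognormal = ks_statistic(ratios, lambda x: log_normal.cdf(math.log(x)))
print(f"KS normal = {d_normal:.4f}, KS log-normal = {d_lognormal:.4f}")
```

A smaller statistic indicates a closer fit. Since the parameters here are estimated from the same sample, the statistics are best read as a relative comparison between candidate distributions rather than converted to textbook p-values.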

Parameter Changes in v0.5.0

| Parameter | Before | After | Evidence |
|-----------|--------|-------|----------|
| 90% confidence multiplier | Flat 1.8× | Size-dependent 2.0–2.9× | 84k pairs: 1.8× captured 80-88%, not 90% |
| Agent effectiveness M/L | 0.7 / 0.5 | 0.5 / 0.35 | 24k METR runs: autonomous rates M=0.25, L=0.15 |
| PERT distribution | Beta (arithmetic midpoint) | Log-normal (geometric mean) | KS test: log-normal wins all 4 bands |
| Review minutes | S:30, M:60, L:120, XL:240 | S:20, M:30, L:60, XL:120 | Project-22 (n=290): 17-20 min medians |
| Output token ratio S/M | 0.25 / 0.28 | 0.30 / 0.30 | Aider (n=23): 0.31 median |
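
The 90% confidence multiplier row rests on a coverage check: what fraction of tasks finish within estimate × multiplier. The sketch below shows one way such a check could look. The functions `coverage` and `empirical_multiplier`, the per-band spreads, and the synthetic estimate-actual pairs are all hypothetical, so the printed numbers will not reproduce the figures in the table.

```python
import math
import random


def coverage(pairs, multiplier):
    """Fraction of (estimate, actual) pairs with actual/estimate <= multiplier."""
    hits = sum(1 for estimate, actual in pairs if actual / estimate <= multiplier)
    return hits / len(pairs)


def empirical_multiplier(pairs, target=0.90):
    """Smallest observed actual/estimate ratio covering at least `target` of the pairs."""
    ratios = sorted(actual / estimate for estimate, actual in pairs)
    index = min(len(ratios) - 1, math.ceil(target * len(ratios)) - 1)
    return ratios[index]


# Hypothetical log-scale spreads per size band; not fitted to any project data.
rng = random.Random(42)
spreads = {"S": 0.45, "M": 0.55, "L": 0.65, "XL": 0.75}
datasets = {}
for band, sigma in spreads.items():
    estimates = [rng.uniform(1.0, 40.0) for _ in range(2000)]
    datasets[band] = [(e, e * rng.lognormvariate(0.0, sigma)) for e in estimates]

for band, pairs in datasets.items():
    flat = coverage(pairs, 1.8)
    needed = empirical_multiplier(pairs)
    print(f"{band}: flat 1.8x covers {flat:.1%}; 90% coverage needs about {needed:.2f}x")
```

When the spread of overruns grows with task size, a single flat multiplier under-covers the larger bands, which is the pattern a size-dependent multiplier is meant to correct.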

Data Sources

| Dataset | Size | Source |
|---------|------|--------|
| CESAW | 61,514 tasks | Derek Jones |
| SiP | 12,100 tasks | Derek Jones |
| Renzo Pomodoro | 10,464 tasks | Derek Jones |
| Project-22 | 616 stories + 1,441 reviews | Derek Jones |
| METR Time Horizons | 24,008 runs / 228 tasks | METR |
| Aider Leaderboard | ~50 models / 93 entries | Aider |
| Tokenomics (ChatDev) | 30 tasks / 6 phases | arXiv 2601.14470 |
| onprem.ai | 10 coding tasks | Bundled in repo |

Reproducing

```bash
# Install datasets
git clone https://github.com/Derek-Jones/Software-estimation-datasets.git datasets/derek-jones
bash datasets/download_benchmarks.sh

# Run all 11 analyses
python3 tests/deep_validation.py

# Run an individual analysis
python3 tests/deep_validation.py --analysis distribution
python3 tests/deep_validation.py --list
```

Full Report

See docs/deep-validation-findings.md for the complete findings with bootstrap CIs, rationale, and caveats for each parameter.
