# Deep Validation

*Enreign edited this page Mar 12, 2026*
All formula parameters have been validated against 110k+ data points across 13 datasets using bootstrap confidence intervals (n=1000, seed=42) and KS goodness-of-fit tests.
53 parameters audited across 11 analyses. 22 adjusted, 7 flagged, 24 confirmed.
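The bootstrap procedure described above can be sketched in a few lines. This is an illustration of the method (a percentile bootstrap on the median, with n=1000 resamples and seed 42 matching the setup on this page), not the repository's actual validation code, and the overrun ratios are made-up toy values:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a statistic: resample with replacement,
    recompute the statistic, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy overrun ratios (actual / estimate) -- illustrative, not dataset rows.
ratios = [1.1, 0.9, 1.4, 2.0, 0.8, 1.2, 1.6, 1.0, 1.3, 0.95]
lo, hi = bootstrap_ci(ratios)
print(f"median overrun, 95% bootstrap CI: [{lo:.2f}, {hi:.2f}]")
```

A parameter is "confirmed" when its current value falls inside the interval, and "adjusted" or "flagged" when it falls outside.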
## Analyses

| # | Analysis | Data | What It Tests |
|---|---|---|---|
| 1 | Distribution Fitting | CESAW + SiP + Renzo (84k) | PERT-beta vs log-normal vs normal |
| 2 | Optimal Confidence Multipliers | 84k estimate-actual pairs | Size-dependent 50%/80%/90% multipliers |
| 3 | Review Time Regression | Project-22 (290 reviews) | Review minutes by size band |
| 4 | Task Type Multipliers | SiP (12k) + CESAW (61k) | Overrun ratios by task category |
| 5 | Human Fix Ratio Proxy | Project-22 (1156 branches) | Rework from review cycles |
| 6 | Base Rounds Calibration | 37k tasks | Implied rounds from actual effort |
| 7 | Sensitivity Analysis | Formula-only | Tornado chart per complexity |
| 8 | Agent Effectiveness | METR (24k runs) | Success rate by task duration |
| 9 | Token Consumption | Aider + Tokenomics + onprem | Tokens per case and phase distribution |
| 10 | Output Token Ratio | Aider (23) + Tokenomics (6) | Output/total token split |
| 11 | Cost Model | Aider (93 entries) | Cost per case by model tier |
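Analysis 1 compares candidate distributions with a one-sample Kolmogorov–Smirnov statistic: the maximum gap between the empirical CDF of the effort data and each fitted candidate's CDF, where a smaller gap means a better fit. A minimal stdlib-only sketch of that comparison, using made-up right-skewed effort values rather than the real datasets:

```python
import math
import statistics

def ks_stat(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical CDF and a candidate CDF (smaller = better fit)."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

# Toy right-skewed effort data (hours); illustrative, not dataset values.
hours = [1.2, 1.5, 2.0, 2.2, 2.8, 3.5, 4.1, 5.0, 7.5, 12.0]

# Fit a normal to the raw data, and a log-normal by fitting a normal
# to the log-transformed data.
norm = statistics.NormalDist(statistics.mean(hours), statistics.stdev(hours))
logs = [math.log(h) for h in hours]
log_base = statistics.NormalDist(statistics.mean(logs), statistics.stdev(logs))

d_norm = ks_stat(hours, norm.cdf)
d_lognorm = ks_stat(hours, lambda x: log_base.cdf(math.log(x)))
print(f"KS statistic: normal={d_norm:.3f}  log-normal={d_lognorm:.3f}")
```

On skewed data like this, the log-normal fit produces the smaller KS statistic, which is the pattern that drove the PERT distribution change in the table below.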
## Key Parameter Adjustments

| Parameter | Before | After | Evidence |
|---|---|---|---|
| 90% confidence multiplier | Flat 1.8× | Size-dependent 2.0–2.9× | 84k pairs — 1.8× captured 80-88%, not 90% |
| Agent effectiveness M/L | 0.7 / 0.5 | 0.5 / 0.35 | 24k METR runs — autonomous rates M=0.25, L=0.15 |
| PERT distribution | Beta (arithmetic midpoint) | Log-normal (geometric mean) | KS test — log-normal wins all 4 bands |
| Review minutes | S:30, M:60, L:120, XL:240 | S:20, M:30, L:60, XL:120 | Project-22 (n=290) — 17-20 min medians |
| Output token ratio S/M | 0.25 / 0.28 | 0.30 / 0.30 | Aider (n=23) — 0.31 median |
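The PERT change in the table above swaps an arithmetic midpoint for a geometric one. A hedged sketch of the difference, applying the classic 1-4-1 PERT weighting in linear space versus log space (whether the repo uses exactly this log-space weighting is an assumption; the three-point estimate here is hypothetical):

```python
import math

def pert_mean(o, m, p):
    """Classic PERT expected value: arithmetic (O + 4M + P) / 6."""
    return (o + 4 * m + p) / 6

def pert_geometric_mean(o, m, p):
    """Log-space variant consistent with a log-normal assumption: the same
    1-4-1 weighting applied to log-durations, then exponentiated."""
    return math.exp((math.log(o) + 4 * math.log(m) + math.log(p)) / 6)

# Hypothetical three-point estimate in hours: optimistic, most likely, pessimistic.
o, m, p = 2.0, 4.0, 16.0
print(f"arithmetic PERT: {pert_mean(o, m, p):.2f} h")
print(f"geometric PERT:  {pert_geometric_mean(o, m, p):.2f} h")
```

With a right-skewed three-point estimate the geometric mean sits below the arithmetic one, so a long pessimistic tail inflates the central estimate less.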
## Datasets

| Dataset | Size | Source |
|---|---|---|
| CESAW | 61,514 tasks | Derek Jones |
| SiP | 12,100 tasks | Derek Jones |
| Renzo Pomodoro | 10,464 tasks | Derek Jones |
| Project-22 | 616 stories + 1,441 reviews | Derek Jones |
| METR Time Horizons | 24,008 runs / 228 tasks | METR |
| Aider Leaderboard | ~50 models / 93 entries | Aider |
| Tokenomics (ChatDev) | 30 tasks / 6 phases | arXiv 2601.14470 |
| onprem.ai | 10 coding tasks | Bundled in repo |
## Running the Analyses

```bash
# Install datasets
git clone https://github.com/Derek-Jones/Software-estimation-datasets.git datasets/derek-jones
bash datasets/download_benchmarks.sh

# Run all 11 analyses
python3 tests/deep_validation.py

# Run an individual analysis, or list the available ones
python3 tests/deep_validation.py --analysis distribution
python3 tests/deep_validation.py --list
```

See docs/deep-validation-findings.md for the complete findings, with bootstrap CIs, rationale, and caveats for each parameter.