
Calibrate formula parameters against 110k data points #5

Merged: Enreign merged 2 commits into main from feature/v0.5-deep-validation-calibration on Mar 12, 2026

Conversation

@Enreign Enreign commented Mar 12, 2026

Summary

  • Deep statistical validation of all formula parameters using 110k+ data points across 13 datasets and 11 analyses
  • 5 parameter changes backed by bootstrap CIs (n=1000, seed=42) and KS goodness-of-fit tests
  • 19 new tests encoding research-backed properties (63 total, all pass)
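The bootstrap-CI and KS-test methodology above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `deep_validation.py`: the data here is synthetic (real analysis used 84k+ estimate-actual pairs), and only the resample count (n=1000) and seed (42) come from the PR.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed=42, matching the PR's methodology

# Synthetic overrun ratios (actual / estimated) standing in for real data
ratios = rng.lognormal(mean=0.1, sigma=0.6, size=5000)

# Bootstrap CI (n=1000 resamples) for the 90th-percentile multiplier
boot = np.array([
    np.quantile(rng.choice(ratios, size=ratios.size, replace=True), 0.90)
    for _ in range(1000)
])
ci_low, ci_high = np.quantile(boot, [0.025, 0.975])

# KS goodness-of-fit of a fitted log-normal against the sample
# (note: fitting parameters on the same sample biases the p-value upward;
# the sketch ignores that refinement)
shape, loc, scale = stats.lognorm.fit(ratios, floc=0)
ks = stats.kstest(ratios, "lognorm", args=(shape, loc, scale))
print(f"90% multiplier CI: [{ci_low:.2f}, {ci_high:.2f}], KS p={ks.pvalue:.3f}")
```

The same pattern — resample, recompute the statistic, take the 2.5th/97.5th percentiles — applies to each parameter the audit examined.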

Parameter Changes

| # | Parameter | Before | After | Evidence |
|---|-----------|--------|-------|----------|
| 1 | 90% confidence multiplier | Flat 1.8x | Size-dependent 2.0–2.9x | 84k estimate-actual pairs — 1.8x captured only 80–88%, not 90% |
| 2 | Agent effectiveness M/L | 0.7 / 0.5 | 0.5 / 0.35 | 24k METR runs (Jan 2025–Feb 2026) — autonomous rates: M=0.25, L=0.15 |
| 3 | PERT distribution | Beta (arithmetic midpoint) | Log-normal (geometric mean) | KS test — log-normal wins all 4 size bands (n=84k) |
| 4 | Review minutes | S:30, M:60, L:120, XL:240 | S:20, M:30, L:60, XL:120 | Project-22 reviews (n=290) — 17–20 min medians |
| 5 | Output token ratio S/M | 0.25 / 0.28 | 0.30 / 0.30 | Aider leaderboard (n=23) — 0.31 median excl. reasoning tokens |
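Change 3 (beta → log-normal PERT) can be illustrated with a small sketch. Assuming the project keeps the classic (o + 4m + p)/6 weighting but moves it into log space — which is one natural reading of "geometric mean"; the function names here are hypothetical, not the repo's actual API:

```python
import math

def pert_beta(o: float, m: float, p: float) -> float:
    """Classic beta-PERT expected value: arithmetic midpoint weighting."""
    return (o + 4 * m + p) / 6

def pert_lognormal(o: float, m: float, p: float) -> float:
    """Hypothetical log-normal PERT: same 1-4-1 weights applied to the
    logarithms, i.e. a weighted geometric mean of the three estimates."""
    log_mean = (math.log(o) + 4 * math.log(m) + math.log(p)) / 6
    return math.exp(log_mean)

# Right-skewed spread (pessimistic far above likely), typical of task estimates
print(pert_beta(2, 4, 16))       # arithmetic weighting
print(pert_lognormal(2, 4, 16))  # geometric weighting sits lower
```

For symmetric inputs the two agree; for the right-skewed spreads the KS tests favored, the geometric form is pulled less by the pessimistic tail, which is the behavior TestCase11 would be checking.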

New Files

  • tests/deep_validation.py — 11-analysis validation script with Parameter Audit Card
  • datasets/download_benchmarks.sh — downloads METR, OpenHands, Aider data
  • datasets/benchmarks/tokenomics.csv — per-phase token distribution (arXiv 2601.14470)
  • datasets/benchmarks/onprem-tokens.csv — coding task token estimates
  • docs/deep-validation-findings.md — combined findings with rationale and data source links

Full Audit Results

53 parameters audited: 22 ADJUST, 7 FLAG, 24 KEEP. Only the top 5 highest-confidence changes are applied in this PR. See docs/deep-validation-findings.md for the complete analysis.

Test plan

  • All 63 tests pass (python3 -m pytest tests/test_formulas.py -v)
  • Deep validation runs end-to-end (python3 tests/deep_validation.py)
  • New TestCase10 verifies research-backed parameter properties
  • New TestCase11 verifies log-normal PERT weighting behavior
  • New TestCase12 verifies parameter interaction sanity
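Property tests of this kind might look like the sketch below. The constant names and the S/XL multiplier values are assumptions for illustration; only the M/L agent-effectiveness and review-minute values come from the PR itself.

```python
# Hypothetical parameter tables in the spirit of TestCase10-12.
# S and XL confidence values are illustrative; the PR states the 2.0-2.9x range.
CONFIDENCE_90 = {"S": 2.0, "M": 2.3, "L": 2.6, "XL": 2.9}   # was flat 1.8x
AGENT_EFFECTIVENESS = {"M": 0.5, "L": 0.35}                  # per the PR
REVIEW_MINUTES = {"S": 20, "M": 30, "L": 60, "XL": 120}

def test_confidence_multiplier_grows_with_size():
    # Overrun tails fatten with task size, so the multiplier must be monotone
    values = [CONFIDENCE_90[s] for s in ("S", "M", "L", "XL")]
    assert values == sorted(values)

def test_agent_effectiveness_shrinks_with_size():
    # METR autonomous-completion rates fall as tasks grow
    assert AGENT_EFFECTIVENESS["M"] > AGENT_EFFECTIVENESS["L"]

def test_review_minutes_halved_from_previous_defaults():
    assert REVIEW_MINUTES == {"S": 20, "M": 30, "L": 60, "XL": 120}
```

Encoding the *ordering* and *ratio* properties rather than exact values keeps the tests stable across future recalibrations.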

🤖 Generated with Claude Code

Enreign and others added 2 commits March 12, 2026 00:28
Validates progressive-estimation formulas against CESAW (61k), SiP (12k),
Renzo (10k), Project-22, China, Kitchenham, NASA93, Desharnais, and more.
8 analysis sections: estimation accuracy, effort distributions, overrun
patterns, task types, review times, story points, contextual factors,
and personal estimation patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5 parameter changes backed by 11-analysis deep validation across 86k
estimation tasks + 24k METR agent runs + 93 Aider benchmark entries:

1. Confidence multipliers: flat → size-dependent (90%: 1.8x → 2.0-2.9x)
2. Agent effectiveness: M 0.7→0.5, L 0.5→0.35 (METR time horizons)
3. PERT assumption: beta → log-normal (KS test, all 4 bands)
4. Review minutes: halved across all depths (17-20 min median in data)
5. Output token ratio: S 0.25→0.30, M 0.28→0.30 (Aider 0.31 median)

Adds deep_validation.py (11 analyses), benchmark download tooling,
bundled token datasets, and findings summary document.

63 tests pass (44 updated + 19 new validation tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Enreign Enreign merged commit d08a513 into main Mar 12, 2026
3 checks passed
@Enreign Enreign deleted the feature/v0.5-deep-validation-calibration branch March 12, 2026 11:33