
Calibrate formula parameters against 110k data points #5

Merged: Enreign merged 2 commits into main from feature/v0.5-deep-validation-calibration on Mar 12, 2026

Conversation

@Enreign Enreign commented Mar 12, 2026

Summary

  • Deep statistical validation of all formula parameters using 110k+ data points across 13 datasets and 11 analyses
  • 5 parameter changes backed by bootstrap CIs (n=1000, seed=42) and KS goodness-of-fit tests
  • 19 new tests encoding research-backed properties (63 total, all pass)
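The bootstrap-CI and KS-test methodology above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `deep_validation.py`: the data here is synthetic (real analysis used 84k+ estimate-actual pairs), and only the resample count (n=1000) and seed (42) come from the PR.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed=42, matching the PR's methodology

# Synthetic overrun ratios (actual / estimated) standing in for real data
ratios = rng.lognormal(mean=0.1, sigma=0.6, size=5000)

# Bootstrap CI (n=1000 resamples) for the 90th-percentile multiplier
boot = np.array([
    np.quantile(rng.choice(ratios, size=ratios.size, replace=True), 0.90)
    for _ in range(1000)
])
ci_low, ci_high = np.quantile(boot, [0.025, 0.975])

# KS goodness-of-fit of a fitted log-normal against the sample
# (note: fitting parameters on the same sample biases the p-value upward;
# the sketch ignores that refinement)
shape, loc, scale = stats.lognorm.fit(ratios, floc=0)
ks = stats.kstest(ratios, "lognorm", args=(shape, loc, scale))
print(f"90% multiplier CI: [{ci_low:.2f}, {ci_high:.2f}], KS p={ks.pvalue:.3f}")
```

The same pattern — resample, recompute the statistic, take the 2.5th/97.5th percentiles — applies to each parameter the audit examined.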

Parameter Changes

| # | Parameter | Before | After | Evidence |
|---|-----------|--------|-------|----------|
| 1 | 90% confidence multiplier | Flat 1.8x | Size-dependent 2.0–2.9x | 84k estimate-actual pairs — 1.8x captured only 80–88%, not 90% |
| 2 | Agent effectiveness M/L | 0.7 / 0.5 | 0.5 / 0.35 | 24k METR runs (Jan 2025–Feb 2026) — autonomous rates: M=0.25, L=0.15 |
| 3 | PERT distribution | Beta (arithmetic midpoint) | Log-normal (geometric mean) | KS test — log-normal wins all 4 size bands (n=84k) |
| 4 | Review minutes | S:30, M:60, L:120, XL:240 | S:20, M:30, L:60, XL:120 | Project-22 reviews (n=290) — 17–20 min medians |
| 5 | Output token ratio S/M | 0.25 / 0.28 | 0.30 / 0.30 | Aider leaderboard (n=23) — 0.31 median excl. reasoning tokens |
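Change 3 (beta → log-normal PERT) can be illustrated with a small sketch. Assuming the project keeps the classic (o + 4m + p)/6 weighting but moves it into log space — which is one natural reading of "geometric mean"; the function names here are hypothetical, not the repo's actual API:

```python
import math

def pert_beta(o: float, m: float, p: float) -> float:
    """Classic beta-PERT expected value: arithmetic midpoint weighting."""
    return (o + 4 * m + p) / 6

def pert_lognormal(o: float, m: float, p: float) -> float:
    """Hypothetical log-normal PERT: same 1-4-1 weights applied to the
    logarithms, i.e. a weighted geometric mean of the three estimates."""
    log_mean = (math.log(o) + 4 * math.log(m) + math.log(p)) / 6
    return math.exp(log_mean)

# Right-skewed spread (pessimistic far above likely), typical of task estimates
print(pert_beta(2, 4, 16))       # arithmetic weighting
print(pert_lognormal(2, 4, 16))  # geometric weighting sits lower
```

For symmetric inputs the two agree; for the right-skewed spreads the KS tests favored, the geometric form is pulled less by the pessimistic tail, which is the behavior TestCase11 would be checking.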

New Files

  • tests/deep_validation.py — 11-analysis validation script with Parameter Audit Card
  • datasets/download_benchmarks.sh — downloads METR, OpenHands, Aider data
  • datasets/benchmarks/tokenomics.csv — per-phase token distribution (arXiv 2601.14470)
  • datasets/benchmarks/onprem-tokens.csv — coding task token estimates
  • docs/deep-validation-findings.md — combined findings with rationale and data source links

Full Audit Results

53 parameters audited: 22 ADJUST, 7 FLAG, 24 KEEP. Only the top 5 highest-confidence changes are applied in this PR. See docs/deep-validation-findings.md for the complete analysis.

Test plan

  • All 63 tests pass (python3 -m pytest tests/test_formulas.py -v)
  • Deep validation runs end-to-end (python3 tests/deep_validation.py)
  • New TestCase10 verifies research-backed parameter properties
  • New TestCase11 verifies log-normal PERT weighting behavior
  • New TestCase12 verifies parameter interaction sanity
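Property tests of this kind might look like the sketch below. The constant names and the S/XL multiplier values are assumptions for illustration; only the M/L agent-effectiveness and review-minute values come from the PR itself.

```python
# Hypothetical parameter tables in the spirit of TestCase10-12.
# S and XL confidence values are illustrative; the PR states the 2.0-2.9x range.
CONFIDENCE_90 = {"S": 2.0, "M": 2.3, "L": 2.6, "XL": 2.9}   # was flat 1.8x
AGENT_EFFECTIVENESS = {"M": 0.5, "L": 0.35}                  # per the PR
REVIEW_MINUTES = {"S": 20, "M": 30, "L": 60, "XL": 120}

def test_confidence_multiplier_grows_with_size():
    # Overrun tails fatten with task size, so the multiplier must be monotone
    values = [CONFIDENCE_90[s] for s in ("S", "M", "L", "XL")]
    assert values == sorted(values)

def test_agent_effectiveness_shrinks_with_size():
    # METR autonomous-completion rates fall as tasks grow
    assert AGENT_EFFECTIVENESS["M"] > AGENT_EFFECTIVENESS["L"]

def test_review_minutes_halved_from_previous_defaults():
    assert REVIEW_MINUTES == {"S": 20, "M": 30, "L": 60, "XL": 120}
```

Encoding the *ordering* and *ratio* properties rather than exact values keeps the tests stable across future recalibrations.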

🤖 Generated with Claude Code

Enreign and others added 2 commits March 12, 2026 00:28
Validates progressive-estimation formulas against CESAW (61k), SiP (12k),
Renzo (10k), Project-22, China, Kitchenham, NASA93, Desharnais, and more.
8 analysis sections: estimation accuracy, effort distributions, overrun
patterns, task types, review times, story points, contextual factors,
and personal estimation patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5 parameter changes backed by 11-analysis deep validation across 86k
estimation tasks + 24k METR agent runs + 93 Aider benchmark entries:

1. Confidence multipliers: flat → size-dependent (90%: 1.8x → 2.0-2.9x)
2. Agent effectiveness: M 0.7→0.5, L 0.5→0.35 (METR time horizons)
3. PERT assumption: beta → log-normal (KS test, all 4 bands)
4. Review minutes: halved across all depths (17-20 min median in data)
5. Output token ratio: S 0.25→0.30, M 0.28→0.30 (Aider 0.31 median)

Adds deep_validation.py (11 analyses), benchmark download tooling,
bundled token datasets, and findings summary document.

63 tests pass (44 updated + 19 new validation tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Enreign Enreign merged commit d08a513 into main Mar 12, 2026
3 checks passed
@Enreign Enreign deleted the feature/v0.5-deep-validation-calibration branch March 12, 2026 11:33