Calibrate formula parameters against 110k data points #5
Merged
Validates progressive-estimation formulas against CESAW (61k), SiP (12k), Renzo (10k), Project-22, China, Kitchenham, NASA93, Desharnais, and more. 8 analysis sections: estimation accuracy, effort distributions, overrun patterns, task types, review times, story points, contextual factors, and personal estimation patterns.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5 parameter changes backed by an 11-analysis deep validation across 86k estimation tasks + 24k METR agent runs + 93 Aider benchmark entries:
1. Confidence multipliers: flat → size-dependent (90%: 1.8x → 2.0-2.9x)
2. Agent effectiveness: M 0.7→0.5, L 0.5→0.35 (METR time horizons)
3. PERT assumption: beta → log-normal (KS test, all 4 bands)
4. Review minutes: halved across all depths (17-20 min median in data)
5. Output token ratio: S 0.25→0.30, M 0.28→0.30 (Aider 0.31 median)
Adds deep_validation.py (11 analyses), benchmark download tooling, bundled token datasets, and a findings summary document. 63 tests pass (44 updated + 19 new validation tests).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
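As a rough illustration of how changes 1 and 3 fit together: under a log-normal model, the ratio of the 90th percentile to the median is exp(sigma * z), where z is the standard normal quantile at 0.90. The sketch below uses only the 2.0x and 2.9x endpoints quoted above to back out the implied log-space sigma. The function names are hypothetical and are not taken from this repository, and which size band corresponds to which multiplier is not stated here, so no mapping is assumed.

```python
import math
from statistics import NormalDist


def sigma_from_multiplier(multiplier: float, confidence: float = 0.90) -> float:
    """Log-space sigma for which the `confidence` quantile equals multiplier * median.

    Hypothetical helper: assumes effort is log-normal, so
    quantile / median = exp(sigma * z).
    """
    z = NormalDist().inv_cdf(confidence)
    return math.log(multiplier) / z


def multiplier_at(sigma: float, confidence: float) -> float:
    """Multiplier on the median at another confidence level, same log-normal model."""
    z = NormalDist().inv_cdf(confidence)
    return math.exp(sigma * z)


if __name__ == "__main__":
    # 2.0 and 2.9 are the endpoints of the 90% range quoted in the commit message.
    for m90 in (2.0, 2.9):
        sigma = sigma_from_multiplier(m90)
        m80 = multiplier_at(sigma, 0.80)
        print(f"90% multiplier {m90:.1f}x: sigma={sigma:.3f}, implied 80% multiplier={m80:.2f}x")
```

A wider multiplier implies a larger sigma, which matches the reading that the multiplier now varies with task size instead of staying flat.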
Summary
Parameter Changes
New Files
tests/deep_validation.py: 11-analysis validation script with Parameter Audit Card
datasets/download_benchmarks.sh: downloads METR, OpenHands, Aider data
datasets/benchmarks/tokenomics.csv: per-phase token distribution (arXiv 2601.14470)
datasets/benchmarks/onprem-tokens.csv: coding task token estimates
docs/deep-validation-findings.md: combined findings with rationale and data source links
Full Audit Results
53 parameters audited: 22 ADJUST, 7 FLAG, 24 KEEP. Only the top 5 highest-confidence changes are applied in this PR. See docs/deep-validation-findings.md for the complete analysis.
Test plan
python3 -m pytest tests/test_formulas.py -v
python3 tests/deep_validation.py
🤖 Generated with Claude Code