Releases: Enreign/progressive-estimation
v0.5.0 — Deep Validation Calibration
What's New
Deep Statistical Validation
All formula parameters validated against 110k+ data points across 13 datasets using bootstrap confidence intervals and KS goodness-of-fit tests. 53 parameters audited across 11 analyses — 22 adjusted, 7 flagged, 24 confirmed.
Full findings: docs/deep-validation-findings.md
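The audit's core statistical tooling — percentile-bootstrap confidence intervals plus a KS goodness-of-fit check against a fitted log-normal — can be sketched as follows. This is a simplified, hypothetical illustration with synthetic data, not the actual `tests/deep_validation.py` script:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def bootstrap_ci(samples, stat=np.median, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a statistic."""
    boots = [stat(rng.choice(samples, size=len(samples), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Synthetic estimate/actual ratios, log-normal by construction
ratios = rng.lognormal(mean=0.0, sigma=0.6, size=5000)

lo, hi = bootstrap_ci(ratios)

# KS goodness-of-fit: compare the data against a fitted log-normal
shape, loc, scale = stats.lognorm.fit(ratios, floc=0)
ks = stats.kstest(ratios, "lognorm", args=(shape, loc, scale))
print(f"median 95% CI: [{lo:.2f}, {hi:.2f}], KS p-value: {ks.pvalue:.3f}")
```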
5 Research-Backed Parameter Changes
1. Confidence multipliers → size-dependent
Old: flat 1.0x / 1.4x / 1.8x for all sizes.
New: size-dependent (e.g., 90% S=2.9x, M=2.1x, L=2.0x, XL=2.2x).
Evidence: 84k estimate-actual pairs. The old 1.8x captured only 80-88% of tasks, not 90%.
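The shape of the change can be shown with a small lookup, using the 90%-band values quoted above (the function name and table layout are illustrative, not the skill's actual API):

```python
# 90%-confidence multipliers by task size, from the values quoted above
CONFIDENCE_90 = {"S": 2.9, "M": 2.1, "L": 2.0, "XL": 2.2}
OLD_FLAT_90 = 1.8  # previous one-size-fits-all multiplier

def upper_bound(expected_hours: float, size: str) -> float:
    """Scale an expected estimate to a 90%-confidence upper bound."""
    return expected_hours * CONFIDENCE_90[size]

# Small tasks widen far more than the old flat 1.8x suggested
print(upper_bound(10, "S"))
```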
2. Agent effectiveness: M and L lowered
Old: M=0.7, L=0.5. New: M=0.5, L=0.35.
Evidence: METR Time Horizons — 24k runs, 228 tasks (Jan 2025–Feb 2026). Autonomous completion: M=0.25, L=0.15. Our values sit between METR rates and 1.0 (partial-credit acceleration).
3. PERT → log-normal weighting
Old: (min + 4×midpoint + max) / 6 (beta assumption).
New: (min + 4×geometric_mean + max) / 6 (log-normal).
Evidence: KS goodness-of-fit test (n=84k) — log-normal beats beta in all 4 size bands. Software effort has heavier right tails than beta captures.
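The weighting change amounts to swapping the midpoint term for a geometric mean. A minimal sketch, assuming the geometric mean is taken over the min/max bounds (function names are illustrative; the skill's exact formula lives in references/formulas.md):

```python
import math

def pert_beta(lo: float, mid: float, hi: float) -> float:
    """Classic PERT expected value (beta-distribution assumption)."""
    return (lo + 4 * mid + hi) / 6

def pert_lognormal(lo: float, hi: float) -> float:
    """Log-normal variant: replace the midpoint with the geometric mean
    of the bounds, which respects the heavier right tail."""
    geometric_mean = math.sqrt(lo * hi)
    return (lo + 4 * geometric_mean + hi) / 6

# Example: 2h best case, 20h worst case, 8h most likely
print(pert_beta(2, 8, 20))    # (2 + 32 + 20) / 6 = 9.0
print(pert_lognormal(2, 20))  # geometric mean ~6.32 shifts the weighting
```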
4. Review minutes halved
Old: S=30, M=60, L=120, XL=240 (standard depth).
New: S=20, M=30, L=60, XL=120.
Evidence: Project-22 reviews (n=290 joined) — 17-20 min median regardless of size. A conservative reduction, since the underlying data measures review of human-written code rather than AI-generated code.
5. Output token ratio: S/M raised
Old: S=0.25, M=0.28. New: S=0.30, M=0.30.
Evidence: Aider leaderboard (n=23 models) — 0.31 median output ratio (excluding reasoning tokens).
New Validation Infrastructure
- `tests/deep_validation.py` — 11-analysis script producing Parameter Audit Cards with bootstrap CIs
- `datasets/download_benchmarks.sh` — downloads METR, OpenHands, Aider benchmark data
- Bundled datasets: `tokenomics.csv` (arXiv 2601.14470), `onprem-tokens.csv`
- 19 new tests (63 total) encoding research-backed properties
Data Sources
| Dataset | Size | What It Validates |
|---|---|---|
| CESAW | 61k tasks | Distribution fitting, confidence multipliers, base rounds |
| SiP | 12k tasks | Task type multipliers, confidence multipliers |
| Project-22 | 616 stories + 1441 reviews | Review times, human fix ratio |
| Renzo Pomodoro | 10k tasks | Distribution fitting, confidence multipliers |
| METR Time Horizons | 24k runs / 228 tasks | Agent effectiveness by size |
| Aider Leaderboard | ~50 models / 93 entries | Output token ratio, cost model |
| Tokenomics (ChatDev) | 30 tasks / 6 phases | Per-phase token distribution |
Version
0.4.0 → 0.5.0
Files Changed
| File | Changes |
|---|---|
| references/formulas.md | 5 parameter updates, log-normal PERT, size-dependent multipliers |
| tests/test_formulas.py | Updated constants + 19 new validation tests (63 total) |
| tests/deep_validation.py | New — 11-analysis deep validation script |
| datasets/download_benchmarks.sh | New — benchmark data download script |
| datasets/benchmarks/*.csv | New — bundled token datasets |
| docs/deep-validation-findings.md | New — combined findings document |
| datasets/README.md | Added benchmark download instructions |
| .gitignore | Added benchmark data exclusions |
| SKILL.md | Version bump, updated descriptions |
| README.md | Updated parameters, research sources, file structure |
| evals/eval-regression.md | Updated baselines for new values |
v0.4.0 — Token Estimation, 4 New PM Tools
What's New
Token & Cost Estimation (Step 15)
- Computes per-task token consumption based on complexity × maturity lookup tables
- Splits into input/output tokens using complexity-specific output ratios
- PERT expected tokens with min/max range
- Optional API cost estimation across three model tiers:
- Economy — Haiku, GPT-4o Mini, Gemini Flash
- Standard — Sonnet, GPT-4o, Gemini 2.5 Pro
- Premium — Opus, GPT-5
- Includes per-model pricing reference table (last verified March 2026)
- Tokens appear in the one-line summary: `10-26 agent rounds (~180k tokens)`
- Token Estimate and Model Tier rows added to breakdown table
- Cost shown only when `show_cost=true` (off by default)
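The shape of the Step 15 math can be sketched like this. The lookup values and per-million-token prices below are illustrative placeholders, not the skill's real tables (those live in references/formulas.md), and the function signature is hypothetical:

```python
# Illustrative lookup tables -- NOT the skill's real values
TOKENS_PER_ROUND = {"S": 6_000, "M": 12_000, "L": 20_000}        # by complexity
OUTPUT_RATIO = {"S": 0.30, "M": 0.30, "L": 0.28}                 # output share
PRICE_PER_MTOK = {"standard": {"input": 3.00, "output": 15.00}}  # USD, placeholder

def estimate_tokens(rounds: int, complexity: str, tier: str = "standard") -> dict:
    """Split total tokens into input/output and price them by tier."""
    total = rounds * TOKENS_PER_ROUND[complexity]
    out = int(total * OUTPUT_RATIO[complexity])
    inp = total - out
    price = PRICE_PER_MTOK[tier]
    cost = inp / 1e6 * price["input"] + out / 1e6 * price["output"]
    return {"total_tokens": total, "input": inp, "output": out,
            "cost_usd": round(cost, 2)}

print(estimate_tokens(18, "M"))
```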
4 New PM Tool Integrations
- Asana — custom fields for sizing, risk, agent rounds, tokens; subtasks native
- Azure DevOps — tags for size/risk, native Original Estimate, HTML table in Description
- Zenhub — GitHub labels + body sections, Zenhub Estimate for points
- Shortcut — labels for size/risk, custom fields on Team plan+, native Estimate for points
Total supported trackers: 10 (was 6)
New Inputs
- `model_tier` — economy/standard/premium (default: standard)
- `show_cost` — boolean (default: false)
- Question #14 added to detailed questionnaire path
Test Coverage
- `estimate_tokens()` function with full Step 15 math
- Token estimation integrated into main `estimate()` pipeline
- TestCase7 — exact token math, output ratios, cost arithmetic
- TestCase8 — complexity scaling, multi-agent multiplier, maturity variation, PERT cost formula, tier pricing comparison
- TestCase9 — integration: `token_estimate` present in `estimate()` output, correct structure, rounds consistency, `show_cost` toggle
- 44 tests total (was 25), all passing
- Original 6 test cases unchanged (backward compatible)
Version
0.3.0 → 0.4.0
Files Changed
| File | Changes |
|---|---|
| references/formulas.md | Token lookup tables, Step 15 formula, output schema fields |
| tests/test_formulas.py | estimate_tokens(), TestCase7-9, integration into estimate() |
| references/questionnaire.md | Question #14, mapping table update |
| references/output-schema.md | Token rows in templates, 4 new PM tool sections, batch Tokens column |
| SKILL.md | Tracker list, pipeline step, version bump |
| README.md | Feature list, tracker list |
v0.3.0 — Instant Mode, Cooperation Modes, Research Grounding
What's New
Three Estimation Modes
- Instant — zero questions, infer everything from the task description
- Quick — 4 questions with sensible defaults
- Detailed — 13 questions, full control over every parameter
Cooperation Mode Detection
Auto-detects your team's working style from intake answers:
- Human-only → story points for sizing and velocity
- Hybrid → dual-track: points for sizing, hours for planning
- Agent-first → plan by human review capacity in hours
Small Council Validation
Subagent perspectives (Optimist, Skeptic, Historian) review estimates for M+ tasks before output.
Sprint Velocity Tracking
Mode-aware velocity tracking with rolling averages, capacity planning, and divergence detection for hybrid teams.
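The rolling-average core of such a tracker can be sketched in a few lines. This is a minimal illustration assuming a fixed sprint window; the skill's actual tracker is mode-aware and adds capacity planning and divergence detection on top:

```python
from collections import deque

class VelocityTracker:
    """Minimal rolling-average velocity over the last N sprints.
    Illustrative only -- not the skill's actual implementation."""

    def __init__(self, window: int = 3):
        self.completed = deque(maxlen=window)  # oldest sprint falls out

    def record_sprint(self, points: float) -> None:
        self.completed.append(points)

    def velocity(self) -> float:
        return sum(self.completed) / len(self.completed) if self.completed else 0.0

tracker = VelocityTracker(window=3)
for pts in (21, 18, 24, 30):  # sprint 1 (21 pts) drops out of the window
    tracker.record_sprint(pts)
print(tracker.velocity())  # (18 + 24 + 30) / 3 = 24.0
```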
Research Grounding
Formulas now cite 15+ empirical studies:
- METR Time Horizons (R²=0.83, ~230 tasks) — agent effectiveness decay
- METR RCT (16 devs, 246 tasks) — productivity measurement
- Faros AI (10k devs, 1.2k teams) — review bottleneck validation
- PairCoder (400+ tasks) — human time dominance confirmation
- Agent effectiveness clarified as work acceleration, not autonomous completion rate
- Bug-fix multiplier bumped 1.2x → 1.3x based on empirical data
GitHub Action
Auto-estimate issues labeled `needs-estimate` with PERT formulas.
AskUserQuestion Integration
Structured dropdowns for all user interactions when the tool is available.
v0.2.0 — Multi-skill repo structure
Agent Skills v0.2.0
Breaking changes
- Repo renamed from `progressive-estimation` to `agent-skills` (GitHub redirects old URLs)
- Skill files moved to `skills/progressive-estimation/` for multi-skill support
What's new
- Multi-skill repo structure — ready for additional skills under `skills/`
- New README — landing page listing all available skills
- Updated install commands:

# Via skills.sh
npx skills add Enreign/agent-skills

# Manual (Claude Code)
git clone https://github.com/Enreign/agent-skills.git ~/.claude/skills/agent-skills
Skills included
- progressive-estimation — PERT estimation for AI-assisted development work
⚠️ Early development — formulas, multipliers, and defaults may change between versions.
v0.1.1 — skills.sh compatibility + structured intake
Progressive Estimation v0.1.1
What's new
- skills.sh compatibility — install with `npx skills add Enreign/progressive-estimation`
- YAML frontmatter in SKILL.md for automatic skill discovery
- Structured intake — uses `AskUserQuestion` tool when available for a dropdown-style questionnaire instead of free-text back-and-forth
- Updated banner — new Estimation artwork
Install
# Via skills.sh (recommended)
npx skills add Enreign/progressive-estimation
# Manual (Claude Code)
git clone https://github.com/Enreign/progressive-estimation.git ~/.claude/skills/progressive-estimation

Or download the .skill archive from this release.
⚠️ Early development — formulas, multipliers, and defaults may change between versions.
v0.1.0 — Initial Public Release
Progressive Estimation v0.1.0
First public release of the Progressive Estimation skill for AI coding assistants.
What's included
- PERT three-point estimation with confidence bands (68%, 95%)
- Four workflow paths: quick/detailed × single/batch
- Agent effectiveness decay modeling based on METR research
- Calibration system with PRED(25) tracking
- Tracker output for Linear, JIRA, ClickUp, GitHub Issues, Monday, and GitLab
- Anti-pattern guards for common estimation mistakes
- Eval suite for regression, batch, quick, and hybrid scenarios
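The three-point math with confidence bands can be sketched as below, assuming the conventional PERT sigma of (pessimistic − optimistic) / 6; the skill's exact formulas live in references/formulas.md and may differ:

```python
def pert_estimate(optimistic: float, likely: float, pessimistic: float) -> dict:
    """Three-point PERT expected value with 68%/95% confidence bands,
    using the conventional sigma = (pessimistic - optimistic) / 6."""
    expected = (optimistic + 4 * likely + pessimistic) / 6
    sigma = (pessimistic - optimistic) / 6
    return {
        "expected": expected,
        "band_68": (expected - sigma, expected + sigma),
        "band_95": (expected - 2 * sigma, expected + 2 * sigma),
    }

# Example: 4h best case, 8h most likely, 16h worst case
print(pert_estimate(4, 8, 16))
```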
Install
git clone https://github.com/Enreign/progressive-estimation.git ~/.claude/skills/progressive-estimation

Or download the .skill archive from this release.
⚠️ Early development — formulas, multipliers, and defaults may change between versions.