
Releases: Enreign/progressive-estimation

v0.5.0 — Deep Validation Calibration

12 Mar 11:35


What's New

Deep Statistical Validation

All formula parameters validated against 110k+ data points across 13 datasets using bootstrap confidence intervals and KS goodness-of-fit tests. 53 parameters audited across 11 analyses — 22 adjusted, 7 flagged, 24 confirmed.

Full findings: docs/deep-validation-findings.md

5 Research-Backed Parameter Changes

1. Confidence multipliers → size-dependent
Old: flat 1.0x / 1.4x / 1.8x for all sizes.
New: size-dependent (e.g., at the 90% band: S=2.9x, M=2.1x, L=2.0x, XL=2.2x).
Evidence: 84k estimate-actual pairs. The old 1.8x captured only 80-88% of tasks, not 90%.
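As a sketch, the new size-dependent multipliers amount to a simple lookup; only the 90% band is quoted in this release, and the dict/function names below are illustrative, not the skill's actual API.

```python
# 90% confidence-band multipliers by task size, from the v0.5.0 notes.
# Only the 90% band is given in the release; applying the multiplier to an
# expected value, as done here, is an assumption about how the skill uses it.
CONF_90 = {"S": 2.9, "M": 2.1, "L": 2.0, "XL": 2.2}

def upper_90(expected_hours: float, size: str) -> float:
    """Hypothetical 90% upper bound: expected value times the size multiplier."""
    return expected_hours * CONF_90[size]
```

For example, a medium task expected at 4 hours would get a 90% upper bound of 8.4 hours under this sketch.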

2. Agent effectiveness: M and L lowered
Old: M=0.7, L=0.5. New: M=0.5, L=0.35.
Evidence: METR Time Horizons — 24k runs, 228 tasks (Jan 2025–Feb 2026). Autonomous completion: M=0.25, L=0.15. Our values sit between METR rates and 1.0 (partial-credit acceleration).

3. PERT → log-normal weighting
Old: (min + 4×midpoint + max) / 6 (beta assumption).
New: (min + 4×geometric_mean + max) / 6 (log-normal).
Evidence: KS goodness-of-fit test (n=84k) — log-normal beats beta in all 4 size bands. Software effort has heavier right tails than beta captures.
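The weighting change can be sketched as follows; here `geometric_mean` is assumed to be sqrt(min × max), which may differ from the skill's exact definition in references/formulas.md.

```python
import math

def pert_lognormal(min_h: float, max_h: float) -> float:
    # Log-normal-weighted PERT: the midpoint is replaced by a geometric mean.
    # Assumption: geometric mean of min and max; the skill may define it differently.
    g = math.sqrt(min_h * max_h)
    return (min_h + 4 * g + max_h) / 6

def pert_beta(min_h: float, mid: float, max_h: float) -> float:
    # Old beta-assumption form, kept for comparison.
    return (min_h + 4 * mid + max_h) / 6
```

Because the geometric mean sits below the arithmetic midpoint for skewed ranges, the log-normal form pulls the expected value toward the low end while the wide max still dominates the right tail.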

4. Review minutes halved
Old: S=30, M=60, L=120, XL=240 (standard depth).
New: S=20, M=30, L=60, XL=120.
Evidence: Project-22 reviews (n=290 joined) — 17-20 min median regardless of size. A conservative reduction, since this data covers review of human-written code, not AI-generated code.

5. Output token ratio: S/M raised
Old: S=0.25, M=0.28. New: S=0.30, M=0.30.
Evidence: Aider leaderboard (n=23 models) — 0.31 median output ratio (excluding reasoning tokens).

New Validation Infrastructure

  • tests/deep_validation.py — 11-analysis script producing Parameter Audit Cards with bootstrap CIs
  • datasets/download_benchmarks.sh — downloads METR, OpenHands, Aider benchmark data
  • Bundled datasets: tokenomics.csv (arXiv 2601.14470), onprem-tokens.csv
  • 19 new tests (63 total) encoding research-backed properties

Data Sources

| Dataset | Size | What It Validates |
| --- | --- | --- |
| CESAW | 61k tasks | Distribution fitting, confidence multipliers, base rounds |
| SiP | 12k tasks | Task type multipliers, confidence multipliers |
| Project-22 | 616 stories + 1441 reviews | Review times, human fix ratio |
| Renzo Pomodoro | 10k tasks | Distribution fitting, confidence multipliers |
| METR Time Horizons | 24k runs / 228 tasks | Agent effectiveness by size |
| Aider Leaderboard | ~50 models / 93 entries | Output token ratio, cost model |
| Tokenomics (ChatDev) | 30 tasks / 6 phases | Per-phase token distribution |

Version

0.4.0 → 0.5.0

Files Changed

| File | Changes |
| --- | --- |
| references/formulas.md | 5 parameter updates, log-normal PERT, size-dependent multipliers |
| tests/test_formulas.py | Updated constants + 19 new validation tests (63 total) |
| tests/deep_validation.py | New — 11-analysis deep validation script |
| datasets/download_benchmarks.sh | New — benchmark data download script |
| datasets/benchmarks/*.csv | New — bundled token datasets |
| docs/deep-validation-findings.md | New — combined findings document |
| datasets/README.md | Added benchmark download instructions |
| .gitignore | Added benchmark data exclusions |
| SKILL.md | Version bump, updated descriptions |
| README.md | Updated parameters, research sources, file structure |
| evals/eval-regression.md | Updated baselines for new values |

v0.4.0 — Token Estimation, 4 New PM Tools

11 Mar 19:41
1a0edba


What's New

Token & Cost Estimation (Step 15)

  • Computes per-task token consumption based on complexity × maturity lookup tables
  • Splits into input/output tokens using complexity-specific output ratios
  • PERT expected tokens with min/max range
  • Optional API cost estimation across three model tiers:
    • Economy — Haiku, GPT-4o Mini, Gemini Flash
    • Standard — Sonnet, GPT-4o, Gemini 2.5 Pro
    • Premium — Opus, GPT-5
  • Includes per-model pricing reference table (last verified March 2026)
  • Tokens appear in the one-line summary: 10-26 agent rounds (~180k tokens)
  • Token Estimate and Model Tier rows added to breakdown table
  • Cost is shown only when show_cost=true (off by default)
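A minimal sketch of the input/output split, assuming the complexity-specific output ratio is applied to a total token figure; the function name and signature are illustrative, not the skill's actual estimate_tokens() API.

```python
def split_tokens(total_tokens: int, output_ratio: float) -> tuple[int, int]:
    # Split a total token budget into (input, output) using a
    # complexity-specific output ratio, e.g. 0.30 for S/M tasks in v0.5.0.
    output_tokens = round(total_tokens * output_ratio)
    return total_tokens - output_tokens, output_tokens
```

Under this sketch, a ~180k-token task at a 0.30 output ratio splits into roughly 126k input and 54k output tokens, which is what the per-tier cost tables would then price.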

4 New PM Tool Integrations

  • Asana — custom fields for sizing, risk, agent rounds, tokens; subtasks native
  • Azure DevOps — tags for size/risk, native Original Estimate, HTML table in Description
  • Zenhub — GitHub labels + body sections, Zenhub Estimate for points
  • Shortcut — labels for size/risk, custom fields on Team plan+, native Estimate for points

Total supported trackers: 10 (was 6)

New Inputs

  • model_tier — economy/standard/premium (default: standard)
  • show_cost — boolean (default: false)
  • Question #14 added to detailed questionnaire path

Test Coverage

  • estimate_tokens() function with full Step 15 math
  • Token estimation integrated into main estimate() pipeline
  • TestCase7 — exact token math, output ratios, cost arithmetic
  • TestCase8 — complexity scaling, multi-agent multiplier, maturity variation, PERT cost formula, tier pricing comparison
  • TestCase9 — integration: token_estimate present in estimate() output, correct structure, rounds consistency, show_cost toggle
  • 44 tests total (was 25), all passing
  • Original 6 test cases unchanged (backward compatible)

Version

0.3.0 → 0.4.0

Files Changed

| File | Changes |
| --- | --- |
| references/formulas.md | Token lookup tables, Step 15 formula, output schema fields |
| tests/test_formulas.py | estimate_tokens(), TestCase7-9, integration into estimate() |
| references/questionnaire.md | Question #14, mapping table update |
| references/output-schema.md | Token rows in templates, 4 new PM tool sections, batch Tokens column |
| SKILL.md | Tracker list, pipeline step, version bump |
| README.md | Feature list, tracker list |

v0.3.0 — Instant Mode, Cooperation Modes, Research Grounding

08 Mar 23:01


What's New

Three Estimation Modes

  • Instant — zero questions, infer everything from the task description
  • Quick — 4 questions with sensible defaults
  • Detailed — 13 questions, full control over every parameter

Cooperation Mode Detection

Auto-detects your team's working style from intake answers:

  • Human-only → story points for sizing and velocity
  • Hybrid → dual-track: points for sizing, hours for planning
  • Agent-first → plan by human review capacity in hours

Small Council Validation

Subagent perspectives (Optimist, Skeptic, Historian) review estimates for M+ tasks before output.

Sprint Velocity Tracking

Mode-aware velocity tracking with rolling averages, capacity planning, and divergence detection for hybrid teams.
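A rolling-average velocity could be computed as below; this is a sketch under assumptions (window size, function name), not the skill's actual implementation.

```python
def rolling_velocity(completed_points: list[float], window: int = 3) -> float:
    """Average points completed over the most recent sprints.

    Assumption: a simple trailing window; the skill's mode-aware tracking
    may weight sprints or track hours instead of points.
    """
    recent = completed_points[-window:]
    return sum(recent) / len(recent)
```

Divergence detection for hybrid teams would then compare the points-based and hours-based rolling averages against each other.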

Research Grounding

Formulas now cite 15+ empirical studies:

  • METR Time Horizons (R²=0.83, ~230 tasks) — agent effectiveness decay
  • METR RCT (16 devs, 246 tasks) — productivity measurement
  • Faros AI (10k devs, 1.2k teams) — review bottleneck validation
  • PairCoder (400+ tasks) — human time dominance confirmation
  • Agent effectiveness clarified as work acceleration, not autonomous completion rate
  • Bug-fix multiplier bumped 1.2x → 1.3x based on empirical data

GitHub Action

Auto-estimate issues labeled needs-estimate with PERT formulas.

AskUserQuestion Integration

Structured dropdowns for all user interactions when the tool is available.

Full Changelog

v0.1.1...v0.3.0

v0.2.0 — Multi-skill repo structure

07 Mar 20:28


Agent Skills v0.2.0

Breaking changes

  • Repo renamed from progressive-estimation to agent-skills (GitHub redirects old URLs)
  • Skill files moved to skills/progressive-estimation/ for multi-skill support

What's new

  • Multi-skill repo structure — ready for additional skills under skills/
  • New README — landing page listing all available skills
  • Updated install commands:
    # Via skills.sh
    npx skills add Enreign/agent-skills
    
    # Manual (Claude Code)
    git clone https://github.com/Enreign/agent-skills.git ~/.claude/skills/agent-skills

Skills included

  • progressive-estimation — PERT estimation for AI-assisted development work

⚠️ Early development — formulas, multipliers, and defaults may change between versions.

v0.1.1 — skills.sh compatibility + structured intake

07 Mar 20:17


Progressive Estimation v0.1.1

What's new

  • skills.sh compatibility — install with npx skills add Enreign/progressive-estimation
  • YAML frontmatter in SKILL.md for automatic skill discovery
  • Structured intake — uses AskUserQuestion tool when available for dropdown-style questionnaire instead of free-text back-and-forth
  • Updated banner — new Estimation artwork

Install

# Via skills.sh (recommended)
npx skills add Enreign/progressive-estimation

# Manual (Claude Code)
git clone https://github.com/Enreign/progressive-estimation.git ~/.claude/skills/progressive-estimation

Or download the .skill archive from this release.


⚠️ Early development — formulas, multipliers, and defaults may change between versions.

v0.1.0 — Initial Public Release

07 Mar 19:49


Progressive Estimation v0.1.0

First public release of the Progressive Estimation skill for AI coding assistants.

What's included

  • PERT three-point estimation with confidence bands (68%, 95%)
  • Four workflow paths: quick/detailed × single/batch
  • Agent effectiveness decay modeling based on METR research
  • Calibration system with PRED(25) tracking
  • Tracker output for Linear, JIRA, ClickUp, GitHub Issues, Monday, and GitLab
  • Anti-pattern guards for common estimation mistakes
  • Eval suite for regression, batch, quick, and hybrid scenarios
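The 68%/95% bands follow from classic beta-PERT, where sigma is (max − min)/6 (v0.5.0 later replaced the midpoint weighting with a log-normal form); a sketch:

```python
def pert_with_bands(min_h: float, mid: float, max_h: float) -> dict:
    # Classic beta-PERT expected value and standard deviation.
    mean = (min_h + 4 * mid + max_h) / 6
    sigma = (max_h - min_h) / 6  # standard PERT spread
    return {
        "expected": mean,
        "band_68": (mean - sigma, mean + sigma),      # ~1 sigma
        "band_95": (mean - 2 * sigma, mean + 2 * sigma),  # ~2 sigma
    }
```

For a 2/5/8-hour estimate this gives an expected 5 hours, a 68% band of 4-6 hours, and a 95% band of 3-7 hours.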

Install

git clone https://github.com/Enreign/progressive-estimation.git ~/.claude/skills/progressive-estimation

Or download the .skill archive from this release.


⚠️ Early development — formulas, multipliers, and defaults may change between versions.