Research
The estimation model is informed by academic research and industry studies.
- 24,008 runs across 228 tasks, Jan 2025–Feb 2026
- Agents achieve ~100% success on tasks taking humans <4 minutes
- Success drops to <10% on tasks taking humans >4 hours
- Autonomous completion by size band: S=0.83, M=0.25, L=0.15, XL=0.22 (frontier models)
- 50% success time horizon for frontier models (early 2026): Claude Opus 4.6 ~14 hr 30 min, Claude Opus 4.5 ~4 hr 49 min, GPT-5 ~2 hr 17 min
- Agent capability "time horizon" doubles every ~7 months (accelerated to ~4 months in 2024-2025)
- Source | TH 1.1 Update | Paper
How we use it: Agent effectiveness decay table (S=0.9, M=0.5, L=0.35, XL=0.3). These measure work acceleration, not autonomous completion — see Agent Effectiveness for the distinction. Values calibrated in v0.5.0 using METR autonomous rates as lower bound with partial-credit adjustment.
Model vintage: Capabilities double ~every 7 months. Our values reflect early-2026 frontier models and should be re-calibrated with newer models.
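The decay table above can be sketched as a simple lookup. The size bands and values are from the document; the function name and the remaining-effort interpretation (effectiveness as the fraction of work the agent absorbs) are illustrative assumptions, not the model's definitive formula:

```python
# Agent effectiveness (work acceleration) by size band, per the table above.
AGENT_EFFECTIVENESS = {"S": 0.9, "M": 0.5, "L": 0.35, "XL": 0.3}

# METR autonomous completion rates, used as the calibration lower bound.
METR_AUTONOMOUS = {"S": 0.83, "M": 0.25, "L": 0.15, "XL": 0.22}

def remaining_human_hours(human_hours: float, size: str) -> float:
    """Human-equivalent effort left after agent acceleration.
    Interpretation is an assumption: effectiveness = fraction of work absorbed."""
    return human_hours * (1 - AGENT_EFFECTIVENESS[size])
```

For example, an M-sized task estimated at 10 human hours would leave roughly 5 hours of human-side work under this reading.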
- Randomized controlled trial: 16 experienced developers, 246 real issues
- 143 hours of screen recordings manually labeled at ~10-second resolution
- Result: Developers using AI tools took 19% longer on complex codebases
- Developers believed they were 20% faster (40-percentage-point perception gap)
- Source | Paper | 2026 Update
- Success rate declines exponentially: success = e^(−λ × task_duration)
- Each agent is characterized by a "half-life" — a constant failure probability per unit time
- Later refined: agents have declining hazard rates (better than exponential on longer tasks)
- Paper
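The half-life model above can be expressed directly: λ = ln(2) / half-life, so success probability halves for every half-life of task duration. The 4-hour half-life in the usage line is illustrative, not a measured value:

```python
import math

def success_probability(task_minutes: float, half_life_minutes: float) -> float:
    """Constant-hazard (exponential) model: success = exp(-lambda * t),
    where lambda = ln(2) / half-life. The half-life is the task length
    at which the agent's success rate drops to 50%."""
    lam = math.log(2) / half_life_minutes
    return math.exp(-lam * task_minutes)

# Illustrative agent with a 4-hour (240 min) half-life:
success_probability(240, 240)  # exactly 0.5 by construction
success_probability(60, 240)   # a 1-hour task succeeds ~84% of the time
```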
- Easy tasks (<15 min fix, 196 tasks): top systems achieve 80%+ pass rate
- Hard tasks (>1 hour fix, 45 tasks): top systems drop to 20-25%
- Lines changed increase 11× from easy to hard
- SWE-bench Pro (longer tasks): best models ~23% on public set; Claude Opus 4.6 59%
- SWE-bench Pro | Difficulty Breakdown
- 96 engineers, controlled experiment
- AI group completed tasks 21% faster (96 min vs 114 min)
- Source
- Controlled experiment with HTTP server task
- Copilot group completed tasks 55% faster
- Caveat: single scoped task, vendor-sponsored
- Paper
- Developers using Copilot completed 26% more tasks
- Code commits up 13.5%, compilation frequency up 38.4%
- Source
- Objective telemetry (not surveys)
- No statistically significant change in PR throughput, cycle time, or merge time
- Bug rates increased 41% for Copilot users
- Source
- Individual gains: +21% tasks completed, +98% PRs merged per developer
- Organizational paradox: company-level delivery metrics remained flat
- PR review times ballooned 91% — human approval became the bottleneck
- Source
How we use it: Validates that our formula correctly weights human review/fix time as the dominant cost component, not agent execution time.
- Agent execution: 5-15 minutes per task
- Human review/validation: 30-90 minutes
- Re-prompting & correction cycles: 15-60 minutes
- Wall-clock:agent-time ratio exceeding 10:1 signals problems
- Source
How we use it: Confirms our model structure — agent rounds are a small fraction of total time; human review, planning, and fix time dominate.
- Median time savings: 84% per conversation
- 79% of Claude Code conversations are automation tasks
- Caveat: cannot account for post-conversation validation time
- Source
- Developers spend 30-40% of total development capacity on bug fixing
- Investigation and debugging consume 50%+ of total bug-fix time
- Our multiplier: 1.3× (up from 1.2× based on empirical data)
- Source
- 83% of data migration projects fail or exceed budgets/schedules (Gartner)
- Average cost overrun: 30%, average time overrun: 41%
- Pre-migration planning consumes 50-70% of total effort
- Our multiplier: 2.0× (empirically supported, possibly conservative)
- Source
- Testing accounts for 20-40% of total software development cost
- Effort to write automated tests is roughly proportional to coding effort, and often exceeds it
- Our multiplier: 1.3× (within empirical range of 1.2-1.5×)
- n=75 projects from CSBSG database
- Design phase: 14-17% of total project effort
- Supports our design multiplier of 1.2×
- Paper
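Taken together, the task-type multipliers quoted in this section can be applied as a single scaling step. The values come from the document; the dictionary keys and function name are hypothetical:

```python
# Task-type multipliers from the studies above (bug fix 1.3x, migration 2.0x,
# testing 1.3x, design 1.2x). Key names are illustrative.
TASK_TYPE_MULTIPLIER = {
    "bug_fix": 1.3,
    "migration": 2.0,
    "testing": 1.3,
    "design": 1.2,
}

def adjusted_estimate(base_hours: float, task_type: str) -> float:
    """Scale a base estimate by the empirical task-type multiplier;
    unknown task types fall back to 1.0 (no adjustment)."""
    return base_hours * TASK_TYPE_MULTIPLIER.get(task_type, 1.0)
```

An 8-hour base estimate for a migration task would become 16 hours under the 2.0× multiplier.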
- Weighted average: (O + 4M + P) / 6
- Standard deviation: (P - O) / 6
- Produces confidence intervals at 68%, 95%, 99.7%
- Wikipedia
How we use it: Every estimate produces a PERT expected value and SD. As of v0.5.0, we use the geometric mean as the most-likely value (log-normal weighting) instead of the arithmetic midpoint (beta assumption). KS goodness-of-fit tests on 84k tasks showed log-normal fits better in all 4 size bands. See PERT Statistics.
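The PERT formulas above, plus the v0.5.0 geometric-mean variant, can be sketched as follows. The assumption that the geometric mean is taken over O and P is mine; the document says only "geometric mean as the most-likely value":

```python
import math

def pert_estimate(optimistic: float, most_likely: float, pessimistic: float):
    """Classic PERT: beta-weighted expected value and standard deviation."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    sd = (pessimistic - optimistic) / 6
    return expected, sd

def geometric_most_likely(optimistic: float, pessimistic: float) -> float:
    """Log-normal weighting (v0.5.0 assumption): geometric mean of O and P
    replaces the arithmetic midpoint as the most-likely value."""
    return math.sqrt(optimistic * pessimistic)

# Illustrative values: O=2h, P=18h -> geometric most-likely = 6h
e, sd = pert_estimate(2, geometric_most_likely(2, 18), 18)
# 68% interval: e +/- sd; 95%: e +/- 2*sd; 99.7%: e +/- 3*sd
```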
- Risk multipliers by confidence level
- Separate "expected" from "committed" estimates
- Three-tier stakeholder communication (commitment / stretch / low priority)
- Source
How we use it: Inspired the confidence level approach. As of v0.5.0, multipliers are size-dependent (derived from 84k estimate-actual pairs) rather than Shore's original flat values. See Confidence Levels.
- 86k+ tasks across 4 datasets: CESAW (61k), SiP (12k), Project-22 (616 stories + 1441 reviews), Renzo (10k)
- Source
How we use it: Primary data source for v0.5.0 deep validation. Used to derive size-dependent confidence multipliers, validate PERT distribution assumption, calibrate review times, estimate human fix ratio, and reverse-engineer implied base rounds.
- ~50 models with cost, tokens, and pass rates on coding benchmarks
- Source
How we use it: Validated output token ratio (median 0.31 excluding reasoning tokens) and cost model (economy/premium tiers well-calibrated, standard tier over-estimates for simple cases).
- Most expert estimates fall within 20-30% of actuals
- Calibration feedback improves accuracy from 64% to 81%
- Feedback loops are the most effective accuracy improvement
- Source
How we use it: PRED(25) target (75%), calibration feedback loop, incremental adjustment rules.
- At project start, estimates can be off by 4× in either direction (0.25× to 4×)
- Narrows only when decisions eliminate variability, not by time passing
- Requirements complete → 1.5× spread; Design complete → 1.1× spread
- Source
How we use it: Definition phase spread multipliers (concept=2.0×, requirements=1.5×, design=1.2×, ready=1.0×).
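The phase spread multipliers can be turned into a low/high range around a point estimate. The divide-low/multiply-high interpretation follows the "4× in either direction (0.25× to 4×)" pattern described above; the function name is illustrative:

```python
# Definition-phase spread multipliers from the document.
PHASE_SPREAD = {"concept": 2.0, "requirements": 1.5, "design": 1.2, "ready": 1.0}

def estimate_range(point_estimate: float, phase: str):
    """Symmetric multiplicative range: divide for the low bound,
    multiply for the high bound, per the cone-of-uncertainty pattern."""
    s = PHASE_SPREAD[phase]
    return point_estimate / s, point_estimate * s

estimate_range(10, "requirements")  # 10h estimate spreads to roughly 6.7-15h
```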
- Most estimation models are accurate within 30% of actual cost only 57% of the time
- No proof any model consistently achieves within-25% accuracy 75% of the time without calibration
- ML-assisted estimation with 6+ months of historical data can improve accuracy 25-40%
- 90% of respondents use AI tools at work (up 14% YoY)
- AI correlates with higher throughput but also higher instability
- AI amplifies existing team dynamics — good teams get better, struggling teams get worse
- Source
- AI coding tool adoption: 61% → 90% of teams in one year
- Only ~30% of organizations seeing substantial productivity gains despite 90% adoption
- Companies with 80-100% developer adoption saw 110%+ gains
- Source
- 84% of organizations have integrated AI into pipelines (Digital.ai)
- 66% of developers say AI code is "almost right, but not quite" (Stack Overflow)
- 45.2% cite time debugging AI output as a significant cost
A critical factor in why L/XL tasks fail disproportionately:
- 90% per-step accuracy across 10 steps = 35% end-to-end success
- 95% per-step across 5 steps = 77% success
- 98% per-step across 10 steps = ~82% success
- Source | O'Reilly
How we use it: Explains why agent effectiveness drops non-linearly with task size. More steps = more compounding failure probability.
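The compounding arithmetic above is just repeated multiplication of per-step accuracy, assuming independent steps:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """End-to-end success for a multi-step agent run, assuming each
    step succeeds independently with the same probability."""
    return per_step_accuracy ** steps

# The figures quoted above:
end_to_end_success(0.90, 10)  # ~0.35
end_to_end_success(0.95, 5)   # ~0.77
end_to_end_success(0.98, 10)  # ~0.82
```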
- Original 1994 report claimed 16.2% on-time/on-budget, 189% average cost overrun
- Subsequent academic review found severe sampling bias toward failure projects
- Real-world overruns are significant but likely less dramatic than CHAOS suggests
- Analysis
How we use it: Context for why estimation matters, but with caveats about the data.