Research

Enreign edited this page Mar 12, 2026 · 4 revisions

The estimation model is informed by academic research and industry studies.

Agent Effectiveness

METR — Time Horizon Analysis (March 2025, continuously updated)

  • 24,008 runs across 228 tasks, Jan 2025–Feb 2026
  • Agents achieve ~100% success on tasks taking humans <4 minutes
  • Success drops to <10% on tasks taking humans >4 hours
  • Autonomous completion by size band: S=0.83, M=0.25, L=0.15, XL=0.22 (frontier models)
  • 50% success time horizon for frontier models (early 2026): Claude Opus 4.6 ~14.5 hrs, Claude Opus 4.5 ~4h49m, GPT-5 ~2h17m
  • Agent capability "time horizon" doubles every ~7 months (accelerated to ~4 months in 2024-2025)
  • Source | TH 1.1 Update | Paper

How we use it: Agent effectiveness decay table (S=0.9, M=0.5, L=0.35, XL=0.3). These measure work acceleration, not autonomous completion — see Agent Effectiveness for the distinction. Values calibrated in v0.5.0 using METR autonomous rates as lower bound with partial-credit adjustment.

Model vintage: Capabilities double ~every 7 months. Our values reflect early-2026 frontier models and should be re-calibrated with newer models.
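One plausible way to read the decay table is as the fraction of baseline implementation effort the agent absorbs, leaving the remainder (plus review and fix time, handled elsewhere in the model) to the human. A minimal sketch under that assumption — the function name and pass-through semantics are illustrative, not the model's actual API:

```python
# Agent effectiveness decay table (v0.5.0), by task size band.
EFFECTIVENESS = {"S": 0.9, "M": 0.5, "L": 0.35, "XL": 0.3}

def remaining_human_hours(baseline_hours: float, size: str) -> float:
    """Assumption for illustration: the agent absorbs a fraction `e`
    of baseline implementation effort; the human does the rest."""
    e = EFFECTIVENESS[size]
    return baseline_hours * (1 - e)

# A 10-hour M task leaves ~5 hours of human implementation work.
print(remaining_human_hours(10, "M"))  # 5.0
```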

METR — AI Developer Productivity RCT (July 2025)

  • Randomized controlled trial: 16 experienced developers, 246 real issues
  • 143 hours of screen recordings manually labeled at ~10-second resolution
  • Result: Developers using AI tools took 19% longer on complex codebases
  • Developers believed they were 20% faster (40-percentage-point perception gap)
  • Source | Paper | 2026 Update

Toby Ord — Half-Life Model for AI Agents

  • Success rate declines exponentially: success = e^(-λ × task_duration)
  • Each agent characterized by a "half-life" — constant failure probability per unit time
  • Later refined: agents show declining hazard rates over a task, so they beat the constant-hazard exponential model on longer tasks
  • Paper
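The half-life model can be sketched directly: a constant hazard λ relates to the half-life via λ = ln(2) / T½, and success decays exponentially with task duration.

```python
import math

def success_rate(task_hours: float, half_life_hours: float) -> float:
    """Ord's half-life model: constant per-unit-time failure hazard
    lambda = ln(2) / half_life gives success = exp(-lambda * t)."""
    lam = math.log(2) / half_life_hours
    return math.exp(-lam * task_hours)

# An agent with a 2-hour half-life succeeds ~50% of the time on
# 2-hour tasks and ~25% on 4-hour tasks.
print(round(success_rate(2, 2), 2))  # 0.5
print(round(success_rate(4, 2), 2))  # 0.25
```

Ord's later refinement (declining hazard rates) would replace the constant λ with a decreasing function of elapsed time.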

SWE-bench Data by Difficulty

  • Easy tasks (<15 min fix, 196 tasks): top systems achieve 80%+ pass rate
  • Hard tasks (>1 hour fix, 45 tasks): top systems drop to 20-25%
  • Lines changed increase 11× from easy to hard
  • SWE-bench Pro (longer tasks): best models ~23% on public set; Claude Opus 4.6 59%
  • SWE-bench Pro | Difficulty Breakdown

Productivity Studies

Google RCT (2025)

  • 96 engineers, controlled experiment
  • AI group completed tasks 21% faster (96 min vs 114 min)
  • Source

GitHub Copilot RCT (Peng et al., 2023)

  • Controlled experiment with HTTP server task
  • Copilot group completed tasks 55% faster
  • Caveat: single scoped task, vendor-sponsored
  • Paper

Microsoft/Accenture Field Experiment

  • Developers using Copilot completed 26% more tasks
  • Code commits up 13.5%, compilation frequency up 38.4%
  • Source

Uplevel Study (~800 Engineers)

  • Objective telemetry (not surveys)
  • No statistically significant change in PR throughput, cycle time, or merge time
  • Bug rates increased 41% for Copilot users
  • Source

Faros AI (10,000+ Developers, 1,255 Teams)

  • Individual gains: +21% tasks completed, +98% PRs merged per developer
  • Organizational paradox: company-level delivery metrics remained flat
  • PR review times ballooned 91% — human approval became the bottleneck
  • Source

How we use it: Validates that our formula correctly weights human review/fix time as the dominant cost component, not agent execution time.

PairCoder (400+ Tasks with Time Tracking)

  • Agent execution: 5-15 minutes per task
  • Human review/validation: 30-90 minutes
  • Re-prompting & correction cycles: 15-60 minutes
  • A wall-clock to agent-time ratio exceeding 10:1 signals problems
  • Source

How we use it: Confirms our model structure — agent rounds are a small fraction of total time; human review, planning, and fix time dominate.

Anthropic Productivity Research (100k Conversations)

  • Median time savings: 84% per conversation
  • 79% of Claude Code conversations are automation tasks
  • Caveat: cannot account for post-conversation validation time
  • Source

Task Type Multipliers

Bug-Fix Overhead

  • Developers spend 30-40% of total development capacity on bug fixing
  • Investigation and debugging consume 50%+ of total bug-fix time
  • Our multiplier: 1.3× (raised from 1.2× in light of this empirical data)
  • Source

Data Migration Risk

  • 83% of data migration projects fail or exceed budgets/schedules (Gartner)
  • Average cost overrun: 30%, average time overrun: 41%
  • Pre-migration planning consumes 50-70% of total effort
  • Our multiplier: 2.0× (empirically supported, possibly conservative)
  • Source

Testing Effort

  • Testing accounts for 20-40% of total software development cost
  • Effort to write automated tests scales roughly with coding effort but typically exceeds it
  • Our multiplier: 1.3× (within empirical range of 1.2-1.5×)

Software Phase Distribution (Yang & He, ESEM 2008)

  • n=75 projects from CSBSG database
  • Design phase: 14-17% of total project effort
  • Supports our design multiplier of 1.2×
  • Paper
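Applied together, the multipliers above scale a base estimate by task type. A minimal sketch — the dictionary keys and the pass-through default for unlisted types are assumptions of this illustration, not the model's actual interface:

```python
# Task-type multipliers from the sections above (v0.5.0 values).
MULTIPLIERS = {
    "bug_fix": 1.3,
    "data_migration": 2.0,
    "testing": 1.3,
    "design": 1.2,
}

def adjusted_estimate(base_hours: float, task_type: str) -> float:
    """Scale a base estimate by task type; unlisted types pass
    through unchanged (an assumption here)."""
    return base_hours * MULTIPLIERS.get(task_type, 1.0)

print(adjusted_estimate(4.0, "data_migration"))  # 8.0
```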

Estimation Methods

PERT Three-Point Estimation (Log-Normal)

  • Weighted average: (O + 4M + P) / 6
  • Standard deviation: (P - O) / 6
  • Produces confidence intervals at 68%, 95%, 99.7%
  • Wikipedia
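The formulas above in code form (this is the classic beta-assumption version; as noted elsewhere on this page, v0.5.0 swaps in a geometric-mean most-likely value for log-normal weighting):

```python
def pert(optimistic: float, most_likely: float, pessimistic: float):
    """Classic three-point PERT: weighted mean (O + 4M + P)/6 and
    standard deviation (P - O)/6."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    sd = (pessimistic - optimistic) / 6
    return expected, sd

e, sd = pert(2, 4, 12)
# 68% / 95% / 99.7% intervals are e +/- 1, 2, 3 standard deviations.
intervals = {k: (e - k * sd, e + k * sd) for k in (1, 2, 3)}
print(e, round(sd, 2))  # 5.0 1.67
```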

How we use it: Every estimate produces a PERT expected value and SD. As of v0.5.0, we use the geometric mean as the most-likely value (log-normal weighting) instead of the arithmetic midpoint (beta assumption). KS goodness-of-fit tests on 84k tasks showed log-normal fits better in all 4 size bands. See PERT Statistics.

James Shore — Risk Management for Commitments (2008)

  • Risk multipliers by confidence level
  • Separate "expected" from "committed" estimates
  • Three-tier stakeholder communication (commitment / stretch / low priority)
  • Source

How we use it: Inspired the confidence level approach. As of v0.5.0, multipliers are size-dependent (derived from 84k estimate-actual pairs) rather than Shore's original flat values. See Confidence Levels.

Derek Jones — Software Estimation Datasets

  • 86k+ tasks across 4 datasets: CESAW (61k), SiP (12k), Project-22 (616 stories + 1441 reviews), Renzo (10k)
  • Source

How we use it: Primary data source for v0.5.0 deep validation. Used to derive size-dependent confidence multipliers, validate PERT distribution assumption, calibrate review times, estimate human fix ratio, and reverse-engineer implied base rounds.

Aider Leaderboard

  • ~50 models with cost, tokens, and pass rates on coding benchmarks
  • Source

How we use it: Validated output token ratio (median 0.31 excluding reasoning tokens) and cost model (economy/premium tiers well-calibrated, standard tier over-estimates for simple cases).

Calibration

Jorgensen & Grimstad — Expert Estimation Review (2004)

  • Most expert estimates fall within 20-30% of actuals
  • Calibration feedback improves accuracy from 64% to 81%
  • Feedback loops are the most effective accuracy improvement
  • Source

How we use it: PRED(25) target (75%), calibration feedback loop, incremental adjustment rules.
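PRED(N) itself is simple to compute: the fraction of estimates that land within N% of the actual value. A minimal sketch:

```python
def pred(estimates, actuals, threshold=0.25):
    """PRED(N): fraction of estimates within N% of the actual."""
    within = sum(
        abs(est - act) / act <= threshold
        for est, act in zip(estimates, actuals)
    )
    return within / len(actuals)

# Three of four estimates land within 25% of the actual.
print(pred([10, 5, 8, 2], [9, 6, 16, 2]))  # 0.75
```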

Uncertainty

Construx — Cone of Uncertainty

  • At project start, estimates can be off by 4× in either direction (0.25× to 4×)
  • Narrows only when decisions eliminate variability, not by time passing
  • Requirements complete → 1.5× spread; Design complete → 1.1× spread
  • Source

How we use it: Definition phase spread multipliers (concept=2.0×, requirements=1.5×, design=1.2×, ready=1.0×).
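The phase spread multipliers translate a point estimate into a range multiplicatively, following the cone's "off by X in either direction" shape. A short sketch:

```python
# Definition-phase spread multipliers used by the model.
SPREAD = {"concept": 2.0, "requirements": 1.5, "design": 1.2, "ready": 1.0}

def estimate_range(point_estimate: float, phase: str):
    """Symmetric multiplicative range: divide and multiply the
    point estimate by the phase spread."""
    s = SPREAD[phase]
    return point_estimate / s, point_estimate * s

print(estimate_range(8.0, "concept"))  # (4.0, 16.0)
```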

Traditional Estimation Accuracy

  • Most estimation models are accurate within 30% of actual cost only 57% of the time
  • No proof any model consistently achieves within-25% accuracy 75% of the time without calibration
  • ML-assisted estimation with 6+ months of historical data can improve accuracy 25-40%

Industry Data

DORA 2025 Report

  • 90% of respondents use AI tools at work (up 14% YoY)
  • AI correlates with higher throughput but also higher instability
  • AI amplifies existing team dynamics — good teams get better, struggling teams get worse
  • Source

Jellyfish: 2025 AI Metrics in Review

  • AI coding tool adoption: 61% → 90% of teams in one year
  • Only ~30% of organizations seeing substantial productivity gains despite 90% adoption
  • Companies with 80-100% developer adoption saw 110%+ gains
  • Source

Adoption & Perception

  • 84% of organizations have integrated AI into pipelines (Digital.ai)
  • 66% of developers say AI code is "almost right, but not quite" (Stack Overflow)
  • 45.2% cite time debugging AI output as a significant cost

Compounding Error in Agents

A critical factor in why L/XL tasks fail disproportionately:

  • 90% per-step accuracy across 10 steps = 35% end-to-end success
  • 95% per-step across 5 steps = 77% success
  • 98% per-step across 10 steps = ~82% success
  • Source | O'Reilly

How we use it: Explains why agent effectiveness drops non-linearly with task size. More steps = more compounding failure probability.
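The arithmetic behind these figures is simply p^n, assuming step failures are independent:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Independent-step assumption: success compounds as p ** n."""
    return per_step_accuracy ** steps

# The figures quoted above:
print(round(end_to_end_success(0.90, 10), 2))  # 0.35
print(round(end_to_end_success(0.95, 5), 2))   # 0.77
print(round(end_to_end_success(0.98, 10), 2))  # 0.82
```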

Project Outcomes

Standish CHAOS Reports

  • Original 1994 report claimed 16.2% on-time/on-budget, 189% average cost overrun
  • Subsequent academic review found severe sampling bias toward failure projects
  • Real-world overruns are significant but likely less dramatic than CHAOS suggests
  • Analysis

How we use it: Context for why estimation matters, but with caveats about the data.
