Research

Enreign edited this page Mar 12, 2026 · 4 revisions

The estimation model is informed by academic research and industry studies.

Agent Effectiveness

METR — Time Horizon Analysis (March 2025, continuously updated)

  • 24,008 runs across 228 tasks, Jan 2025–Feb 2026
  • Agents achieve ~100% success on tasks taking humans <4 minutes
  • Success drops to <10% on tasks taking humans >4 hours
  • Autonomous completion by size band: S=0.83, M=0.25, L=0.15, XL=0.22 (frontier models)
  • 50% success time horizon for frontier models (early 2026): Claude Opus 4.6 ~14.5 hrs, Claude Opus 4.5 ~4h49m, GPT-5 ~2h17m
  • Agent capability "time horizon" doubles every ~7 months (accelerated to ~4 months in 2024-2025)
  • Source | TH 1.1 Update | Paper

How we use it: Agent effectiveness decay table (S=0.9, M=0.5, L=0.35, XL=0.3). These measure work acceleration, not autonomous completion — see Agent Effectiveness for the distinction. Values calibrated in v0.5.0 using METR autonomous rates as lower bound with partial-credit adjustment.

Model vintage: Capabilities double ~every 7 months. Our values reflect early-2026 frontier models and should be re-calibrated with newer models.
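One plausible way to read the decay table is as the fraction of baseline implementation effort the agent absorbs, leaving the remainder (plus review and fix time, handled elsewhere in the model) to the human. A minimal sketch under that assumption — the function name and pass-through semantics are illustrative, not the model's actual API:

```python
# Agent effectiveness decay table (v0.5.0), by task size band.
EFFECTIVENESS = {"S": 0.9, "M": 0.5, "L": 0.35, "XL": 0.3}

def remaining_human_hours(baseline_hours: float, size: str) -> float:
    """Assumption for illustration: the agent absorbs a fraction `e`
    of baseline implementation effort; the human does the rest."""
    e = EFFECTIVENESS[size]
    return baseline_hours * (1 - e)

# A 10-hour M task leaves ~5 hours of human implementation work.
print(remaining_human_hours(10, "M"))  # 5.0
```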

METR — AI Developer Productivity RCT (July 2025)

  • Randomized controlled trial: 16 experienced developers, 246 real issues
  • 143 hours of screen recordings manually labeled at ~10-second resolution
  • Result: Developers using AI tools took 19% longer on complex codebases
  • Developers believed they were 20% faster (40-percentage-point perception gap)
  • Source | Paper | 2026 Update

Toby Ord — Half-Life Model for AI Agents

  • Success rate declines exponentially: success = e^(-λ × task_duration)
  • Each agent characterized by a "half-life" — constant failure probability per unit time
  • Later refined: agents show declining hazard rates over a task, so they beat the constant-hazard exponential model on longer tasks
  • Paper
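The half-life model can be sketched directly: a constant hazard λ relates to the half-life via λ = ln(2) / T½, and success decays exponentially with task duration.

```python
import math

def success_rate(task_hours: float, half_life_hours: float) -> float:
    """Ord's half-life model: constant per-unit-time failure hazard
    lambda = ln(2) / half_life gives success = exp(-lambda * t)."""
    lam = math.log(2) / half_life_hours
    return math.exp(-lam * task_hours)

# An agent with a 2-hour half-life succeeds ~50% of the time on
# 2-hour tasks and ~25% on 4-hour tasks.
print(round(success_rate(2, 2), 2))  # 0.5
print(round(success_rate(4, 2), 2))  # 0.25
```

Ord's later refinement (declining hazard rates) would replace the constant λ with a decreasing function of elapsed time.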

SWE-bench Data by Difficulty

  • Easy tasks (<15 min fix, 196 tasks): top systems achieve 80%+ pass rate
  • Hard tasks (>1 hour fix, 45 tasks): top systems drop to 20-25%
  • Lines changed increase 11× from easy to hard
  • SWE-bench Pro (longer tasks): best models ~23% on public set; Claude Opus 4.6 59%
  • SWE-bench Pro | Difficulty Breakdown

Productivity Studies

Google RCT (2025)

  • 96 engineers, controlled experiment
  • AI group completed tasks 21% faster (96 min vs 114 min)
  • Source

GitHub Copilot RCT (Peng et al., 2023)

  • Controlled experiment with HTTP server task
  • Copilot group completed tasks 55% faster
  • Caveat: single scoped task, vendor-sponsored
  • Paper

Microsoft/Accenture Field Experiment

  • Developers using Copilot completed 26% more tasks
  • Code commits up 13.5%, compilation frequency up 38.4%
  • Source

Uplevel Study (~800 Engineers)

  • Objective telemetry (not surveys)
  • No statistically significant change in PR throughput, cycle time, or merge time
  • Bug rates increased 41% for Copilot users
  • Source

Faros AI (10,000+ Developers, 1,255 Teams)

  • Individual gains: +21% tasks completed, +98% PRs merged per developer
  • Organizational paradox: company-level delivery metrics remained flat
  • PR review times ballooned 91% — human approval became the bottleneck
  • Source

How we use it: Validates that our formula correctly weights human review/fix time as the dominant cost component, not agent execution time.

PairCoder (400+ Tasks with Time Tracking)

  • Agent execution: 5-15 minutes per task
  • Human review/validation: 30-90 minutes
  • Re-prompting & correction cycles: 15-60 minutes
  • A wall-clock to agent-time ratio exceeding 10:1 signals problems
  • Source

How we use it: Confirms our model structure — agent rounds are a small fraction of total time; human review, planning, and fix time dominate.

Anthropic Productivity Research (100k Conversations)

  • Median time savings: 84% per conversation
  • 79% of Claude Code conversations are automation tasks
  • Caveat: cannot account for post-conversation validation time
  • Source

Task Type Multipliers

Bug-Fix Overhead

  • Developers spend 30-40% of total development capacity on bug fixing
  • Investigation and debugging consume 50%+ of total bug-fix time
  • Our multiplier: 1.3× (raised from 1.2× in light of this empirical data)
  • Source

Data Migration Risk

  • 83% of data migration projects fail or exceed budgets/schedules (Gartner)
  • Average cost overrun: 30%, average time overrun: 41%
  • Pre-migration planning consumes 50-70% of total effort
  • Our multiplier: 2.0× (empirically supported, possibly conservative)
  • Source

Testing Effort

  • Testing accounts for 20-40% of total software development cost
  • Effort to write automated tests scales roughly with coding effort but typically exceeds it
  • Our multiplier: 1.3× (within empirical range of 1.2-1.5×)

Software Phase Distribution (Yang & He, ESEM 2008)

  • n=75 projects from CSBSG database
  • Design phase: 14-17% of total project effort
  • Supports our design multiplier of 1.2×
  • Paper
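Applied together, the multipliers above scale a base estimate by task type. A minimal sketch — the dictionary keys and the pass-through default for unlisted types are assumptions of this illustration, not the model's actual interface:

```python
# Task-type multipliers from the sections above (v0.5.0 values).
MULTIPLIERS = {
    "bug_fix": 1.3,
    "data_migration": 2.0,
    "testing": 1.3,
    "design": 1.2,
}

def adjusted_estimate(base_hours: float, task_type: str) -> float:
    """Scale a base estimate by task type; unlisted types pass
    through unchanged (an assumption here)."""
    return base_hours * MULTIPLIERS.get(task_type, 1.0)

print(adjusted_estimate(4.0, "data_migration"))  # 8.0
```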

Estimation Methods

PERT Three-Point Estimation (Log-Normal)

  • Weighted average: (O + 4M + P) / 6
  • Standard deviation: (P - O) / 6
  • Produces confidence intervals at 68%, 95%, 99.7%
  • Wikipedia
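The formulas above in code form (this is the classic beta-assumption version; as noted elsewhere on this page, v0.5.0 swaps in a geometric-mean most-likely value for log-normal weighting):

```python
def pert(optimistic: float, most_likely: float, pessimistic: float):
    """Classic three-point PERT: weighted mean (O + 4M + P)/6 and
    standard deviation (P - O)/6."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    sd = (pessimistic - optimistic) / 6
    return expected, sd

e, sd = pert(2, 4, 12)
# 68% / 95% / 99.7% intervals are e +/- 1, 2, 3 standard deviations.
intervals = {k: (e - k * sd, e + k * sd) for k in (1, 2, 3)}
print(e, round(sd, 2))  # 5.0 1.67
```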

How we use it: Every estimate produces a PERT expected value and SD. As of v0.5.0, we use the geometric mean as the most-likely value (log-normal weighting) instead of the arithmetic midpoint (beta assumption). KS goodness-of-fit tests on 84k tasks showed log-normal fits better in all 4 size bands. See PERT Statistics.

James Shore — Risk Management for Commitments (2008)

  • Risk multipliers by confidence level
  • Separate "expected" from "committed" estimates
  • Three-tier stakeholder communication (commitment / stretch / low priority)
  • Source

How we use it: Inspired the confidence level approach. As of v0.5.0, multipliers are size-dependent (derived from 84k estimate-actual pairs) rather than Shore's original flat values. See Confidence Levels.

Derek Jones — Software Estimation Datasets

  • 86k+ tasks across 4 datasets: CESAW (61k), SiP (12k), Project-22 (616 stories + 1441 reviews), Renzo (10k)
  • Source

How we use it: Primary data source for v0.5.0 deep validation. Used to derive size-dependent confidence multipliers, validate PERT distribution assumption, calibrate review times, estimate human fix ratio, and reverse-engineer implied base rounds.

Aider Leaderboard

  • ~50 models with cost, tokens, and pass rates on coding benchmarks
  • Source

How we use it: Validated output token ratio (median 0.31 excluding reasoning tokens) and cost model (economy/premium tiers well-calibrated, standard tier over-estimates for simple cases).

Calibration

Jorgensen & Grimstad — Expert Estimation Review (2004)

  • Most expert estimates fall within 20-30% of actuals
  • Calibration feedback improves accuracy from 64% to 81%
  • Feedback loops are the most effective accuracy improvement
  • Source

How we use it: PRED(25) target (75%), calibration feedback loop, incremental adjustment rules.
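PRED(N) itself is simple to compute: the fraction of estimates that land within N% of the actual value. A minimal sketch:

```python
def pred(estimates, actuals, threshold=0.25):
    """PRED(N): fraction of estimates within N% of the actual."""
    within = sum(
        abs(est - act) / act <= threshold
        for est, act in zip(estimates, actuals)
    )
    return within / len(actuals)

# Three of four estimates land within 25% of the actual.
print(pred([10, 5, 8, 2], [9, 6, 16, 2]))  # 0.75
```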

Uncertainty

Construx — Cone of Uncertainty

  • At project start, estimates can be off by 4× in either direction (0.25× to 4×)
  • Narrows only when decisions eliminate variability, not by time passing
  • Requirements complete → 1.5× spread; Design complete → 1.1× spread
  • Source

How we use it: Definition phase spread multipliers (concept=2.0×, requirements=1.5×, design=1.2×, ready=1.0×).
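The phase spread multipliers translate a point estimate into a range multiplicatively, following the cone's "off by X in either direction" shape. A short sketch:

```python
# Definition-phase spread multipliers used by the model.
SPREAD = {"concept": 2.0, "requirements": 1.5, "design": 1.2, "ready": 1.0}

def estimate_range(point_estimate: float, phase: str):
    """Symmetric multiplicative range: divide and multiply the
    point estimate by the phase spread."""
    s = SPREAD[phase]
    return point_estimate / s, point_estimate * s

print(estimate_range(8.0, "concept"))  # (4.0, 16.0)
```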

Traditional Estimation Accuracy

  • Most estimation models are accurate within 30% of actual cost only 57% of the time
  • No proof any model consistently achieves within-25% accuracy 75% of the time without calibration
  • ML-assisted estimation with 6+ months of historical data can improve accuracy 25-40%

Industry Data

DORA 2025 Report

  • 90% of respondents use AI tools at work (up 14% YoY)
  • AI correlates with higher throughput but also higher instability
  • AI amplifies existing team dynamics — good teams get better, struggling teams get worse
  • Source

Jellyfish: 2025 AI Metrics in Review

  • AI coding tool adoption: 61% → 90% of teams in one year
  • Only ~30% of organizations seeing substantial productivity gains despite 90% adoption
  • Companies with 80-100% developer adoption saw 110%+ gains
  • Source

Adoption & Perception

  • 84% of organizations have integrated AI into pipelines (Digital.ai)
  • 66% of developers say AI code is "almost right, but not quite" (Stack Overflow)
  • 45.2% cite time debugging AI output as a significant cost

Compounding Error in Agents

A critical factor in why L/XL tasks fail disproportionately:

  • 90% per-step accuracy across 10 steps = 35% end-to-end success
  • 95% per-step across 5 steps = 77% success
  • 98% per-step across 10 steps = ~82% success
  • Source | O'Reilly

How we use it: Explains why agent effectiveness drops non-linearly with task size. More steps = more compounding failure probability.
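The arithmetic behind these figures is simply p^n, assuming step failures are independent:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Independent-step assumption: success compounds as p ** n."""
    return per_step_accuracy ** steps

# The figures quoted above:
print(round(end_to_end_success(0.90, 10), 2))  # 0.35
print(round(end_to_end_success(0.95, 5), 2))   # 0.77
print(round(end_to_end_success(0.98, 10), 2))  # 0.82
```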

Project Outcomes

Standish CHAOS Reports

  • Original 1994 report claimed 16.2% on-time/on-budget, 189% average cost overrun
  • Subsequent academic review found severe sampling bias toward failure projects
  • Real-world overruns are significant but likely less dramatic than CHAOS suggests
  • Analysis

How we use it: Context for why estimation matters, but with caveats about the data.
