Agent Effectiveness

Enreign edited this page Mar 12, 2026 · 4 revisions

The Core Insight

AI coding agents are not equally effective across all task sizes. Based on METR time horizon research (24k runs, 228 tasks, Jan 2025–Feb 2026):

  • Agents achieve ~100% success on tasks taking humans <4 minutes
  • Success drops to <10% on tasks taking humans >4 hours
  • The agent capability "time horizon" doubles every ~7 months (accelerated to ~4 months in 2024-2025)

Effectiveness vs. Autonomous Completion

Our effectiveness values measure work acceleration — how much the agent speeds up the overall task — not autonomous completion rate. METR's autonomous success rates (from 24k runs) are lower: S≈0.83, M≈0.25, L≈0.15, XL≈0.22 for frontier models. Agents that fail to complete autonomously still produce useful partial output (scaffolding, research, partial implementation). Our values sit between METR's autonomous rates and 1.0, reflecting this partial-credit reality.

This is supported by PairCoder's data (400+ tasks): agent execution time is 5-15 minutes regardless of task size, but human review/correction scales with complexity. The agent always contributes — the question is how much human intervention is needed.

Effectiveness by Task Size

| Complexity | Agent Effectiveness | What This Means |
|------------|--------------------|-----------------|
| S | 90% | Agent handles most of the work. Minimal human correction needed. |
| M | 50% | Agent and human split the work. Significant human intervention for edge cases. |
| L | 35% | Mostly human-driven. Agent assists with subtasks, human steers direction. |
| XL | 30% | Human-driven with agent assist. Agent handles subtasks, human drives architecture. |

How It Affects Estimates

Agent effectiveness adjusts the human fix ratio — how much of the agent's output needs manual correction:

adjusted_fix_ratio = base_fix_ratio + (1 - agent_effectiveness) × 0.3

Examples:

  • S task (0.9 effectiveness): 0.20 + (1 - 0.9) × 0.3 = 0.23 → 23% fix rate
  • M task (0.5 effectiveness): 0.20 + (1 - 0.5) × 0.3 = 0.35 → 35% fix rate
  • L task (0.35 effectiveness): 0.20 + (1 - 0.35) × 0.3 = 0.395 → ~40% fix rate
  • XL task (0.3 effectiveness): 0.20 + (1 - 0.3) × 0.3 = 0.41 → 41% fix rate

This means larger tasks automatically allocate more human time, which matches real-world experience.
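The table and formula above can be sketched as a small calculator. The base fix ratio (0.20), the 0.3 weight, and the per-size effectiveness values come from this page; the function and constant names are illustrative:

```python
# Fix-ratio adjustment from agent effectiveness.
# Values mirror the table and formula on this page.
BASE_FIX_RATIO = 0.20
EFFECTIVENESS = {"S": 0.90, "M": 0.50, "L": 0.35, "XL": 0.30}

def adjusted_fix_ratio(size: str, base: float = BASE_FIX_RATIO) -> float:
    """Fraction of agent output expected to need manual correction."""
    return base + (1 - EFFECTIVENESS[size]) * 0.3

for size in EFFECTIVENESS:
    print(f"{size}: {adjusted_fix_ratio(size):.3f}")
```

Running this reproduces the four worked examples above (0.23, 0.35, 0.395, 0.41).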

The METR Productivity Paradox

A key finding from METR's 2025 randomized controlled trial:

Developers using AI tools took 19% longer on complex, familiar codebases — despite believing they were 20% faster.

This doesn't mean agents aren't useful. It means:

  • Agent overhead (prompting, reviewing, correcting) can exceed time saved on complex work
  • Agents excel at well-scoped, smaller tasks
  • The benefit is in throughput of small tasks, not acceleration of large ones

Implications for Estimation

  1. Don't assume agents make everything faster. For XL tasks, the agent is an assistant, not a driver.
  2. Break large tasks into smaller ones. Each S/M subtask gets high agent effectiveness.
  3. Agent effectiveness improves over time. As models improve, these ratios should be recalibrated. See Calibration.

Compounding Error

A key reason effectiveness drops non-linearly with task size is compounding per-step error:

  • 90% per-step accuracy across 10 steps gives 0.9^10 ≈ 35% end-to-end success
  • 95% per-step accuracy across 5 steps gives 0.95^5 ≈ 77% success

Larger tasks have more steps where errors compound. This is why breaking XL tasks into S/M subtasks is so effective — each subtask gets high per-step accuracy. See O'Reilly and Prodigal.
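A minimal sketch of the compounding math, assuming independent per-step errors (the function name is illustrative):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent errors."""
    return per_step ** steps

print(f"{end_to_end_success(0.90, 10):.0%}")  # 35%
print(f"{end_to_end_success(0.95, 5):.0%}")   # 77%

# Splitting a 10-step task into two 5-step subtasks with a human
# checkpoint between them raises each subtask's odds to ~59%:
print(f"{end_to_end_success(0.90, 5):.0%}")   # 59%
```

This is why decomposition helps: human review between subtasks resets the error accumulation.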

Future-Proofing

The METR time horizon analysis shows agent capabilities doubling every ~7 months (accelerated to ~4 months recently). Specific model time horizons at 50% success: Claude Opus 4.6 ~14.5 hrs, Claude Opus 4.5 ~4h49m, GPT-5 ~2h17m.
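The doubling trend can be extrapolated with a simple exponential model. This is a rough sketch: the ~7-month doubling period and the 14.5-hour horizon come from the text above, and actual capability growth may differ:

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a 50%-success time horizon forward in time."""
    return current_hours * 2 ** (months_ahead / doubling_months)

# A 14.5-hour horizon today, projected 14 months out (two doublings):
print(f"{projected_horizon(14.5, 14):.1f} hours")  # 58.0 hours
```

Under the faster ~4-month doubling observed recently, pass `doubling_months=4.0` for a more aggressive projection.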

These effectiveness values will need periodic recalibration as models improve. Track your team's actual agent effectiveness per size in the Calibration feedback loop.
