Agent Effectiveness

Enreign edited this page Mar 12, 2026 · 4 revisions

The Core Insight

AI coding agents are not equally effective across all task sizes. Based on METR time horizon research (24k runs, 228 tasks, Jan 2025–Feb 2026):

  • Agents achieve ~100% success on tasks taking humans <4 minutes
  • Success drops to <10% on tasks taking humans >4 hours
  • The agent capability "time horizon" doubles every ~7 months (accelerated to ~4 months in 2024-2025)

Effectiveness vs. Autonomous Completion

Our effectiveness values measure work acceleration — how much the agent speeds up the overall task — not autonomous completion rate. METR's autonomous success rates (from 24k runs) are lower: S≈0.83, M≈0.25, L≈0.15, XL≈0.22 for frontier models. Agents that fail to complete autonomously still produce useful partial output (scaffolding, research, partial implementation). Our values sit between METR's autonomous rates and 1.0, reflecting this partial-credit reality.

This is supported by PairCoder's data (400+ tasks): agent execution time is 5-15 minutes regardless of task size, but human review/correction scales with complexity. The agent always contributes — the question is how much human intervention is needed.

Effectiveness by Task Size

| Complexity | Agent Effectiveness | What This Means |
|------------|--------------------|-----------------|
| S | 90% | Agent handles most of the work. Minimal human correction needed. |
| M | 50% | Agent and human split the work. Significant human intervention for edge cases. |
| L | 35% | Mostly human-driven. Agent assists with subtasks, human steers direction. |
| XL | 30% | Human-driven with agent assist. Agent handles subtasks, human drives architecture. |

How It Affects Estimates

Agent effectiveness adjusts the human fix ratio — how much of the agent's output needs manual correction:

adjusted_fix_ratio = base_fix_ratio + (1 - agent_effectiveness) × 0.3

Examples:

  • S task (0.9 effectiveness): 0.20 + (1 - 0.9) × 0.3 = 0.23 → 23% fix rate
  • M task (0.5 effectiveness): 0.20 + (1 - 0.5) × 0.3 = 0.35 → 35% fix rate
  • L task (0.35 effectiveness): 0.20 + (1 - 0.35) × 0.3 = 0.395 → ~40% fix rate
  • XL task (0.3 effectiveness): 0.20 + (1 - 0.3) × 0.3 = 0.41 → 41% fix rate

This means larger tasks automatically allocate more human time, which matches real-world experience.
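The table and formula above can be sketched as a small calculator. The base fix ratio (0.20), the 0.3 weight, and the per-size effectiveness values come from this page; the function and constant names are illustrative:

```python
# Fix-ratio adjustment from agent effectiveness.
# Values mirror the table and formula on this page.
BASE_FIX_RATIO = 0.20
EFFECTIVENESS = {"S": 0.90, "M": 0.50, "L": 0.35, "XL": 0.30}

def adjusted_fix_ratio(size: str, base: float = BASE_FIX_RATIO) -> float:
    """Fraction of agent output expected to need manual correction."""
    return base + (1 - EFFECTIVENESS[size]) * 0.3

for size in EFFECTIVENESS:
    print(f"{size}: {adjusted_fix_ratio(size):.3f}")
```

Running this reproduces the four worked examples above (0.23, 0.35, 0.395, 0.41).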

The METR Productivity Paradox

A key finding from METR's 2025 randomized controlled trial:

Developers using AI tools took 19% longer on complex, familiar codebases — despite believing they were 20% faster.

This doesn't mean agents aren't useful. It means:

  • Agent overhead (prompting, reviewing, correcting) can exceed time saved on complex work
  • Agents excel at well-scoped, smaller tasks
  • The benefit is in throughput of small tasks, not acceleration of large ones

Implications for Estimation

  1. Don't assume agents make everything faster. For XL tasks, the agent is an assistant, not a driver.
  2. Break large tasks into smaller ones. Each S/M subtask gets high agent effectiveness.
  3. Agent effectiveness improves over time. As models improve, these ratios should be recalibrated. See Calibration.

Compounding Error

A key reason effectiveness drops non-linearly with task size is compounding per-step error:

  • 90% per-step accuracy across 10 steps gives 0.9^10 ≈ 35% end-to-end success
  • 95% per-step accuracy across 5 steps gives 0.95^5 ≈ 77% success

Larger tasks have more steps where errors compound. This is why breaking XL tasks into S/M subtasks is so effective — each subtask gets high per-step accuracy. See O'Reilly and Prodigal.
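A minimal sketch of the compounding math, assuming independent per-step errors (the function name is illustrative):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent errors."""
    return per_step ** steps

print(f"{end_to_end_success(0.90, 10):.0%}")  # 35%
print(f"{end_to_end_success(0.95, 5):.0%}")   # 77%

# Splitting a 10-step task into two 5-step subtasks with a human
# checkpoint between them raises each subtask's odds to ~59%:
print(f"{end_to_end_success(0.90, 5):.0%}")   # 59%
```

This is why decomposition helps: human review between subtasks resets the error accumulation.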

Future-Proofing

The METR time horizon analysis shows agent capabilities doubling every ~7 months (accelerated to ~4 months recently). Specific model time horizons at 50% success: Claude Opus 4.6 ~14.5 hrs, Claude Opus 4.5 ~4h49m, GPT-5 ~2h17m.
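The doubling trend can be extrapolated with a simple exponential model. This is a rough sketch: the ~7-month doubling period and the 14.5-hour horizon come from the text above, and actual capability growth may differ:

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a 50%-success time horizon forward in time."""
    return current_hours * 2 ** (months_ahead / doubling_months)

# A 14.5-hour horizon today, projected 14 months out (two doublings):
print(f"{projected_horizon(14.5, 14):.1f} hours")  # 58.0 hours
```

Under the faster ~4-month doubling observed recently, pass `doubling_months=4.0` for a more aggressive projection.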

These effectiveness values will need periodic recalibration as models improve. Track your team's actual agent effectiveness per size in the Calibration feedback loop.
