# Agent Effectiveness
AI coding agents are not equally effective across all task sizes. Based on METR time horizon research (24k runs, 228 tasks, Jan 2025–Feb 2026):
- Agents achieve ~100% success on tasks taking humans <4 minutes
- Success drops to <10% on tasks taking humans >4 hours
- The agent capability "time horizon" doubles every ~7 months (accelerated to ~4 months in 2024-2025)
Our effectiveness values measure work acceleration — how much the agent speeds up the overall task — not autonomous completion rate. METR's autonomous success rates (from 24k runs) are lower: S≈0.83, M≈0.25, L≈0.15, XL≈0.22 for frontier models. Agents that fail to complete autonomously still produce useful partial output (scaffolding, research, partial implementation). Our values sit between METR's autonomous rates and 1.0, reflecting this partial-credit reality.
This is supported by PairCoder's data (400+ tasks): agent execution time is 5-15 minutes regardless of task size, but human review/correction scales with complexity. The agent always contributes — the question is how much human intervention is needed.
| Complexity | Agent Effectiveness | What This Means |
|---|---|---|
| S | 90% | Agent handles most of the work. Minimal human correction needed. |
| M | 50% | Agent and human split the work. Significant human intervention for edge cases. |
| L | 35% | Mostly human-driven. Agent assists with subtasks, human steers direction. |
| XL | 30% | Human-driven with agent assist. Agent handles subtasks, human drives architecture. |
Agent effectiveness adjusts the human fix ratio — how much of the agent's output needs manual correction:
```
adjusted_fix_ratio = base_fix_ratio + (1 - agent_effectiveness) × 0.3
```
Examples:
- S task (0.9 effectiveness): 0.20 + (1 - 0.9) × 0.3 = 0.23 → 23% fix rate
- M task (0.5 effectiveness): 0.20 + (1 - 0.5) × 0.3 = 0.35 → 35% fix rate
- L task (0.35 effectiveness): 0.20 + (1 - 0.35) × 0.3 = 0.395 → ~40% fix rate
- XL task (0.3 effectiveness): 0.20 + (1 - 0.3) × 0.3 = 0.41 → 41% fix rate
This means larger tasks automatically allocate more human time, which matches real-world experience.
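The adjustment above can be sketched in a few lines of Python. The `base_fix_ratio` of 0.20 is taken from the worked examples, and the per-size effectiveness values come from the table; the function name is illustrative, not part of any published API.

```python
# Per-size agent effectiveness, from the table above.
AGENT_EFFECTIVENESS = {"S": 0.90, "M": 0.50, "L": 0.35, "XL": 0.30}

def adjusted_fix_ratio(size: str, base_fix_ratio: float = 0.20) -> float:
    """Fraction of agent output expected to need manual correction."""
    effectiveness = AGENT_EFFECTIVENESS[size]
    # Lower effectiveness shifts up to 30 extra percentage points
    # of the work back onto the human.
    return base_fix_ratio + (1 - effectiveness) * 0.3

for size in AGENT_EFFECTIVENESS:
    print(size, round(adjusted_fix_ratio(size), 3))
```

Because the adjustment is linear in effectiveness, the fix ratio is capped at `base_fix_ratio + 0.3` even for a completely ineffective agent.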
A key finding from METR's 2025 randomized controlled trial:

> Developers using AI tools took 19% longer on complex, familiar codebases — despite believing they were 20% faster.
This doesn't mean agents aren't useful. It means:
- Agent overhead (prompting, reviewing, correcting) can exceed time saved on complex work
- Agents excel at well-scoped, smaller tasks
- The benefit is in throughput of small tasks, not acceleration of large ones
Practical implications:
- Don't assume agents make everything faster. For XL tasks, the agent is an assistant, not a driver.
- Break large tasks into smaller ones. Each S/M subtask gets high agent effectiveness.
- Agent effectiveness improves over time. As models improve, these ratios should be recalibrated. See Calibration.
A key reason effectiveness drops non-linearly with task size is compounding per-step error:
- 90% per-step accuracy across 10 steps = 35% end-to-end success
- 95% per-step across 5 steps = 77% success
Larger tasks have more steps where errors compound. This is why breaking XL tasks into S/M subtasks is so effective — each subtask gets high per-step accuracy. See O'Reilly and Prodigal.
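The compounding figures above follow directly from multiplying per-step accuracies, assuming independent steps:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability all steps succeed, assuming independent per-step errors."""
    return per_step_accuracy ** steps

print(round(end_to_end_success(0.90, 10), 2))  # ~0.35
print(round(end_to_end_success(0.95, 5), 2))   # ~0.77
```

This is also why decomposition helps: five short subtasks at 95% per-step accuracy beat one long ten-step chain at 90%, even though each individual step is only slightly better.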
The METR time horizon analysis shows agent capabilities doubling every ~7 months (accelerated to ~4 months recently). Specific model time horizons at 50% success: Claude Opus 4.6 ~14.5 hrs, Claude Opus 4.5 ~4h49m, GPT-5 ~2h17m.
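The doubling trend can be projected as simple exponential growth. This is a rough sketch, not a METR formula: `doubling_months` is taken from the ~7-month figure in the text (use ~4 for the accelerated 2024-2025 trend), and the function name is illustrative.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Projected time horizon after `months_ahead`, doubling every `doubling_months`."""
    return current_hours * 2 ** (months_ahead / doubling_months)

# A 2-hour horizon today, 14 months out at the ~7-month doubling rate:
print(projected_horizon(2.0, 14))  # 2 doublings -> 8.0 hours
```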
These effectiveness values will need periodic recalibration as models improve. Track your team's actual agent effectiveness per size in the Calibration feedback loop.