The open benchmark for AI agent task execution.
We don't test how well agents chat. We test how well they get things done.
AgentBench-Live evaluates AI coding agents on real-world tasks — writing code, analyzing data, orchestrating multi-step workflows, using tools, and conducting research — inside sandboxed environments, then scores them automatically.
No vibes. No self-reported evals. Just results.
Live Leaderboard: full 10-task benchmark results.
Scores will be updated after the next full benchmark run with Docker sandbox + LLM Judge scoring across all 4 agents.

| Agent | CLI | Status |
|---|---|---|
| Claude Code | `claude` | Benchmarked |
| Gemini CLI | `gemini` | Benchmarked |
| Codex CLI | `codex` | Ready |
| Aider | `aider` | Ready |
Task (YAML) → Docker Sandbox → Agent Execution → Auto-Eval + LLM Judge → Score
- Task — A structured challenge with inputs, environment setup, and expected outcomes
- Sandbox — Docker prepares a clean workspace with dependencies installed. Falls back to local tempdir if Docker is unavailable.
- Agent — Receives the task prompt and works autonomously in the workspace
- Evaluator — Scores the output using automated tests, LLM-as-Judge, or both
- Ranking — Scores aggregated per domain and overall
See the full methodology for details on task design, scoring, and reproducibility.
```bash
# Install
pip install agentbench-live

# Run the full benchmark against an agent
agentbench run --agent claude-code --tasks all

# Run a single domain
agentbench run --agent aider --domain code

# View the leaderboard
agentbench leaderboard

# Generate a social comparison card
agentbench social-card --output comparison.png
```

| Domain | What We Test | How We Score |
|---|---|---|
| Code | Bug fixes, feature implementation, refactoring | Test pass rate |
| Data | CSV/JSON analysis, insight generation | Accuracy + insight quality |
| Multi-step | Complex workflows across multiple tools | End-to-end success |
| Research | Technical investigation, comparison reports | LLM-as-Judge |
| Tool Use | API calls, CLI tools, file operations | Success rate |
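A task in one of these domains might look roughly like the YAML below. Every field name here is an illustrative assumption, not the project's actual schema; see the task authoring guide for the real format.

```yaml
# Hypothetical task definition (field names are illustrative only).
id: code-001
domain: code
prompt: |
  Fix the failing test in calculator.py without changing the test file.
environment:
  image: python:3.11-slim
  setup:
    - pip install pytest
evaluation:
  method: auto          # score by running the test suite
  command: pytest -q
  pass_threshold: 1.0   # all tests must pass
```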
Any CLI-based agent can be added in ~50 lines of Python:
```python
from agentbench.adapters.base import AgentAdapter
from agentbench.adapters.registry import register_adapter


@register_adapter
class YourAgentAdapter(AgentAdapter):
    name = "your-agent"
    cli_command = "your-agent-cli"
    api_key_env_var = "YOUR_API_KEY"

    def _build_command(self, prompt: str) -> list[str]:
        return ["your-agent-cli", "--prompt", prompt]
```

Then run:

```bash
agentbench run --agent your-agent --tasks code-001
```

Submit a PR with your adapter + results, and your agent joins the leaderboard.
See CONTRIBUTING.md for the full guide.
```
agentbench-live/
├── src/agentbench/       # Core framework
│   ├── adapters/         # Agent adapters (claude-code, gemini-cli, codex-cli, aider)
│   ├── evaluator/        # Scoring (auto-eval, LLM judge, composite)
│   ├── sandbox.py        # Docker + local sandbox
│   ├── runner.py         # Benchmark orchestrator
│   └── cli.py            # CLI entry point
├── tasks/                # Benchmark task definitions (YAML)
├── leaderboard/          # Static frontend (GitHub Pages)
├── docs/                 # Methodology & guides
└── tests/                # 183 tests, 90% coverage
```
- New Tasks — Submit benchmark tasks via PR (task authoring guide)
- New Adapters — Add support for your favorite agent
- Evaluator Improvements — Better scoring heuristics and judges
MIT
Built with the belief that the best way to improve agents is to measure them honestly.