A benchmark suite for evaluating OpenClaw agent performance across 7 scenarios and 63 tasks.
Each task has:

- An instruction sent to the OpenClaw agent
- An environment setup that seeds files/data in the workspace
- A reference solution (`solve.sh`) that can be run to verify correctness
- A verifier (`test.sh`) that checks the agent's output and writes `0` or `1` to `$REWARD_DIR/reward.txt`
The runner sends each task to the OpenClaw agent as a local subprocess and scores the result.
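In rough terms, that loop amounts to: launch the CLI with `REWARD_DIR` pointing at a fresh directory, wait for the verifier to write `reward.txt`, and read it back. This is a minimal sketch, not the real `run.py`; the `openclaw run --message` invocation and the `run_task` helper are assumptions for illustration.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def parse_reward(text: str) -> int:
    """reward.txt holds a single 0 or 1; anything else counts as failure."""
    return 1 if text.strip() == "1" else 0


def run_task(instruction: str, timeout: int = 120) -> int:
    """Hypothetical: send one task to the openclaw CLI and read back the reward."""
    with tempfile.TemporaryDirectory() as reward_dir:
        env = {**os.environ, "REWARD_DIR": reward_dir}
        # The actual flags used by run.py may differ.
        subprocess.run(["openclaw", "run", "--message", instruction],
                       env=env, timeout=timeout, check=False)
        reward_file = Path(reward_dir) / "reward.txt"
        return parse_reward(reward_file.read_text()) if reward_file.exists() else 0
```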
Requirements:

- Python 3.13+
- `uv` package manager
- `openclaw` CLI installed and on `$PATH`
```shell
uv sync

# List all tasks
uv run python run.py --list

# Run all scenarios (63 tasks)
uv run python run.py --all

# Run a single scenario
uv run python run.py --scenario file
uv run python run.py --scenario weather
uv run python run.py --scenario web
uv run python run.py --scenario summarize
uv run python run.py --scenario github
uv run python run.py --scenario gmail
uv run python run.py --scenario compound

# Run a single task
uv run python run.py --task tasks/file/file-organization

# Filter by difficulty
uv run python run.py --scenario file --difficulty easy

# Verify reference solutions pass
uv run python run.py --verify-only --scenario file

# Export results to JSON
uv run python run.py --all -o results.json
```

| Scenario | Tasks | What It Tests |
|---|---|---|
| `file` | 9 | File creation, transformation, log analysis, data pipelines |
| `weather` | 9 | Current weather, forecasts, multi-city comparisons via web_fetch |
| `web` | 9 | Fact retrieval from live web sources (PyPI, npm, GitHub, Wikipedia) |
| `summarize` | 9 | Document summarization, comparison, action item extraction |
| `github` | 9 | GitHub public API: repo stats, languages, issues, stars |
| `gmail` | 9 | Local email parsing: counting, filtering, extracting, summarizing |
| `compound` | 9 | Multi-step tasks combining file operations + web fetching |
Configuration via .env or environment variables:
```shell
AGENT_ID=main                               # OpenClaw agent to use (default: main)
TIMEOUT_MULTIPLIER=1.0                      # Scale all timeouts (use >1 on slow machines)
BOT_WORKSPACE_PATH=/tmp/openclaw_benchmark  # Local workspace path
```

```
openclawbench/
├── run.py            # CLI entry point
├── task_runner.py    # TaskRunner, LocalBackend, DaytonaBackend
├── config.py         # Settings (pydantic, from .env)
├── CLAUDE.md         # Dev notes
└── tasks/
    ├── file/         # 9 file manipulation tasks
    ├── weather/      # 9 weather tasks
    ├── web/          # 9 web lookup tasks
    ├── summarize/    # 9 summarization tasks
    ├── github/       # 9 GitHub API tasks
    ├── gmail/        # 9 email parsing tasks
    └── compound/     # 9 multi-step tasks
```
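As a rough illustration of what `config.py` does, the settings above boil down to reading three environment variables with defaults. This plain-stdlib sketch stands in for the real implementation, which uses pydantic and also reads a `.env` file:

```python
import os


def load_settings() -> dict:
    """Stand-in for config.py: environment variables with the documented defaults."""
    return {
        "agent_id": os.environ.get("AGENT_ID", "main"),
        "timeout_multiplier": float(os.environ.get("TIMEOUT_MULTIPLIER", "1.0")),
        "bot_workspace_path": os.environ.get("BOT_WORKSPACE_PATH",
                                             "/tmp/openclaw_benchmark"),
    }
```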
Each task follows this structure:

```
tasks/<scenario>/<task-name>/
├── task.toml                       # Metadata (difficulty, timeout, allow_internet)
├── instruction.md                  # Task prompt sent to the agent
├── environment/setup_workspace.py  # Seeds files/data before the task
├── solution/solve.sh               # Reference solution
└── tests/test.sh                   # Verifier
```
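For concreteness, a verifier might look like the sketch below. Only the reward contract (write `0` or `1` to `$REWARD_DIR/reward.txt`) comes from the runner; the `$WORKSPACE` variable name, the `report.txt` file, and the expected string are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical tests/test.sh: pass iff the agent produced report.txt
# containing the expected line.
WORKSPACE="${WORKSPACE:-/tmp/openclaw_ws}"        # assumed variable name
REWARD_DIR="${REWARD_DIR:-/tmp/openclaw_reward}"
mkdir -p "$REWARD_DIR"

if grep -q "TOTAL: 42" "$WORKSPACE/report.txt" 2>/dev/null; then
  echo 1 > "$REWARD_DIR/reward.txt"   # pass
else
  echo 0 > "$REWARD_DIR/reward.txt"   # fail
fi
```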