A benchmark suite for evaluating OpenClaw agent performance across 7 scenarios and 63 tasks.
Each task has:

- An instruction sent to the OpenClaw agent
- An environment setup that seeds files/data in the workspace
- A reference solution (`solve.sh`) that can be run to verify correctness
- A verifier (`test.sh`) that checks the agent's output and writes `0` or `1` to `$REWARD_DIR/reward.txt`
The runner sends each task to the OpenClaw agent as a local subprocess and scores the result.
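In rough terms, that loop amounts to: launch the CLI with `REWARD_DIR` pointing at a fresh directory, wait for the verifier to write `reward.txt`, and read it back. This is a minimal sketch, not the real `run.py`; the `openclaw run --message` invocation and the `run_task` helper are assumptions for illustration.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def parse_reward(text: str) -> int:
    """reward.txt holds a single 0 or 1; anything else counts as failure."""
    return 1 if text.strip() == "1" else 0


def run_task(instruction: str, timeout: int = 120) -> int:
    """Hypothetical: send one task to the openclaw CLI and read back the reward."""
    with tempfile.TemporaryDirectory() as reward_dir:
        env = {**os.environ, "REWARD_DIR": reward_dir}
        # The actual flags used by run.py may differ.
        subprocess.run(["openclaw", "run", "--message", instruction],
                       env=env, timeout=timeout, check=False)
        reward_file = Path(reward_dir) / "reward.txt"
        return parse_reward(reward_file.read_text()) if reward_file.exists() else 0
```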
Requirements:

- Python 3.13+
- `uv` package manager
- `openclaw` CLI installed and on `$PATH`
```shell
uv sync

# List all tasks
uv run python run.py --list

# Run all scenarios (63 tasks)
uv run python run.py --all

# Run a single scenario
uv run python run.py --scenario file
uv run python run.py --scenario weather
uv run python run.py --scenario web
uv run python run.py --scenario summarize
uv run python run.py --scenario github
uv run python run.py --scenario gmail
uv run python run.py --scenario compound

# Run a single task
uv run python run.py --task tasks/file/file-organization

# Filter by difficulty
uv run python run.py --scenario file --difficulty easy

# Verify reference solutions pass
uv run python run.py --verify-only --scenario file

# Export results to JSON
uv run python run.py --all -o results.json
```

| Scenario | Tasks | What It Tests |
|---|---|---|
| `file` | 9 | File creation, transformation, log analysis, data pipelines |
| `weather` | 9 | Current weather, forecasts, multi-city comparisons via web_fetch |
| `web` | 9 | Fact retrieval from live web sources (PyPI, npm, GitHub, Wikipedia) |
| `summarize` | 9 | Document summarization, comparison, action item extraction |
| `github` | 9 | GitHub public API: repo stats, languages, issues, stars |
| `gmail` | 9 | Local email parsing: counting, filtering, extracting, summarizing |
| `compound` | 9 | Multi-step tasks combining file operations + web fetching |
Configuration via .env or environment variables:
```shell
AGENT_ID=main                               # OpenClaw agent to use (default: main)
TIMEOUT_MULTIPLIER=1.0                      # Scale all timeouts (use >1 on slow machines)
BOT_WORKSPACE_PATH=/tmp/openclaw_benchmark  # Local workspace path
```

```
openclawbench/
├── run.py            # CLI entry point
├── task_runner.py    # TaskRunner, LocalBackend, DaytonaBackend
├── config.py         # Settings (pydantic, from .env)
├── CLAUDE.md         # Dev notes
└── tasks/
    ├── file/         # 9 file manipulation tasks
    ├── weather/      # 9 weather tasks
    ├── web/          # 9 web lookup tasks
    ├── summarize/    # 9 summarization tasks
    ├── github/       # 9 GitHub API tasks
    ├── gmail/        # 9 email parsing tasks
    └── compound/     # 9 multi-step tasks
```
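As a rough illustration of what `config.py` does, the settings above boil down to reading three environment variables with defaults. This plain-stdlib sketch stands in for the real implementation, which uses pydantic and also reads a `.env` file:

```python
import os


def load_settings() -> dict:
    """Stand-in for config.py: environment variables with the documented defaults."""
    return {
        "agent_id": os.environ.get("AGENT_ID", "main"),
        "timeout_multiplier": float(os.environ.get("TIMEOUT_MULTIPLIER", "1.0")),
        "bot_workspace_path": os.environ.get("BOT_WORKSPACE_PATH",
                                             "/tmp/openclaw_benchmark"),
    }
```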
Each task follows this structure:

```
tasks/<scenario>/<task-name>/
├── task.toml                       # Metadata (difficulty, timeout, allow_internet)
├── instruction.md                  # Task prompt sent to the agent
├── environment/setup_workspace.py  # Seeds files/data before the task
├── solution/solve.sh               # Reference solution
└── tests/test.sh                   # Verifier
```
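For concreteness, a verifier might look like the sketch below. Only the reward contract (write `0` or `1` to `$REWARD_DIR/reward.txt`) comes from the runner; the `$WORKSPACE` variable name, the `report.txt` file, and the expected string are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical tests/test.sh: pass iff the agent produced report.txt
# containing the expected line.
WORKSPACE="${WORKSPACE:-/tmp/openclaw_ws}"        # assumed variable name
REWARD_DIR="${REWARD_DIR:-/tmp/openclaw_reward}"
mkdir -p "$REWARD_DIR"

if grep -q "TOTAL: 42" "$WORKSPACE/report.txt" 2>/dev/null; then
  echo 1 > "$REWARD_DIR/reward.txt"   # pass
else
  echo 0 > "$REWARD_DIR/reward.txt"   # fail
fi
```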