openclaw-model-eval

Model capability test suite for OpenClaw — benchmark any model across 7 tiers from pure reasoning to full agentic pipeline.

Developed by The AI Horizon while validating local Qwen models as zero-cost alternatives to cloud APIs.

What This Is

A structured benchmark for OpenClaw agents. Each test runs in an isolated session (no bleed between tests) and checks real capabilities — not simulated ones. The model must actually call tools, handle failures, and produce structured output.

Validated Results (qwen3.5:35b-nothink, 2026-03-12)

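| Tier | Name | Score |
|------|------|-------|
| 0 | Pure Reasoning | 6/6 ✅ |
| 1 | Single Tool Calls | 9/9 ✅ |
| 2 | Multi-Step Single Domain | 5/5 ✅ |
| 3 | Cross-Tool Agentic | 2/2 ✅ |
| 4 | Full Research Pipeline | 1/1 ✅ |
| 5 | Adversarial / Edge Cases | 5/5 ✅ |
| 6 | Domain-Specific | 4/4 ✅ |
| - | Stress Test (Tier 0+1 × 3 runs) | 36/36 ✅ |
| Total (Tiers 0 to 6, stress test not included) | | 32/32 ✅ |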

Full write-up: How to run OpenClaw with a local Qwen model

Tiers

| Tier | Name | Tests | What It Validates |
|------|------|-------|-------------------|
| 0 | Pure Reasoning | 6 | Logic, math, JSON, instruction following (no tools) |
| 1 | Single Tool | 9 | Each OpenClaw tool called once correctly |
| 2 | Multi-Step Single Domain | 5 | Chained tool use in one domain |
| 3 | Cross-Tool Agentic | 2 | Complex tasks spanning multiple tools |
| 4 | Full Research Pipeline | 1 | End-to-end: search → memory → exec → report |
| 5 | Adversarial / Edge Cases | 5 | Ambiguity handling, conflict resolution, failure recovery, refusal |
| 6 | Domain-Specific (AI Horizon) | 4 | Evidence classification, forecast scoring, DCWF mapping |

Requirements & Prerequisites

What every tier needs

OpenClaw installed and configured
openclaw CLI available in PATH
A configured agent (default: main) with at least one working model
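
A quick preflight can confirm the CLI prerequisites before a run. The sketch below is not part of model-eval.py: `missing_tools` is a hypothetical helper, and the tool list is simply the set of CLIs this README mentions.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the CLI names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Only `openclaw` is needed by every tier; `qmd` and `gog` are tier-specific.
    missing = missing_tools(["openclaw", "qmd", "gog"])
    print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be stubbed out in a test; in normal use the default `shutil.which` is enough.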

Start here: Tier 0 only needs OpenClaw

Tier 0 is pure reasoning — no tools, no external services. If you just want to check whether a model can think, start here:

python3 model-eval.py --tier 0 --label "reasoning-only"

No other setup required.

Tool requirements by tier

Each tier adds tools. Here's exactly what each one needs and how to verify it works before running.

Tier 1 — Single Tool Calls

| Test | Tool | What you need | Verify |
|------|------|---------------|--------|
| T1.1, T1.2 | `exec` (shell) | `exec` skill enabled in your agent | `openclaw exec --agent main "run: echo hello"` |
| T1.3 | `memory_search` (QMD) | QMD installed + workspace indexed | `qmd search "test"` returns results |
| T1.4 | `chromadb_search` | ChromaDB running on port 8100, collection `longterm_memory` exists | `curl http://localhost:8100/api/v2/collections` |
| T1.5 | `web_fetch` | Internet access, `web_fetch` skill enabled | Skill in your agent's skill list |
| T1.6 | `web_search` via SearXNG | SearXNG running (see SearXNG setup below) | `curl "http://YOUR_SEARXNG_HOST:PORT/search?q=test&format=json"` |
| T1.7 | `exec` + remindctl | macOS only: Reminders app + remindctl or equivalent | `remindctl list` returns output |
| T1.8, T1.9 | `gog` (Google Sheets) | `gog` CLI installed + Google account authenticated + `EVAL_TEST_SHEET_ID` set | `gog sheets get "$EVAL_TEST_SHEET_ID" 'Sheet1!A1'` |

To skip tests you can't support, use --test to run only specific IDs, or --fast to skip anything that requires external services.

Tier 2 — Multi-Step

T2.1–T2.3: Only exec needed
T2.4: web_fetch (internet)
T2.5: SearXNG (see below)

Tier 3 — Cross-Tool Agentic

T3.1: gog (Google Sheets read + write)
T3.2: ChromaDB (write + read round-trip)

Both tests are destructive (they write data), so they run only with the --all flag. EVAL_TEST_SHEET_ID must also be set.

Tier 4 — Full Pipeline

Requires all three: SearXNG, memory_search (QMD), and exec. This is the most demanding tier.

Tier 5 — Adversarial

No external tools. Pure model behavior — ambiguity, conflict handling, refusal. Runs anywhere Tier 0 runs.

Tier 6 — Domain-Specific

No external tools. Structured reasoning tasks. Runs anywhere Tier 0 runs.

SearXNG setup

Repo: github.com/searxng/searxng | Docs: docs.searxng.org

SearXNG is a free, self-hosted metasearch engine. It's required for T1.6, T2.5, and T4.1. You host it yourself, so there is no API key and no rate limit.

Option 1 — Docker (recommended, 60 seconds):

docker run -d \
  -p 4000:8080 \
  --name searxng \
  -e SEARXNG_SECRET=$(openssl rand -hex 32) \
  searxng/searxng

Option 2 — Docker Compose (persistent config):

git clone https://github.com/searxng/searxng-docker
cd searxng-docker
docker compose up -d

Enable JSON output (required for the eval — the API won't work without it):

Edit settings.yml inside your SearXNG container or volume and ensure:

search:
  formats:
    - html
    - json

Then restart: docker restart searxng

Verify it's working:

curl "http://localhost:4000/search?q=test&format=json" | python3 -m json.tool | head -10
# Should return {"query": "test", "results": [...]}

If you get {"error": "..."} or a 403, JSON output is not enabled yet.
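
The same check can be scripted. The sketch below is an illustration, not code from model-eval.py: `diagnose_searxng` and `check` are hypothetical names, and the base URL assumes the port mapping from the Docker command above. It applies the rule just stated: a 403 or an `error` payload means JSON output is not enabled.

```python
import json
import urllib.error
import urllib.request


def diagnose_searxng(status, body):
    """Classify a response from /search?q=...&format=json."""
    if status == 403:
        return "json-disabled"
    try:
        data = json.loads(body)
    except ValueError:
        return "not-json"  # e.g. an HTML page came back instead
    if isinstance(data, dict) and "error" in data:
        return "json-disabled"
    if isinstance(data, dict) and "results" in data:
        return "ok"
    return "unexpected"


def check(base="http://localhost:4000"):
    """Query a running instance; connection failures raise URLError."""
    url = base + "/search?q=test&format=json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return diagnose_searxng(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        return diagnose_searxng(err.code, err.read())
```

Keeping the classification separate from the network call makes the rule easy to test without a running container.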

Update the URL in the tests — the script uses a placeholder YOUR_SEARXNG_HOST:PORT. Replace it before running:

# macOS
sed -i '' 's|YOUR_SEARXNG_HOST:PORT|localhost:4000|g' model-eval.py

# Linux
sed -i 's|YOUR_SEARXNG_HOST:PORT|localhost:4000|g' model-eval.py
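
If you would rather avoid the macOS/Linux difference in `sed -i`, a few lines of Python do the same substitution on either platform. This is a minimal sketch; `set_searxng_host` is a hypothetical helper, and it assumes the placeholder appears in the script exactly as written above.

```python
from pathlib import Path

PLACEHOLDER = "YOUR_SEARXNG_HOST:PORT"


def set_searxng_host(script_path, host_port):
    """Replace every placeholder occurrence in place; return how many were replaced."""
    path = Path(script_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(PLACEHOLDER)
    if count:
        path.write_text(text.replace(PLACEHOLDER, host_port), encoding="utf-8")
    return count


if __name__ == "__main__":
    print(set_searxng_host("model-eval.py", "localhost:4000"), "replacement(s)")
```

A return value of 0 means the placeholder was already replaced, so running it twice is harmless.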

ChromaDB setup

Repo: github.com/chroma-core/chroma | Docs: docs.trychroma.com

ChromaDB is an open-source vector database used by OpenClaw for long-term conversational memory. Required for T1.4 (search) and T3.2 (write + read round-trip).

Install:

pip install chromadb

Run on port 8100 (the port OpenClaw and these tests expect):

chroma run --host 0.0.0.0 --port 8100

If you want it to run persistently in the background, use a tool like screen or tmux, or create a systemd/launchd service.

Create the required collection — the tests look for a collection named exactly longterm_memory. Create it once:

python3 -c "
import chromadb
client = chromadb.HttpClient(host='localhost', port=8100)
client.get_or_create_collection('longterm_memory')
print('Collection ready')
"

Verify both the server and collection:

# Server up?
curl http://localhost:8100/api/v2/heartbeat
# {"nanosecond heartbeat": ...}

# Collection exists?
curl http://localhost:8100/api/v2/collections
# Should include "longterm_memory" in the response

Note: The tests use ChromaDB v2 API (/api/v2/). Make sure you're running ChromaDB 0.5.0 or later.
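
To script the collection check rather than reading the curl output by eye, something like the following could work. It is a sketch under an assumption: the listing is taken to be a JSON array whose entries are either plain names or objects with a `name` key, which may not match every ChromaDB version. `has_collection` is a hypothetical helper.

```python
import json


def has_collection(payload, name="longterm_memory"):
    """True if a collections listing mentions `name`."""
    entries = json.loads(payload) if isinstance(payload, (str, bytes)) else payload
    for entry in entries:
        # Assumption: entries are plain names or objects with a "name" key.
        found = entry.get("name") if isinstance(entry, dict) else entry
        if found == name:
            return True
    return False
```

Pass it the raw body returned by the collections endpoint, or an already-decoded list.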

QMD (memory_search) setup

Package: @tobilu/qmd on npm

QMD is OpenClaw's workspace memory system — BM25 + vector search over your markdown files. Required for T1.3 (memory_search) and T4.1 (full pipeline).

Install:

npm install -g @tobilu/qmd

Index your workspace:

# Point at a directory containing .md files
qmd update   # scans for markdown files
qmd embed    # runs embeddings (downloads a ~329MB model on first run)

Verify:

qmd search "test query"
# Should return results if you have indexed content

If T1.3 returns "no results found" — that's expected if your workspace has no content. Either skip this test or add a few .md files and re-run qmd update && qmd embed.

gog (Google Sheets) setup

Package: @tobilu/gog on npm

gog is a Google Workspace CLI (Sheets, Drive, Gmail, Calendar). Required for T1.8, T1.9 (Sheets read/write), and T3.1 (cross-tool Sheets task). Uses OAuth — no API key needed, just a Google account.

Install:

npm install -g @tobilu/gog

Authenticate:

gog auth login
# Opens browser for Google OAuth — sign in with the account that owns your sheet

Create a throwaway test sheet (don't point at a real sheet — tests write rows):

gog sheets create "OpenClaw Eval Test Sheet"
# Output includes the sheet ID, e.g.:
# Created: https://docs.google.com/spreadsheets/d/1abc.../edit
#                                                  ^^^^^ this is EVAL_TEST_SHEET_ID

export EVAL_TEST_SHEET_ID=1abc...

Verify auth is working:

# Single quotes keep interactive shells from treating "!" as history expansion
gog sheets get "$EVAL_TEST_SHEET_ID" 'Sheet1!A1'
# Should return the cell value (empty is fine for a new sheet)

Quickstart: what to run based on your setup

| Your setup | Command |
|------------|---------|
| Just OpenClaw, no extras | `python3 model-eval.py --tier 0 5 6 --label "baseline"` |
| OpenClaw + exec skill | `python3 model-eval.py --tier 0 1 2 --fast --label "no-network"` |
| Full stack (SearXNG + ChromaDB + QMD) | `python3 model-eval.py --tier 0 1 2 3 4 --label "full"` |
| Everything including writes | `python3 model-eval.py --all --label "complete"` |

The --fast flag skips any test that requires an external network call (SearXNG, web_fetch, Sheets).
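
To make the filtering rules concrete, here is one way the selection could be expressed. It is a sketch of the documented behavior, not the script's actual code: `select_tests` is a hypothetical function, and it relies on the `tier`, `fast`, and `destructive` fields that each test dict carries.

```python
def select_tests(tests, tiers=None, ids=None, fast=False, include_destructive=False):
    """Filter test dicts by tier, ID, the fast flag, and the destructive flag."""
    selected = []
    for test in tests:
        if ids and test["id"] not in ids:
            continue
        if tiers is not None and test["tier"] not in tiers:
            continue
        if fast and not test.get("fast", False):
            continue  # --fast keeps only tests marked as having no external calls
        if test.get("destructive", False) and not include_destructive:
            continue  # destructive tests run only when explicitly requested
        selected.append(test)
    return selected


if __name__ == "__main__":
    sample = [
        {"id": "T0.1", "tier": 0, "fast": True, "destructive": False},
        {"id": "T1.6", "tier": 1, "fast": False, "destructive": False},
        {"id": "T3.1", "tier": 3, "fast": False, "destructive": True},
    ]
    print([t["id"] for t in select_tests(sample, tiers={0, 1}, fast=True)])
```

The sample entries are made up for illustration; their flag values are not taken from the real TESTS list.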

Usage

# Run baseline (Tier 0 + 1) — fast, no external calls
python3 model-eval.py --tier 0 1 --fast --label "baseline"

# Run all tiers
python3 model-eval.py --label "full-run"

# Test a specific model (swaps primary, restores after)
python3 model-eval.py --tier 0 1 2 --model "localgpu/qwen3.5:35b-nothink" --label "qwen-test"

# Stress test — run 3x for reliability baseline
python3 model-eval.py --tier 0 1 --repeat 3 --label "stress-test"

# Adversarial tests
python3 model-eval.py --tier 5 --label "adversarial"

# AI Horizon domain tests
python3 model-eval.py --tier 6 --label "domain"

# Specific test IDs
python3 model-eval.py --test T0.1 T1.3 T5.2

# Dry run — see what would run without executing
python3 model-eval.py --dry-run

# Compare two runs
python3 model-eval.py --compare run-id-1 run-id-2
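
For readers who want to wrap or extend the CLI, the flags used above map naturally onto `argparse`. The parser below is a reconstruction from the examples in this README, not the script's own parser; the defaults and help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="model-eval.py")
    parser.add_argument("--tier", type=int, nargs="+", help="tier numbers to run")
    parser.add_argument("--test", nargs="+", help="specific test IDs, e.g. T0.1")
    parser.add_argument("--fast", action="store_true", help="skip external-service tests")
    parser.add_argument("--all", action="store_true", help="include destructive tests")
    parser.add_argument("--repeat", type=int, default=1, help="runs per test (assumed default)")
    parser.add_argument("--model", help="model to swap in as primary for the run")
    parser.add_argument("--label", default="", help="free-form run label")
    parser.add_argument("--dry-run", action="store_true", help="list tests without running")
    parser.add_argument("--compare", nargs=2, metavar="RUN_ID", help="two run IDs to compare")
    parser.add_argument("--slack", action="store_true", help="post results to Slack")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["--tier", "0", "1", "--fast", "--label", "baseline"]))
```

Using `nargs="+"` is what lets `--tier 0 1 2` and `--test T0.1 T1.3` take several values after a single flag.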

Configuration

--model flag

Temporarily swaps agents.list[id=main].model.primary in openclaw.json, restarts the gateway (~30s), runs the eval, then restores the original model. The gateway is unavailable during the swap.

python3 model-eval.py --tier 0 1 --model "localgpu/qwen3.5:35b-nothink"

If interrupted mid-run, restore manually:

# Check current primary
grep -A2 '"id": "main"' ~/.openclaw/openclaw.json | grep primary
# Edit openclaw.json → agents.list[id=main].model.primary → your original model
# Then restart the gateway (launchctl is macOS-only; use your platform's service manager elsewhere)
launchctl kickstart -k gui/$UID/ai.openclaw.gateway
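
The swap-and-restore pattern can be sketched as a context manager. This is an illustration only, not the script's implementation: `temporary_primary` is a hypothetical name, the config is assumed to be strict JSON with an `agents.list` array of objects carrying `id` and `model.primary`, and the gateway restart is left out.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_primary(config_path, model, agent_id="main"):
    """Set the agent's primary model, then restore the original file on exit."""
    path = Path(config_path)
    original_text = path.read_text(encoding="utf-8")
    config = json.loads(original_text)
    # Assumed layout: {"agents": {"list": [{"id": ..., "model": {"primary": ...}}]}}
    agent = next(a for a in config["agents"]["list"] if a["id"] == agent_id)
    previous = agent["model"]["primary"]
    agent["model"]["primary"] = model
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    try:
        yield previous
    finally:
        # Runs on normal exit, exceptions, and Ctrl-C, but not on a hard kill.
        path.write_text(original_text, encoding="utf-8")
```

Writing back the saved text rather than a re-serialized dict keeps the restored file byte-identical to the original. A hard kill still skips the `finally` block, which is why the manual steps above remain useful.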

Output

Results are saved to ~/.openclaw/workspace/eval-results/ as JSON + Markdown.

eval-results/
  20260312-114552-ee2646.json   # raw results with token counts, timings
  20260312-114552-ee2646.md     # formatted report
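
Judging from the example filenames, a run ID looks like a date, a time, and a short hex suffix. If that reading is right (it is an assumption based only on the listing above), the ID can be parsed for sorting or display as follows; `parse_run_id` is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed shape: YYYYMMDD-HHMMSS-<hex suffix>
_RUN_ID = re.compile(r"^(\d{8}-\d{6})-([0-9a-f]+)$")


def parse_run_id(run_id):
    """Split a run ID into (start time, suffix); raise ValueError if malformed."""
    match = _RUN_ID.match(run_id)
    if not match:
        raise ValueError(f"unrecognised run ID: {run_id!r}")
    return datetime.strptime(match.group(1), "%Y%m%d-%H%M%S"), match.group(2)
```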

Post results to Slack after a run:

python3 model-eval.py --tier 0 1 --slack --label "weekly-check"

Adding Tests

Tests are defined in the TESTS list in model-eval.py. Each test is a dict:

{
    "id": "T5.6",              # unique ID
    "tier": 5,                 # tier number
    "name": "My test",         # short description
    "prompt": "...",           # sent to the agent
    "expect_contains": ["keyword1", "keyword2"],  # pass if ANY matched
    "expect_json": True,       # optional: validate JSON in response
    "timeout": 180,            # seconds
    "fast": True,              # True = no external API calls
    "destructive": False,      # True = skipped unless --all
}

Pass logic: test passes if at least one expect_contains keyword appears in the response (case-insensitive).
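
The pass rule is small enough to state in code. This is a minimal sketch of the rule as described, with a hypothetical function name; how `expect_json` combines with it is not specified here, so it is left out.

```python
def keyword_pass(response, expect_contains):
    """True if any expected keyword occurs in the response, ignoring case."""
    haystack = response.lower()
    return any(keyword.lower() in haystack for keyword in expect_contains)
```

Because the match is a substring test, a short keyword such as "no" also matches inside longer words like "know", so pick keywords that are specific enough. In this sketch an empty `expect_contains` list never passes.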

Stress Testing

Use --repeat N to run the same tests multiple times and measure consistency:

python3 model-eval.py --tier 0 1 --repeat 5 --label "reliability"

Each repeat gets a fresh isolated session. Results show all N runs — look for any failures across repeats to identify flaky behavior.
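
Spotting flaky tests across repeats amounts to finding IDs that both passed and failed. The helper below is a sketch with an assumed input shape, a flat list of `(test_id, passed)` pairs; the real results JSON may be organized differently.

```python
from collections import defaultdict


def flaky_tests(results):
    """Map test ID -> (passes, runs) for tests that both passed and failed."""
    tally = defaultdict(lambda: [0, 0])
    for test_id, passed in results:
        tally[test_id][0] += int(passed)
        tally[test_id][1] += 1
    return {tid: (p, n) for tid, (p, n) in tally.items() if 0 < p < n}
```

A test that fails every time is not reported here; that is a consistent failure rather than flakiness.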

Background

This eval suite was built while getting qwen3.5:35b-nothink working reliably in OpenClaw. The key discovery: qwen3.x models in streaming mode emit output in the reasoning field, not content, causing OpenClaw to silently fall through to the next model. A thin Ollama proxy fixes this.
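
As a rough illustration of what such a proxy has to do, the core step is moving text from `reasoning` into `content` when `content` is empty. The function below is a guess at that transformation, not the proxy from the write-up; the flat dict shape is an assumption, and only the two field names come from the paragraph above.

```python
def promote_reasoning(delta):
    """Return a copy of a streamed delta with reasoning text moved into content."""
    fixed = dict(delta)
    if not fixed.get("content") and fixed.get("reasoning"):
        fixed["content"] = fixed.pop("reasoning")
    return fixed
```

Deltas that already carry content are passed through unchanged, so the step is safe to apply to every chunk.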

Full write-up: gist.github.com/TheAIHorizon/37c30e375f2ce08e726e4bb6347f26b1

Maintained by The AI Horizon — forecasting AI's impact on the workforce.
