Model capability test suite for OpenClaw — benchmark any model across 7 tiers from pure reasoning to full agentic pipeline.
Developed by The AI Horizon while validating local Qwen models as zero-cost alternatives to cloud APIs.
A structured benchmark for OpenClaw agents. Each test runs in an isolated session (no bleed between tests) and checks real capabilities — not simulated ones. The model must actually call tools, handle failures, and produce structured output.
| Tier | Name | Score |
|---|---|---|
| 0 | Pure Reasoning | 6/6 ✅ |
| 1 | Single Tool Calls | 9/9 ✅ |
| 2 | Multi-Step Single Domain | 5/5 ✅ |
| 3 | Cross-Tool Agentic | 2/2 ✅ |
| 4 | Full Research Pipeline | 1/1 ✅ |
| 5 | Adversarial / Edge Cases | 5/5 ✅ |
| 6 | Domain-Specific | 4/4 ✅ |
| — | Stress Test (Tier 0+1 × 3 runs) | 36/36 ✅ |
| — | Total | 32/32 ✅ |
Full write-up: How to run OpenClaw with a local Qwen model
| Tier | Name | Tests | What It Validates |
|---|---|---|---|
| 0 | Pure Reasoning | 6 | Logic, math, JSON, instruction following — no tools |
| 1 | Single Tool | 9 | Each OpenClaw tool called once correctly |
| 2 | Multi-Step Single Domain | 5 | Chained tool use in one domain |
| 3 | Cross-Tool Agentic | 2 | Complex tasks spanning multiple tools |
| 4 | Full Research Pipeline | 1 | End-to-end: search → memory → exec → report |
| 5 | Adversarial / Edge Cases | 5 | Ambiguity handling, conflict resolution, failure recovery, refusal |
| 6 | Domain-Specific (AI Horizon) | 4 | Evidence classification, forecast scoring, DCWF mapping |
- OpenClaw installed and configured
- `openclaw` CLI available in PATH
- A configured agent (default: `main`) with at least one working model
Tier 0 is pure reasoning — no tools, no external services. If you just want to check whether a model can think, start here:
```bash
python3 model-eval.py --tier 0 --label "reasoning-only"
```

No other setup required.
Each tier adds tools. Here's exactly what each one needs and how to verify it works before running.
| Test | Tool | What you need | Verify |
|---|---|---|---|
| T1.1, T1.2 | `exec` (shell) | exec skill enabled in your agent | `openclaw exec --agent main "run: echo hello"` |
| T1.3 | `memory_search` (QMD) | QMD installed + workspace indexed | `qmd search "test"` returns results |
| T1.4 | `chromadb_search` | ChromaDB running on port 8100, collection `longterm_memory` exists | `curl http://localhost:8100/api/v2/collections` |
| T1.5 | `web_fetch` | Internet access, `web_fetch` skill enabled | Skill in your agent's skill list |
| T1.6 | `web_search` via SearXNG | SearXNG running — see SearXNG setup below | `curl "http://YOUR_SEARXNG_HOST:PORT/search?q=test&format=json"` |
| T1.7 | `exec` + remindctl | macOS only — Reminders app + `remindctl` or equivalent | `remindctl list` returns output |
| T1.8, T1.9 | `gog` (Google Sheets) | `gog` CLI installed + Google account authenticated + `EVAL_TEST_SHEET_ID` set | `gog sheets get $EVAL_TEST_SHEET_ID "Sheet1!A1"` |
Skip tests you can't support: use `--test` to run only specific IDs, or `--fast` to skip anything that requires external services.
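Before a full run, it can save time to probe each dependency once. A minimal preflight sketch in Python, using only the standard library; the hostnames, ports, and command names mirror the table above and are assumptions to adjust for your setup:

```python
import shutil
import socket

def command_available(name: str) -> bool:
    """True if a CLI command is on PATH."""
    return shutil.which(name) is not None

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# One probe per external dependency. Hosts and ports match the defaults
# used elsewhere in this README; adjust them to your setup.
CHECKS = {
    "openclaw CLI (all tiers)": lambda: command_available("openclaw"),
    "qmd (T1.3)": lambda: command_available("qmd"),
    "ChromaDB on :8100 (T1.4)": lambda: port_open("localhost", 8100),
    "SearXNG on :4000 (T1.6)": lambda: port_open("localhost", 4000),
    "remindctl (T1.7, macOS)": lambda: command_available("remindctl"),
    "gog (T1.8, T1.9)": lambda: command_available("gog"),
}

if __name__ == "__main__":
    for label, check in CHECKS.items():
        status = "ok" if check() else "MISSING"
        print(f"{status:7} {label}")
```

Anything reported MISSING maps directly to tests you should exclude with `--test` or `--fast`.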
Tier 2:

- T2.1–T2.3: only `exec` needed
- T2.4: `web_fetch` (internet)
- T2.5: SearXNG (see below)

Tier 3:

- T3.1: `gog` (Google Sheets read + write)
- T3.2: ChromaDB (write + read round-trip)
Both Tier 3 tests are destructive (they write data) — running them requires the `--all` flag and `EVAL_TEST_SHEET_ID` set.
Tier 4's single test needs all three: SearXNG + `memory_search` (QMD) + `exec`. This is the most demanding tier.
No external tools. Pure model behavior — ambiguity, conflict handling, refusal. Runs anywhere Tier 0 runs.
No external tools. Structured reasoning tasks. Runs anywhere Tier 0 runs.
Repo: github.com/searxng/searxng | Docs: docs.searxng.org
SearXNG is a free, self-hosted meta search engine. It's required for T1.6, T2.5, T3.x, and T4.1. You host it yourself — no API key, no rate limits.
Option 1 — Docker (recommended, 60 seconds):

```bash
docker run -d \
  -p 4000:8080 \
  --name searxng \
  -e SEARXNG_SECRET=$(openssl rand -hex 32) \
  searxng/searxng
```

Option 2 — Docker Compose (persistent config):
```bash
git clone https://github.com/searxng/searxng-docker
cd searxng-docker
docker compose up -d
```

Enable JSON output (required for the eval — the API won't work without it):
Edit settings.yml inside your SearXNG container or volume and ensure:
```yaml
search:
  formats:
    - html
    - json
```

Then restart: `docker restart searxng`
Verify it's working:
```bash
curl "http://localhost:4000/search?q=test&format=json" | python3 -m json.tool | head -10
# Should return {"query": "test", "results": [...]}
```

If you get `{"error": "..."}` or a 403, JSON output is not enabled yet.
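The same check can be scripted. A minimal sketch using only the Python standard library; the function name is mine, and the base URL is an assumption to point at your instance:

```python
import json
import urllib.parse
import urllib.request

def searxng_json_ok(base_url: str, query: str = "test", timeout: float = 10.0) -> bool:
    """Return True if a SearXNG instance answers JSON search queries.

    base_url is wherever you mapped the container, e.g. "http://localhost:4000".
    """
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    url = f"{base_url}/search?{params}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # HTTPError (e.g. 403 when JSON output is disabled) is an OSError
        # subclass; ValueError covers malformed JSON.
        return False
    return isinstance(data, dict) and "results" in data
```

If this returns False while the HTML UI works, the `json` format is almost certainly missing from `settings.yml`.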
Update the URL in the tests — the script uses a placeholder YOUR_SEARXNG_HOST:PORT. Replace it before running:
```bash
# macOS
sed -i '' 's|YOUR_SEARXNG_HOST:PORT|localhost:4000|g' model-eval.py
# Linux
sed -i 's|YOUR_SEARXNG_HOST:PORT|localhost:4000|g' model-eval.py
```

Repo: github.com/chroma-core/chroma | Docs: docs.trychroma.com
ChromaDB is an open-source vector database used by OpenClaw for long-term conversational memory. Required for T1.4 (search) and T3.2 (write + read round-trip).
Install:
```bash
pip install chromadb
```

Run on port 8100 (the port OpenClaw and these tests expect):

```bash
chroma run --host 0.0.0.0 --port 8100
```

If you want it to run persistently in the background, use a tool like `screen` or `tmux`, or create a systemd/launchd service.
Create the required collection — the tests look for a collection named exactly longterm_memory. Create it once:
```bash
python3 -c "
import chromadb
client = chromadb.HttpClient(host='localhost', port=8100)
client.get_or_create_collection('longterm_memory')
print('Collection ready')
"
```

Verify both the server and collection:
```bash
# Server up?
curl http://localhost:8100/api/v2/heartbeat
# {"nanosecond heartbeat": ...}

# Collection exists?
curl http://localhost:8100/api/v2/collections
# Should include "longterm_memory" in the response
```

Note: The tests use the ChromaDB v2 API (`/api/v2/`). Make sure you're running ChromaDB 0.5.0 or later.
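The write-and-read path that T3.2 exercises can also be sanity-checked from a script. A minimal sketch; the helper name is mine, and it assumes the `chromadb` Python client installed above:

```python
def memory_round_trip(client, doc_id: str, text: str) -> bool:
    """Add one document to the longterm_memory collection, then query it back.

    `client` is a chromadb client, e.g.:
        import chromadb
        client = chromadb.HttpClient(host="localhost", port=8100)
    Returns True if the document comes back among the top hit ids.
    """
    coll = client.get_or_create_collection("longterm_memory")
    coll.add(ids=[doc_id], documents=[text])
    # query() returns one id list per query text, hence ["ids"][0]
    hits = coll.query(query_texts=[text], n_results=1)
    return doc_id in hits["ids"][0]
```

If this returns False against a live server, check the server logs: the default embedding function downloads a small model on first use, and a failed download makes queries come back empty.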
Package: @tobilu/qmd on npm
QMD is OpenClaw's workspace memory system — BM25 + vector search over your markdown files. Required for T1.3 (memory_search) and T4.1 (full pipeline).
Install:
```bash
npm install -g @tobilu/qmd
```

Index your workspace:

```bash
# Point at a directory containing .md files
qmd update   # scans for markdown files
qmd embed    # runs embeddings (downloads a ~329MB model on first run)
```

Verify:

```bash
qmd search "test query"
# Should return results if you have indexed content
```

If T1.3 returns "no results found" — that's expected if your workspace has no content. Either skip this test or add a few .md files and re-run `qmd update && qmd embed`.
Package: @tobilu/gog on npm
gog is a Google Workspace CLI (Sheets, Drive, Gmail, Calendar). Required for T1.8, T1.9 (Sheets read/write), and T3.1 (cross-tool Sheets task). Uses OAuth — no API key needed, just a Google account.
Install:
```bash
npm install -g @tobilu/gog
```

Authenticate:

```bash
gog auth login
# Opens browser for Google OAuth — sign in with the account that owns your sheet
```

Create a throwaway test sheet (don't point at a real sheet — tests write rows):

```bash
gog sheets create "OpenClaw Eval Test Sheet"
# Output includes the sheet ID, e.g.:
# Created: https://docs.google.com/spreadsheets/d/1abc.../edit
#                                                ^^^^^ this is EVAL_TEST_SHEET_ID
export EVAL_TEST_SHEET_ID=1abc...
```

Verify auth is working:

```bash
gog sheets get $EVAL_TEST_SHEET_ID "Sheet1!A1"
# Should return the cell value (empty is fine for a new sheet)
```

| Your setup | Command |
|---|---|
| Just OpenClaw, no extras | python3 model-eval.py --tier 0 5 6 --label "baseline" |
| OpenClaw + exec skill | python3 model-eval.py --tier 0 1 2 --fast --label "no-network" |
| Full stack (SearXNG + ChromaDB + QMD) | python3 model-eval.py --tier 0 1 2 3 4 --label "full" |
| Everything including writes | python3 model-eval.py --all --label "complete" |
The --fast flag skips any test that requires an external network call (SearXNG, web_fetch, Sheets).
```bash
# Run baseline (Tier 0 + 1) — fast, no external calls
python3 model-eval.py --tier 0 1 --fast --label "baseline"

# Run all tiers
python3 model-eval.py --label "full-run"

# Test a specific model (swaps primary, restores after)
python3 model-eval.py --tier 0 1 2 --model "localgpu/qwen3.5:35b-nothink" --label "qwen-test"

# Stress test — run 3x for reliability baseline
python3 model-eval.py --tier 0 1 --repeat 3 --label "stress-test"

# Adversarial tests
python3 model-eval.py --tier 5 --label "adversarial"

# AI Horizon domain tests
python3 model-eval.py --tier 6 --label "domain"

# Specific test IDs
python3 model-eval.py --test T0.1 T1.3 T5.2

# Dry run — see what would run without executing
python3 model-eval.py --dry-run

# Compare two runs
python3 model-eval.py --compare run-id-1 run-id-2
```

Passing `--model` temporarily swaps `agents.list[id=main].model.primary` in openclaw.json, restarts the gateway (~30s), runs the eval, then restores the original model. The gateway is unavailable during the swap.
```bash
python3 model-eval.py --tier 0 1 --model "localgpu/qwen3.5:35b-nothink"
```

If interrupted mid-run, restore manually:

```bash
# Check current primary
grep -A2 '"id": "main"' ~/.openclaw/openclaw.json | grep primary

# Edit openclaw.json → agents.list[id=main].model.primary → your original model

# Then restart gateway
launchctl kickstart -k gui/$UID/ai.openclaw.gateway
```

Results are saved to `~/.openclaw/workspace/eval-results/` as JSON + Markdown.
```
eval-results/
  20260312-114552-ee2646.json   # raw results with token counts, timings
  20260312-114552-ee2646.md     # formatted report
```
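If you want to aggregate several runs programmatically, the JSON files can be summarized with a short script. This is a sketch: the exact schema isn't documented here, so the `results` list and `passed` key are assumptions; inspect one file and adjust the keys to match.

```python
import json
from pathlib import Path

def summarize(results_dir: str) -> list[str]:
    """Return one 'file: passed/total' line per results JSON in a directory.

    Assumes each file carries a top-level "results" list of dicts with a
    boolean "passed" field (an assumption; check your files).
    """
    lines = []
    for path in sorted(Path(results_dir).expanduser().glob("*.json")):
        data = json.loads(path.read_text())
        results = data.get("results", [])
        passed = sum(1 for r in results if r.get("passed"))
        lines.append(f"{path.name}: {passed}/{len(results)} passed")
    return lines

if __name__ == "__main__":
    print("\n".join(summarize("~/.openclaw/workspace/eval-results")))
```

This gives a quick pass-rate trend across labeled runs without opening each Markdown report.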
Post results to Slack after a run:

```bash
python3 model-eval.py --tier 0 1 --slack --label "weekly-check"
```

Tests are defined in the `TESTS` list in model-eval.py. Each test is a dict:
```python
{
    "id": "T5.6",                                 # unique ID
    "tier": 5,                                    # tier number
    "name": "My test",                            # short description
    "prompt": "...",                              # sent to the agent
    "expect_contains": ["keyword1", "keyword2"],  # pass if ANY matched
    "expect_json": True,                          # optional: validate JSON in response
    "timeout": 180,                               # seconds
    "fast": True,                                 # True = no external API calls
    "destructive": False,                         # True = skipped unless --all
}
```

Pass logic: a test passes if at least one `expect_contains` keyword appears in the response (case-insensitive).
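That pass rule is simple enough to restate in a few lines. A reimplementation for illustration (the authoritative version lives in model-eval.py):

```python
def passes(response: str, expect_contains: list[str]) -> bool:
    """Case-insensitive check: pass if ANY expected keyword appears."""
    lowered = response.lower()
    return any(kw.lower() in lowered for kw in expect_contains)

# e.g. passes("The capital is Paris.", ["paris", "lyon"]) -> True
```

When writing new tests, pick keywords that can only appear in a correct answer: with ANY-match semantics, one overly generic keyword makes a test trivially passable.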
Use `--repeat N` to run the same tests multiple times and measure consistency:

```bash
python3 model-eval.py --tier 0 1 --repeat 5 --label "reliability"
```

Each repeat gets a fresh isolated session. Results show all N runs — look for any failures across repeats to identify flaky behavior.
This eval suite was built while getting qwen3.5:35b-nothink working reliably in OpenClaw. The key discovery: qwen3.x models in streaming mode emit output in the reasoning field, not content, causing OpenClaw to silently fall through to the next model. A thin Ollama proxy fixes this.
Full write-up: gist.github.com/TheAIHorizon/37c30e375f2ce08e726e4bb6347f26b1
Maintained by The AI Horizon — forecasting AI's impact on the workforce.