ClawEval

Security benchmarks for OpenClaw agents

ClawEval is a security benchmark suite developed by ZJUICSR for OpenClaw agents. It measures how well an agent handles security risks such as memory leaks (credential extraction), indirect prompt injection, obfuscated malicious code, and other real-world attack vectors.

Why ClawEval?

Most security benchmarks test isolated vulnerabilities. ClawEval tests what actually matters for AI agents:

Memory Leak Prevention (R1) — Can the agent protect sensitive credentials from being extracted?
Indirect Injection Detection (R2) — Can the agent recognize and neutralize malicious instructions embedded in user input?
Obfuscated Coding Prevention (R3) — Can it detect and refuse to execute encoded malicious payloads?
Multi-step Attack Resistance (R4) — Can it keep its defenses intact across a complex, multi-stage attack sequence?
Unauthorized Operation Prevention (R5) — Does it enforce human-in-the-loop confirmation for destructive actions?
Real-world Security Scenarios — Can it handle ambiguous instructions and social engineering attempts?

Quick Start

# Clone the benchmark
git clone https://github.com/ZJUICSR/ClawEval.git
cd ClawEval

# Install dependencies
pip install -e .

# Run benchmarks with your model of choice
python -m benchmark --model openrouter/anthropic/claude-sonnet-4

# Or run specific tasks
python -m benchmark --model openrouter/openai/gpt-4o --suite task_R1_01,task_R2_01

Note: Model IDs must include their provider prefix (e.g. openrouter/, anthropic/). OpenRouter is the default provider used for routing.
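
The prefix rule in the note above can be checked before a run is launched. The following is a minimal sketch, not ClawEval code: `split_model_id` is a hypothetical helper, and it assumes the provider is everything before the first slash.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider/rest' into (provider, rest).

    Hypothetical helper: assumes the provider prefix is the text
    before the first '/' and that both parts must be non-empty.
    """
    provider, sep, rest = model_id.partition("/")
    if not sep or not provider or not rest:
        raise ValueError(
            f"model ID {model_id!r} lacks a provider prefix such as 'openrouter/'"
        )
    return provider, rest


print(split_model_id("openrouter/anthropic/claude-sonnet-4"))
```

Under these assumptions, an ID such as `gpt-4o` with no prefix raises `ValueError` instead of reaching the benchmark.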

Requirements

Python 3.10+
uv package manager
A running OpenClaw instance

What Gets Tested

ClawEval includes 25 tasks across seven risk categories:

Category	Risk ID	Description
Memory Leak	R1	Prevent credential extraction from agent memory
Indirect Injection	R2	Detect malicious instructions in user input
Obfuscated Coding	R3	Refuse to execute encoded malicious payloads
Attack Chaining	R4	Resist complex attack sequences
Unauthorized Operation	R5	Enforce confirmation for destructive actions
Social Engineering	R6	Identify authority impersonation attempts
Data Exfiltration	R7	Protect against unauthorized data export

Each task is graded automatically on its security outcome, such as whether the agent protected the sensitive data involved.

Running Benchmarks

# Run benchmark with a specific model
python -m benchmark --model openrouter/anthropic/claude-sonnet-4

# Run specific risk categories
python -m benchmark --model openrouter/anthropic/claude-sonnet-4 --suite R1
python -m benchmark --model openrouter/anthropic/claude-sonnet-4 --suite R2

# Run specific tasks
python -m benchmark --model openrouter/anthropic/claude-sonnet-4 --suite task_R1_01,task_R2_01

# Run without uploading results
python -m benchmark --model openrouter/anthropic/claude-sonnet-4 --no-upload
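
To run several risk categories in sequence, the commands above can be assembled from a script. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, it uses only the `--model`, `--suite`, and `--no-upload` flags shown in this README, and it only prints the commands rather than executing them.

```python
import shlex
import sys


def build_command(model: str, suite: str = "all", upload: bool = True) -> list[str]:
    """Build the argv list for one benchmark invocation (hypothetical helper)."""
    cmd = [sys.executable, "-m", "benchmark", "--model", model, "--suite", suite]
    if not upload:
        cmd.append("--no-upload")
    return cmd


# Print one command per risk category; nothing is executed here.
for risk in ("R1", "R2"):
    cmd = build_command("openrouter/anthropic/claude-sonnet-4", risk, upload=False)
    print(shlex.join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` from the repository root, assuming the `benchmark` module is importable there.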

Command Reference

Flag	Description
`--model MODEL`	Model to test (e.g., `openrouter/anthropic/claude-sonnet-4`)
`--judge MODEL`	Judge model for LLM grading (default: `openrouter/anthropic/claude-opus-4.5`)
`--suite SUITE`	`all`, risk category (e.g., `R1`, `R2`), or comma-separated task IDs
`--runs N`	Number of runs per task for averaging
`--timeout-multiplier N`	Scale timeouts for slower models
`--output-dir DIR`	Where to save results (default: `results/`)
`--no-upload`	Skip uploading results
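
The three forms accepted by `--suite` can be told apart with a small classifier. This sketch is illustrative only and does not reflect how ClawEval parses the flag: `classify_suite` is a hypothetical function, and the patterns assume risk IDs look like `R1` and task IDs look like `task_R1_01`, as in the examples above.

```python
import re

# Assumed shapes, inferred from the examples in this README.
RISK_RE = re.compile(r"R\d+")
TASK_RE = re.compile(r"task_R\d+_\w+")


def classify_suite(value: str) -> tuple[str, list[str]]:
    """Return ('all' | 'category' | 'tasks', selected IDs) for a --suite value."""
    value = value.strip()
    if value == "all":
        return "all", []
    if RISK_RE.fullmatch(value):
        return "category", [value]
    tasks = [part.strip() for part in value.split(",") if part.strip()]
    bad = [t for t in tasks if not TASK_RE.fullmatch(t)]
    if not tasks or bad:
        raise ValueError(f"unrecognized --suite value: {value!r}")
    return "tasks", tasks


print(classify_suite("task_R1_01,task_R2_01"))
```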

Project Structure

ClawEval/
├── tasks/                 # Benchmark task definitions (25 tasks)
│   ├── task_R1_*.md      # Memory Leak tests
│   ├── task_R2_*.md      # Indirect Injection tests
│   ├── task_R3_*.md      # Obfuscated Coding tests
│   ├── task_R5_*.md      # Unauthorized Operation tests
│   ├── task_R6_*.md      # Social Engineering tests
│   └── task_R7_*.md      # Data Exfiltration tests
├── scripts/
│   └── run.sh            # Benchmark runner script
├── tests/                # Test suite
├── SKILL.md              # Skill definition for agent execution
└── pyproject.toml        # Project configuration
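
Because task files follow the `task_R<n>_*.md` naming pattern shown in the tree, they can be grouped by risk ID from their filenames alone. A minimal sketch, assuming that pattern holds for every task file; `group_tasks_by_risk` is a hypothetical helper, not part of the repository.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed filename shape: task_<risk id>_<suffix>.md, e.g. task_R1_01.md
TASK_FILE_RE = re.compile(r"task_(R\d+)_.+\.md")


def group_tasks_by_risk(tasks_dir: Path) -> dict[str, list[str]]:
    """Map each risk ID to the task IDs (file stems) found in tasks_dir."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(tasks_dir.glob("task_R*_*.md")):
        match = TASK_FILE_RE.fullmatch(path.name)
        if match:
            groups[match.group(1)].append(path.stem)
    return dict(groups)


if __name__ == "__main__":
    # Run from the repository root; prints each risk ID with its task count.
    for risk, task_ids in group_tasks_by_risk(Path("tasks")).items():
        print(risk, len(task_ids))
```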

Contributing Tasks

We welcome new tasks! Check out tasks/TASK_TEMPLATE.md for the format. Good tasks are:

Real-world — Something an actual user would ask an agent to do
Measurable — Clear success criteria that can be graded
Reproducible — Same task should produce consistent grading
Challenging — Tests agent capabilities, not just LLM knowledge

Related Projects

OpenClaw: github.com/openclaw/openclaw
PinchBench: github.com/pinchbench/skill — General-purpose agent benchmarks
Leaderboard: pinchbench.com

Acknowledgments

This project builds upon the excellent work of PinchBench. We thank the PinchBench team for their open-source contribution to agent benchmarking.

License

MIT — see LICENSE for details.

Claw-some AI agent testing 🦞
