skill-bench

TOML-based Claude skill test runner (nextest-style)

Overview

skill-bench is a Rust-based test runner that executes test cases defined in TOML files. It is designed for testing Claude Code skills.

Distributed as a standalone binary, so users don't need a Rust environment.

Features

  • TOML test definitions: Write tests in TOML without writing Rust code
  • nextest-style interface: Parallel execution, filtering, failure history management
  • Fast parallel execution: Multi-threaded execution via rayon
  • Rich assertions: 18 verification types covering skills, MCP, tools, files, logs, and databases
  • Embedded assets: Harness plugin embedded in binary for standalone distribution

Installation

Quick install (Linux / macOS)

curl -fsSL https://raw.githubusercontent.com/sonesuke/skill-bench/main/scripts/setup.sh | sh

Supports Linux (x86_64, aarch64) and macOS (aarch64).

Build from source

cargo build --release
cargo install --path .

Usage

List tests

skill-bench list

Run tests

# Run all tests (uses default pattern: cases)
skill-bench run

# Run tests from specific directory (automatically finds all .toml files)
skill-bench run "cases"
skill-bench run "cases/concept-interviewing"

# Or use explicit glob pattern
skill-bench run "cases/**/*.toml"

# Filter by test name (regex)
skill-bench run --filter "functional-.*"

# Filter by skill name
skill-bench run --skill investigation-recording

# Rerun only failed tests
skill-bench run --rerun-failed

# Specify parallel threads
skill-bench run --threads 4

# Specify plugin directory (path to directory containing .claude-plugin/)
skill-bench run --plugin-dir ./patent-kit

# Persist Claude session logs to directory
skill-bench run --log logs    # Save logs to logs/ directory
skill-bench run -l .          # Short form: save to current directory

Display Timeline

# Display timeline of test execution from log file
skill-bench timeline /path/to/log-file.log

# Show detailed output
skill-bench timeline /path/to/log-file.log --verbose

Test Pattern Syntax

The pattern argument accepts either a directory or glob pattern:

  • cases - All TOML files recursively under cases/ (directory mode)
  • cases/concept-interviewing - All TOML files in a specific subdirectory
  • cases/**/*.toml - Explicit glob pattern for all TOML files recursively
  • cases/*/*.toml - TOML files only in immediate subdirectories
  • cases/functional-*.toml - TOML files matching a specific pattern

When a directory is specified, all .toml files are automatically found recursively.
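The resolution rule above can be sketched as follows (a minimal Python illustration of the directory-vs-glob behavior, not skill-bench's actual Rust implementation):

```python
import glob
from pathlib import Path

def resolve_pattern(pattern: str) -> list[str]:
    """Resolve a test pattern argument to a sorted list of TOML files."""
    p = Path(pattern)
    if p.is_dir():
        # Directory mode: find all .toml files recursively
        return sorted(str(f) for f in p.rglob("*.toml"))
    # Otherwise treat the argument as a glob pattern
    return sorted(glob.glob(pattern, recursive=True))
```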

List tests

# List all tests (uses default pattern: cases)
skill-bench list

# List tests from specific directory
skill-bench list "cases/concept-interviewing"

Directory Structure

skill-bench/
├── Cargo.toml            # Rust project config
├── src/                  # Source code
│   ├── main.rs           # Entry point
│   ├── cli/              # CLI argument definitions
│   ├── models/           # Data models
│   ├── runtime/          # Test discovery & execution
│   ├── assertions/       # Assertion library
│   ├── output/           # Result reporting
│   ├── state/            # Failure history management
│   └── assets/           # Embedded assets
│       └── harness-plugin/
├── cases/                # TOML test cases
│   ├── claim-analyzing/
│   ├── concept-interviewing/
│   └── ...
└── target/               # Build output

TOML Test Format

name = "test-name"
description = "Test description"
timeout = 120  # seconds

test_prompt = """
Test prompt here...
"""

# Setup steps (executed in order before the test)
[[setup]]
name = "create-sample-file"
path = "test.txt"
content = "File content"

[[setup]]
name = "prepare-directory"
command = "mkdir -p subdir && echo 'done' > subdir/file.txt"

[[checks]]
name = "check_name"
command = { command = "skill-invoked", skill = "skill-name" }

[answers]
"question_key" = "answer_value"

Setup Reference

Setup steps run in the test workspace before the test prompt is executed. Steps are executed in the order they appear. If any step fails, the entire test fails.

Two types are supported, distinguished automatically by the fields present:

  • File (path + content) — Creates a file (and any missing parent directories) in the workspace
  • Script (command) — Executes a shell command via bash -c in the workspace

# File setup
[[setup]]
name = "optional-descriptive-name"
path = "dir/subdir/file.txt"
content = "File content here"

# Script setup
[[setup]]
name = "optional-descriptive-name"
command = "echo 'Hello' > greeting.txt && mkdir -p output"
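The field-based dispatch described above can be sketched like this (an illustrative Python snippet, not the actual Rust implementation):

```python
def setup_kind(step: dict) -> str:
    """Classify a [[setup]] step by which fields are present."""
    if "path" in step and "content" in step:
        return "file"    # write `content` to `path` in the workspace
    if "command" in step:
        return "script"  # run the command via `bash -c` in the workspace
    raise ValueError("setup step needs either path+content or command")
```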

Check Reference

Run skill-bench help <type> for detailed help on any check type.

Skill Verification

  • skill-loaded — Skill was loaded
  • skill-invoked — Skill was invoked

[[checks]]
name = "check-name"
command = { command = "skill-invoked", skill = "my-skill" }

MCP Verification

  • mcp-loaded — MCP server was loaded
  • mcp-tool-invoked — MCP tool was invoked
  • mcp-success — MCP tool succeeded

[[checks]]
name = "check-name"
command = { command = "mcp-loaded", server = "filesystem" }

Tool Verification

  • tool-use — Tool was called (partial match)
  • tool-param — Tool was called with a specific parameter

[[checks]]
name = "check-name"
command = { command = "tool-use", tool = "Read" }

[[checks]]
name = "check-param"
command = { command = "tool-param", tool = "Read", param = "file_path", value = "test.txt" }

File Verification

  • workspace-file — File exists in workspace
  • workspace-dir — Directory exists in workspace
  • file-contains — File contains string

[[checks]]
name = "check-name"
command = { command = "workspace-file", path = "output.txt" }

[[checks]]
name = "check-content"
command = { command = "file-contains", file = "output.txt", contains = "expected text" }

Log Verification

  • log-contains — Log contains regex pattern
  • message-contains — Assistant output contains text

[[checks]]
name = "check-name"
command = { command = "log-contains", pattern = "error|failed" }

[[checks]]
name = "check-output"
command = { command = "message-contains", text = "expected output" }

Database Verification

  • db-query — SQL query result verification
    • Numeric comparisons: ">0", ">=5", "=10", "<3", "<=2"

[[checks]]
name = "check-name"
command = { command = "db-query", db = "patents.db", query = "SELECT COUNT(*) FROM patents", expected = ">0" }
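The comparison strings above could be evaluated along these lines (an illustrative Python sketch, not skill-bench's actual parser):

```python
import re

def check_expected(actual: float, expected: str) -> bool:
    """Evaluate a comparison spec like '>0', '>=5', '=10', '<3', '<=2'."""
    m = re.fullmatch(r"(>=|<=|>|<|=)\s*(-?\d+(?:\.\d+)?)", expected.strip())
    if not m:
        raise ValueError(f"bad expected spec: {expected!r}")
    op, num = m.group(1), float(m.group(2))
    return {
        ">": actual > num, ">=": actual >= num,
        "=": actual == num, "<": actual < num, "<=": actual <= num,
    }[op]
```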

Negative Assertions

Use deny = true on any check to invert the assertion:

[[checks]]
name = "should-not-contain-error"
command = { command = "file-contains", file = "output.txt", contains = "error" }
deny = true

Failure History

Failed test history is saved to .skill-bench/test-history.json. Use --rerun-failed to re-run only tests that failed in the previous run.

License

MIT
