skill-bench

TOML-based Claude skill test runner (nextest-style)

Overview

skill-bench is a Rust-based test runner that executes test cases defined in TOML files. It is designed for testing Claude Code skills.

Distributed as a standalone binary, so users don't need a Rust environment.

Features

  • TOML test definitions: Write tests in TOML without writing Rust code
  • nextest-style interface: Parallel execution, filtering, failure history management
  • Fast parallel execution: Multi-threaded execution via rayon
  • Rich assertions: 18 verification types covering skills, MCP, tools, files, logs, and databases
  • Embedded assets: Harness plugin embedded in binary for standalone distribution

Installation

Quick install (Linux / macOS)

curl -fsSL https://raw.githubusercontent.com/sonesuke/skill-bench/main/scripts/setup.sh | sh

Supports Linux (x86_64, aarch64) and macOS (aarch64).

Build from source

cargo build --release
cargo install --path .

Usage

List tests

skill-bench list

Run tests

# Run all tests (uses default pattern: cases)
skill-bench run

# Run tests from specific directory (automatically finds all .toml files)
skill-bench run "cases"
skill-bench run "cases/concept-interviewing"

# Or use explicit glob pattern
skill-bench run "cases/**/*.toml"

# Filter by test name (regex)
skill-bench run --filter "functional-.*"

# Filter by skill name
skill-bench run --skill investigation-recording

# Rerun only failed tests
skill-bench run --rerun-failed

# Specify parallel threads
skill-bench run --threads 4

# Specify plugin directory (path to directory containing .claude-plugin/)
skill-bench run --plugin-dir ./patent-kit

# Persist Claude session logs to directory
skill-bench run --log logs    # Save logs to logs/ directory
skill-bench run -l .          # Short form: save to current directory

Display Timeline

# Display timeline of test execution from log file
skill-bench timeline /path/to/log-file.log

# Show detailed output
skill-bench timeline /path/to/log-file.log --verbose

Test Pattern Syntax

The pattern argument accepts either a directory or glob pattern:

  • cases - All TOML files recursively under cases/ (directory mode)
  • cases/concept-interviewing - All TOML files in a specific subdirectory
  • cases/**/*.toml - Explicit glob pattern for all TOML files recursively
  • cases/*/*.toml - TOML files only in immediate subdirectories
  • cases/functional-*.toml - TOML files matching a specific pattern

When a directory is specified, all .toml files are automatically found recursively.
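The resolution rule above can be sketched as follows (a minimal Python illustration of the directory-vs-glob behavior, not skill-bench's actual Rust implementation):

```python
import glob
from pathlib import Path

def resolve_pattern(pattern: str) -> list[str]:
    """Resolve a test pattern argument to a sorted list of TOML files."""
    p = Path(pattern)
    if p.is_dir():
        # Directory mode: find all .toml files recursively
        return sorted(str(f) for f in p.rglob("*.toml"))
    # Otherwise treat the argument as a glob pattern
    return sorted(glob.glob(pattern, recursive=True))
```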

List tests

# List all tests (uses default pattern: cases)
skill-bench list

# List tests from specific directory
skill-bench list "cases/concept-interviewing"

Directory Structure

skill-bench/
├── Cargo.toml            # Rust project config
├── src/                  # Source code
│   ├── main.rs           # Entry point
│   ├── cli/              # CLI argument definitions
│   ├── models/           # Data models
│   ├── runtime/          # Test discovery & execution
│   ├── assertions/       # Assertion library
│   ├── output/           # Result reporting
│   ├── state/            # Failure history management
│   └── assets/           # Embedded assets
│       └── harness-plugin/
├── cases/                # TOML test cases
│   ├── claim-analyzing/
│   ├── concept-interviewing/
│   └── ...
└── target/               # Build output

TOML Test Format

name = "test-name"
description = "Test description"
timeout = 120  # seconds

test_prompt = """
Test prompt here...
"""

# Setup steps (executed in order before the test)
[[setup]]
name = "create-sample-file"
path = "test.txt"
content = "File content"

[[setup]]
name = "prepare-directory"
command = "mkdir -p subdir && echo 'done' > subdir/file.txt"

[[checks]]
name = "check_name"
command = { command = "skill-invoked", skill = "skill-name" }

[answers]
"question_key" = "answer_value"

Setup Reference

Setup steps run in the test workspace before the test prompt is executed. Steps are executed in the order they appear. If any step fails, the entire test fails.

Two types are supported, distinguished automatically by the fields present:

  • File (path + content) — Creates a file (and any missing parent directories) in the workspace
  • Script (command) — Executes a shell command via bash -c in the workspace

# File setup
[[setup]]
name = "optional-descriptive-name"
path = "dir/subdir/file.txt"
content = "File content here"

# Script setup
[[setup]]
name = "optional-descriptive-name"
command = "echo 'Hello' > greeting.txt && mkdir -p output"
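The field-based dispatch described above can be sketched like this (an illustrative Python snippet, not the actual Rust implementation):

```python
def setup_kind(step: dict) -> str:
    """Classify a [[setup]] step by which fields are present."""
    if "path" in step and "content" in step:
        return "file"    # write `content` to `path` in the workspace
    if "command" in step:
        return "script"  # run the command via `bash -c` in the workspace
    raise ValueError("setup step needs either path+content or command")
```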

Check Reference

Run skill-bench help <type> for detailed help on any check type.

Skill Verification

  • skill-loaded — Skill was loaded
  • skill-invoked — Skill was invoked

[[checks]]
name = "check-name"
command = { command = "skill-invoked", skill = "my-skill" }

MCP Verification

  • mcp-loaded — MCP server was loaded
  • mcp-tool-invoked — MCP tool was invoked
  • mcp-success — MCP tool succeeded

[[checks]]
name = "check-name"
command = { command = "mcp-loaded", server = "filesystem" }

Tool Verification

  • tool-use — Tool was called (partial match)
  • tool-param — Tool was called with a specific parameter

[[checks]]
name = "check-name"
command = { command = "tool-use", tool = "Read" }

[[checks]]
name = "check-param"
command = { command = "tool-param", tool = "Read", param = "file_path", value = "test.txt" }

File Verification

  • workspace-file — File exists in workspace
  • workspace-dir — Directory exists in workspace
  • file-contains — File contains string

[[checks]]
name = "check-name"
command = { command = "workspace-file", path = "output.txt" }

[[checks]]
name = "check-content"
command = { command = "file-contains", file = "output.txt", contains = "expected text" }

Log Verification

  • log-contains — Log contains regex pattern
  • message-contains — Assistant output contains text

[[checks]]
name = "check-name"
command = { command = "log-contains", pattern = "error|failed" }

[[checks]]
name = "check-output"
command = { command = "message-contains", text = "expected output" }

Database Verification

  • db-query — SQL query result verification
    • Numeric comparisons: ">0", ">=5", "=10", "<3", "<=2"

[[checks]]
name = "check-name"
command = { command = "db-query", db = "patents.db", query = "SELECT COUNT(*) FROM patents", expected = ">0" }
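The comparison strings above could be evaluated along these lines (an illustrative Python sketch, not skill-bench's actual parser):

```python
import re

def check_expected(actual: float, expected: str) -> bool:
    """Evaluate a comparison spec like '>0', '>=5', '=10', '<3', '<=2'."""
    m = re.fullmatch(r"(>=|<=|>|<|=)\s*(-?\d+(?:\.\d+)?)", expected.strip())
    if not m:
        raise ValueError(f"bad expected spec: {expected!r}")
    op, num = m.group(1), float(m.group(2))
    return {
        ">": actual > num, ">=": actual >= num,
        "=": actual == num, "<": actual < num, "<=": actual <= num,
    }[op]
```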

Negative Assertions

Use deny = true on any check to invert the assertion:

[[checks]]
name = "should-not-contain-error"
command = { command = "file-contains", file = "output.txt", contains = "error" }
deny = true

Failure History

Failed test history is saved to .skill-bench/test-history.json. Use --rerun-failed to re-run only tests that failed in the previous run.

License

MIT
