TOML-based Claude skill test runner (nextest-style)
skill-bench is a Rust-based test runner that executes test cases defined in TOML files. It is designed for testing Claude Code skills.
It is distributable as a standalone binary, so users don't need a Rust environment.
- TOML test definitions: Write tests in TOML without writing Rust code
- nextest-style interface: Parallel execution, filtering, failure history management
- Fast parallel execution: Multi-threaded execution via rayon
- Rich assertions: 18 verification types (skills, MCP, tools, files, databases)
- Embedded assets: Harness plugin embedded in binary for standalone distribution
curl -fsSL https://raw.githubusercontent.com/sonesuke/skill-bench/main/scripts/setup.sh | sh
Supports Linux (x86_64, aarch64) and macOS (aarch64).
cargo build --release
cargo install --path .
skill-bench list
# Run all tests (uses default pattern: cases)
skill-bench run
# Run tests from specific directory (automatically finds all .toml files)
skill-bench run "cases"
skill-bench run "cases/concept-interviewing"
# Or use explicit glob pattern
skill-bench run "cases/**/*.toml"
# Filter by test name (regex)
skill-bench run --filter "functional-.*"
# Filter by skill name
skill-bench run --skill investigation-recording
# Rerun only failed tests
skill-bench run --rerun-failed
# Specify parallel threads
skill-bench run --threads 4
# Specify plugin directory (path to directory containing .claude-plugin/)
skill-bench run --plugin-dir ./patent-kit
# Persist Claude session logs to directory
skill-bench run --log logs # Save logs to logs/ directory
skill-bench run -l . # Short form: save to current directory
# Display timeline of test execution from log file
skill-bench timeline /path/to/log-file.log
# Show detailed output
skill-bench timeline /path/to/log-file.log --verbose
The pattern argument accepts either a directory or glob pattern:
- cases - All TOML files recursively under cases/ (directory mode)
- cases/concept-interviewing - All TOML files in a specific subdirectory
- cases/**/*.toml - Explicit glob pattern for all TOML files recursively
- cases/*/*.toml - TOML files only in immediate subdirectories
- cases/functional-*.toml - TOML files matching a specific pattern
When a directory is specified, all .toml files are automatically found recursively.
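The directory-or-glob rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not skill-bench's actual implementation (which is written in Rust); the helper name `resolve_pattern` is hypothetical.

```python
import glob
from pathlib import Path


def resolve_pattern(pattern: str = "cases") -> list:
    """Return sorted TOML paths for a directory or a glob pattern.

    Hypothetical helper that mirrors the rule described above.
    """
    path = Path(pattern)
    if path.is_dir():
        # Directory mode: collect every .toml file below the directory.
        matches = [str(p) for p in path.rglob("*.toml")]
    else:
        # Glob mode: "**" only spans nested directories when recursive=True.
        matches = glob.glob(pattern, recursive=True)
    return sorted(matches)
```

Under this sketch, `resolve_pattern("cases")` and `resolve_pattern("cases/**/*.toml")` select the same files, while `cases/*/*.toml` stops at the immediate subdirectories.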
# List all tests (uses default pattern: cases)
skill-bench list
# List tests from specific directory
skill-bench list "cases/concept-interviewing"
skill-bench/
├── Cargo.toml # Rust project config
├── src/ # Source code
│ ├── main.rs # Entry point
│ ├── cli/ # CLI argument definitions
│ ├── models/ # Data models
│ ├── runtime/ # Test discovery & execution
│ ├── assertions/ # Assertion library
│ ├── output/ # Result reporting
│ ├── state/ # Failure history management
│ └── assets/ # Embedded assets
│     └── harness-plugin/
├── cases/ # TOML test cases
│ ├── claim-analyzing/
│ ├── concept-interviewing/
│ └── ...
└── target/ # Build output
name = "test-name"
description = "Test description"
timeout = 120 # seconds
test_prompt = """
Test prompt here...
"""
# Setup steps (executed in order before the test)
[[setup]]
name = "create-sample-file"
path = "test.txt"
content = "File content"
[[setup]]
name = "prepare-directory"
command = "mkdir -p subdir && echo 'done' > subdir/file.txt"
[[checks]]
name = "check_name"
command = { command = "skill-invoked", skill = "skill-name" }
[answers]
"question_key" = "answer_value"
Setup steps run in the test workspace before the test prompt is executed. Steps are executed in the order they appear. If any step fails, the entire test fails.
Two types are supported, distinguished automatically by the fields present:
- File (path + content): Creates a file (and any missing parent directories) in the workspace
- Script (command): Executes a shell command via bash -c in the workspace
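As a rough model of the two setup types, the Python sketch below classifies a step by the fields it carries and applies steps in order, stopping at the first failure. The function names `setup_kind` and `run_setup` are hypothetical, and the error handling is an assumption; the real runner may behave differently in detail.

```python
import subprocess
from pathlib import Path


def setup_kind(step: dict) -> str:
    """Classify a [[setup]] table by the fields present."""
    if "path" in step and "content" in step:
        return "file"
    if "command" in step:
        return "script"
    raise ValueError(f"unrecognised setup step: {step!r}")


def run_setup(steps: list, workspace: Path) -> None:
    """Apply steps in order; the first failure raises and skips the rest."""
    for step in steps:
        if setup_kind(step) == "file":
            target = workspace / step["path"]
            # Create any missing parent directories before writing.
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(step["content"])
        else:
            # check=True turns a non-zero exit status into an exception.
            subprocess.run(["bash", "-c", step["command"]], cwd=workspace, check=True)
```

The optional name field plays no role in the classification here; it is treated purely as a label.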
# File setup
[[setup]]
name = "optional-descriptive-name"
path = "dir/subdir/file.txt"
content = "File content here"
# Script setup
[[setup]]
name = "optional-descriptive-name"
command = "echo 'Hello' > greeting.txt && mkdir -p output"
Run skill-bench help <type> for detailed help on any check type.
- skill-loaded: Skill was loaded
- skill-invoked: Skill was invoked
[[checks]]
name = "check-name"
command = { command = "skill-invoked", skill = "my-skill" }
- mcp-loaded: MCP server was loaded
- mcp-tool-invoked: MCP tool was invoked
- mcp-success: MCP tool succeeded
[[checks]]
name = "check-name"
command = { command = "mcp-loaded", server = "filesystem" }
- tool-use: Tool was called (partial match)
- tool-param: Tool was called with a specific parameter
[[checks]]
name = "check-name"
command = { command = "tool-use", tool = "Read" }
[[checks]]
name = "check-param"
command = { command = "tool-param", tool = "Read", param = "file_path", value = "test.txt" }
- workspace-file: File exists in workspace
- workspace-dir: Directory exists in workspace
- file-contains: File contains string
[[checks]]
name = "check-name"
command = { command = "workspace-file", path = "output.txt" }
[[checks]]
name = "check-content"
command = { command = "file-contains", file = "output.txt", contains = "expected text" }
- log-contains: Log contains regex pattern
- message-contains: Assistant output contains text
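The difference between a regex check and a text check can be illustrated with the Python sketch below. It assumes that message-contains is a literal substring test, which the description suggests but does not state outright. Both function names are hypothetical stand-ins, and Python's regex syntax may differ in places from the one skill-bench uses.

```python
import re


def log_contains(log_text: str, pattern: str) -> bool:
    """Regex search anywhere in the log (stand-in for log-contains)."""
    return re.search(pattern, log_text) is not None


def message_contains(message: str, text: str) -> bool:
    """Literal substring test (assumed stand-in for message-contains)."""
    return text in message
```

Under these assumptions the pattern "error|failed" matches a log line containing either word, whereas the same string passed as text only matches output that literally contains "error|failed".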
[[checks]]
name = "check-name"
command = { command = "log-contains", pattern = "error|failed" }
[[checks]]
name = "check-output"
command = { command = "message-contains", text = "expected output" }
- db-query: SQL query result verification
  - Numeric comparisons: ">0", ">=5", "=10", "<3", "<=2"
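One way to read the numeric comparison strings is as an operator followed by a number that the query result is compared against. The Python sketch below illustrates that reading; `matches_expected` is a hypothetical name, and the accepted grammar (optional whitespace, decimals, negative numbers) is an assumption rather than documented behaviour.

```python
import operator
import re

# Operators taken from the list above.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
}
# Two-character operators come first so ">=" is not read as ">".
_EXPECTED = re.compile(r"^(>=|<=|>|<|=)\s*(-?\d+(?:\.\d+)?)$")


def matches_expected(actual: float, expected: str) -> bool:
    """Compare a query result against an expression such as ">0"."""
    m = _EXPECTED.match(expected.strip())
    if m is None:
        raise ValueError(f"unsupported expected expression: {expected!r}")
    op, number = m.groups()
    return _OPS[op](actual, float(number))
```

For example, a `SELECT COUNT(*)` result of 3 satisfies ">0" under this sketch, while a result of 0 does not.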
[[checks]]
name = "check-name"
command = { command = "db-query", db = "patents.db", query = "SELECT COUNT(*) FROM patents", expected = ">0" }
Use deny = true on any check to invert the assertion:
[[checks]]
name = "should-not-contain-error"
command = { command = "file-contains", file = "output.txt", contains = "error" }
deny = true
Failed test history is saved to .skill-bench/test-history.json. Use --rerun-failed to re-run only tests that failed in the previous run.
MIT