Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .devcontainer/post-create.sh
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,14 @@ browser_path = "/usr/bin/chromium"
chrome_args = ["--no-sandbox", "--disable-gpu"]
EOF

# Install skill-bench
if ! command -v skill-bench >/dev/null 2>&1; then
echo "[Devcontainer Setup] Installing skill-bench..."
curl -fsSL https://raw.githubusercontent.com/sonesuke/skill-bench/main/scripts/setup.sh | sh
else
echo "[Devcontainer Setup] skill-bench already installed: $(skill-bench --version 2>/dev/null || echo 'unknown')"
fi

echo "[Devcontainer Setup] Complete!"
else
echo "Running in CI environment, skipping development setup..."
Expand Down
62 changes: 16 additions & 46 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,62 +49,32 @@ mise.toml # Task definitions (fmt, clippy, test, pre-commit)
| `mise run test` | Run tests with `cargo test` |
| `mise run pre-commit` | Run all of the above |
| `mise run coverage` | Measure code coverage (including subprocesses) |
| `mise run skill-test` | Run all skill-bench tests |

## Skill-Bench Testing Framework

Located in `agents/skill-bench/`, this framework tests the Claude Code Plugin skills.
Test cases are in `tests/`.

### Structure

```
agents/skill-bench/
runner.sh # Test runner
cases/ # Test case definitions (TOML format)
arxiv-search/
triggering.toml
functional.toml
functional-with-limit.toml
arxiv-fetch/
triggering.toml
functional.toml
tools/ # Check scripts
check-mcp-loaded.sh
check-mcp-success.sh
check-skill-invoked.sh
check-skill-loaded.sh
check-param.sh
check-workspace.sh
```

### Test Cases

Each test case is defined in TOML format:
Requires [skill-bench](https://github.com/sonesuke/skill-bench) (set up via post-create script).

```toml
name = "test-name"
description = "Test description"
check = "check-script-name"
timeout = 120

[test_prompt]
text = "The prompt that should trigger the skill"
test_prompt = """
English prompt that should trigger the skill
"""

[[tool_calls]]
name = "tool_name"
arguments = { param = "value" }
```

### Running Tests

```bash
# Run all tests
cd agents/skill-bench
./runner.sh
[[checks]]
name = "check-name"
command = { command = "mcp-success", tool = "tool_name" }

# Run specific skill tests
./runner.sh "arxiv-search"
./runner.sh "arxiv-fetch"

# Run multiple trials
./runner.sh "*" trials=3
[[checks]]
name = "param-check"
command = { command = "tool-param", tool = "tool_name", param = "limit", value = "20" }
```

Available check types: `mcp-success`, `mcp-tool-invoked`, `mcp-loaded`, `tool-use`, `tool-param`, `skill-invoked`, `skill-loaded`, `workspace-file`, `workspace-dir`, `file-contains`, `log-contains`, `message-contains`, `db-query`.

**Note:** Test prompts must be in English to ensure consistent skill triggering.
9 changes: 0 additions & 9 deletions agents/skill-bench/cases/arxiv-fetch/functional.toml

This file was deleted.

9 changes: 0 additions & 9 deletions agents/skill-bench/cases/arxiv-fetch/triggering.toml

This file was deleted.

This file was deleted.

9 changes: 0 additions & 9 deletions agents/skill-bench/cases/arxiv-search/functional.toml

This file was deleted.

9 changes: 0 additions & 9 deletions agents/skill-bench/cases/arxiv-search/triggering.toml

This file was deleted.

199 changes: 0 additions & 199 deletions agents/skill-bench/runner.sh

This file was deleted.

13 changes: 0 additions & 13 deletions agents/skill-bench/tools/check-mcp-loaded.sh

This file was deleted.

16 changes: 0 additions & 16 deletions agents/skill-bench/tools/check-mcp-success.sh

This file was deleted.

Loading
Loading