Test your Claude Code harness the same way you test code — run a command, get a report, know what works.
Harness Test is a CLI testing framework for Claude Code harness components — CLAUDE.md, skills, hooks, subagents, and AGENTS.md.
It provides three testing layers with increasing depth and cost:
| Layer | What it does | Token cost |
|---|---|---|
| Static validation | Syntax, structure, broken references, rule conflicts | Zero |
| Behavioral testing | Runs prompts against Agent SDK, evaluates assertions | Per-test |
| Statistical evaluation | Multi-run consistency and variance analysis | Multi-run |

| Category | Technology |
|---|---|
| Language | Python 3.12+ |
| CLI | Typer |
| Data Models | Pydantic v2 |
| Terminal Output | Rich (tables, Live streaming) |
| YAML Parsing | ruamel.yaml (round-trip fidelity) |
| Test Executor | Claude Agent SDK |
| Package Manager | uv |
| Linter / Formatter | Ruff |
| Type Checker | Pyright |
| Tests | pytest + pytest-asyncio |
┌─────────────────────────────────────────────────────┐
│                     CLI (Typer)                     │
│                 init · run · report                 │
├──────────┬──────────┬──────────────┬────────────────┤
│ Scanner  │Validator │ Runner       │ Reporter       │
│          │          │              │                │
│ claude_md│ claude_md│ executor     │ formatter      │
│ skills   │ skills   │ (Agent SDK)  │ (Rich tables)  │
│ hooks    │ hooks    │ assertions   │                │
│ subagents│ subagents│ collector    │                │
│ agents_md│references│              │                │
│ memory   │ rules    │              │                │
├──────────┴──────────┴──────────────┴────────────────┤
│                   Pydantic Models                   │
│  component · config · test_spec · results · valid.  │
└─────────────────────────────────────────────────────┘
Data Flow:
init: scan_project() → detect_auth() → write_config()
static: load_config() → scan_project() → validate_all() → Rich table
behavior: load_config() → load_all_specs() → run_tests() → save_results() → Rich Live
report: load_results() → scan_project() → render_report() → Rich tables / JSON
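The static flow above (scan, then validate, at zero token cost) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Component` and `Finding` shapes, the function signatures, and the choice to treat relative Markdown links as the "references" being checked. The project itself uses Pydantic models; plain dataclasses are swapped in only to keep the sketch free of dependencies.

```python
import re
from dataclasses import dataclass
from pathlib import Path


# Hypothetical shapes; the real project passes Pydantic models instead.
@dataclass(frozen=True)
class Component:
    kind: str
    path: Path


@dataclass(frozen=True)
class Finding:
    path: Path
    message: str


# Captures the target of a Markdown link such as [guide](docs/guide.md).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def scan_project(root: Path) -> list[Component]:
    """Collect the top-level harness files this sketch knows about."""
    known = {"CLAUDE.md": "claude_md", "AGENTS.md": "agents_md"}
    return [
        Component(kind, root / name)
        for name, kind in known.items()
        if (root / name).is_file()
    ]


def validate_all(components: list[Component]) -> list[Finding]:
    """Report relative Markdown links whose target file does not exist."""
    findings: list[Finding] = []
    for comp in components:
        text = comp.path.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if "://" in target or target.startswith("#"):
                continue  # skip URLs and in-page anchors
            file_part = target.split("#", 1)[0]  # drop any fragment
            if not (comp.path.parent / file_part).exists():
                findings.append(Finding(comp.path, f"broken reference: {target}"))
    return findings


def run_static(root: Path) -> list[Finding]:
    """Compose the two stages, mirroring scan_project() -> validate_all()."""
    return validate_all(scan_project(root))
```

Calling `run_static(Path("."))` in a project root returns an empty list when every relative link resolves, which is the case a reporter would render as a clean table.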
src/harness_test/
cli.py # All CLI commands (init, run, report)
config.py # Auth detection + config management
spec_loader.py # YAML test spec loader
main.py # Typer app entry point
models/ # Pydantic models (component, config, test_spec, results, validation)
scanner/ # One scanner per harness type + scan_project() orchestrator
validator/ # One validator per concern + validate_all() orchestrator
runner/ # Agent SDK executor, assertions, result collector, run_tests()
reporter/ # Rich formatter + render_report()
tests/ # Mirrors src/ structure
.harness-test/ # Runtime config (config.yaml, test results)
- Python 3.12+
- uv package manager
- Claude Code subscription (for behavioral tests) or Anthropic API key
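Since behavioral tests accept either credential, `init` has to decide which one is in play. The following is only a sketch of one plausible rule, not the project's actual logic: it assumes that a non-empty `ANTHROPIC_API_KEY` environment variable wins and that a subscription login is the fallback. The `AuthMethod` type and the function signature are hypothetical.

```python
import os
from collections.abc import Mapping
from typing import Literal

# Hypothetical labels for the two auth routes named in the prerequisites.
AuthMethod = Literal["api_key", "subscription"]


def detect_auth(env: Mapping[str, str] = os.environ) -> AuthMethod:
    """Prefer an API key when one is set; otherwise assume a subscription."""
    if env.get("ANTHROPIC_API_KEY", "").strip():
        return "api_key"
    return "subscription"
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.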
Run directly without installing — like npx:
uvx harness-test init
uvx harness-test run --layer static
uvx harness-test report

uv tool install harness-test
harness-test init

Or with pip:
pip install harness-test
harness-test init

git clone https://github.com/hgflima/harness-test.git
cd harness-test
uv sync
uv run harness-test init

Discover harness components, detect auth method, and create config:
harness-test init

# Static validation (zero tokens)
harness-test run --layer static
# Behavioral tests via Agent SDK
harness-test run
# Re-run only failed tests
harness-test run --failed

# Rich terminal output
harness-test report
# JSON for CI pipelines
harness-test report --json

uv run pytest # Run test suite (218 tests)
uv run ruff check src/ tests/ # Lint
uv run ruff format src/ tests/ # Format

- Pydantic everywhere: all data crosses module boundaries as Pydantic models, never raw dicts
- Scanner/Validator symmetry: each harness component type has a matched scanner + validator pair
- ruamel.yaml over PyYAML: round-trip fidelity for YAML specs
- Executor isolation: each behavioral test runs in its own Agent SDK session
- Exit codes: `0` success, `1` failures/errors, `2` runtime/config errors
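The exit-code convention in the last item is what lets a CI job tell test failures apart from a broken setup. A minimal sketch of that mapping follows; the `ExitCode` enum and the `exit_code_for` helper are hypothetical names chosen for illustration, and only the three numeric values come from the list above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """The three documented exit codes (enum name is an assumption)."""

    SUCCESS = 0
    FAILURES = 1
    RUNTIME_ERROR = 2


def exit_code_for(failed: int, errored: int, runtime_error: bool = False) -> ExitCode:
    """Map a run summary onto an exit code.

    A runtime or config error takes precedence, because in that case the
    test counts cannot be trusted.
    """
    if runtime_error:
        return ExitCode.RUNTIME_ERROR
    if failed > 0 or errored > 0:
        return ExitCode.FAILURES
    return ExitCode.SUCCESS
```

A CLI entry point could pass the result to `sys.exit()`, since an `IntEnum` member behaves as a plain integer there.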
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Run tests and linting (`uv run pytest && uv run ruff check src/ tests/`)
- Commit your changes
- Open a Pull Request
- Claude Code by Anthropic
- Claude Agent SDK for behavioral test execution
- Typer for CLI ergonomics
- Rich for terminal output
- Pydantic for data validation
If this project helps you test your Claude Code harness, consider giving it a star.