Agent-shape testing harness. Runs runtime-in-the-loop task batteries against a tool's CLI to measure first-try command success, tokens per task, turns to completion, and invented-command count.
Each subject tool ships an `agent-shape.toml` declaring a fixture, a
battery of tasks (tuning + holdout), success criteria, and an LLM
judge rubric. jig spawns the agent runtime against the fixture,
records transcripts, scores them with an LLM-as-judge, and emits a
report. The runtime today is `claude -p`; the framework itself is
runtime-agnostic and ready for other agents (GPT, Gemini, local
models) once the runner accepts a configurable spawn command.
Agents reach for commands and arguments that tools do not always
provide. When an agent invents a non-existent command or falls back
to a generic shell tool, that is a signal about the surface, not the
agent. jig measures where that happens so the tool can be reshaped.
```sh
cargo install nomograph-jig
```

Or from source:

```sh
git clone https://gitlab.com/nomograph/jig.git
cd jig && make build
```

Requires Rust 1.88+ and `claude` on `$PATH` for any command that
spawns an agent (`run`, `rejudge`). `check`, `render`, and `compare`
are pure offline operations.
```sh
# 1. Drop a starter agent-shape.toml into your tool's repo.
cp /path/to/jig/templates/agent-shape.toml ./agent-shape.toml
# Fill in the REPLACE-ME markers and add tasks.

# 2. Validate against your binary's --help so the rubric and the CLI
# agree about which subcommands exist.
jig check agent-shape.toml --binary $(which your-tool)

# 3. Run a small smoke battery (writes a Markdown report to stdout).
jig run agent-shape.toml --tuning-only --n 3

# 4. Run a real baseline. Checkpoint so a killed run resumes.
jig run agent-shape.toml --n 10 \
  --output baseline.json --format json \
  --checkpoint baseline.checkpoint.jsonl

# 5. Render the JSON as Markdown without re-spending API tokens.
jig render baseline.json --output baseline.md

# 6. After a treatment, compare the two reports.
jig compare baseline.json treated.json --output delta.md
```

| Command | What it does |
|---|---|
| `jig run [path]` | Spawn the agent against the fixture, score every trial, emit a report. Supports `--tuning-only`, `--holdout-only`, `--n`, `--judge-model`, `--subject`, `--output`, `--format {json,markdown}`, `--checkpoint`. |
| `jig check [path] [--binary <bin>]` | Parse the TOML and (optionally) cross-reference `[commands].top_level` with `<binary> --help`. Reports drift in either direction. |
| `jig render <json>` | Re-emit a previously saved JSON report as Markdown. No API calls. |
| `jig compare <before.json> <after.json>` | Per-cell delta table (mean score, completion rate, tokens, turns, invented commands). No API calls. |
| `jig rejudge <toml> --from <ckpt> --to <ckpt>` | Re-score the trial transcripts in a checkpoint against an updated rubric. Costs judge tokens, not agent tokens. Supports resume. |
`jig --help` and `jig <subcommand> --help` are the authoritative
reference; this table tracks the surface as of v0.1.0.
The `agent-shape.toml` declares everything the harness needs:

- `[subject]`: tool name, binary, description, optional `version_pin` for retrospective runs against tagged versions.
- `[fixture]`: idempotent setup script, optional cleanup, and the working directory the agent operates in. Setup runs before every trial so state is isolated.
- `[run]`: trials per cell (`n`), agent models under test, turn cap, per-trial wall-clock timeout.
- `[judge]`: judge model (default Haiku 4.5), `double_score` for IRR, rubric prose, required JSON fields.
- `[tasks.tuning]` and `[tasks.holdout]`: task IDs, prompts, success criteria, and provenance (`author`, `created_at`, `sealed_against_tag`).
- `[commands].top_level` (optional): subcommands the rubric claims exist; `jig check --binary` cross-references this with the CLI.
`examples/agent-shape.example.toml` is a worked example;
`templates/agent-shape.toml` is the starter for new adopters. The
schema lives in `src/schema.rs`.
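
For orientation, a minimal config might look like the sketch below. The section names follow the list above, but every key name and value is a hypothetical placeholder; `src/schema.rs` and the shipped example are authoritative.

```toml
# Sketch only. Section names come from the schema list above; the key
# names and values are hypothetical placeholders, not the real schema
# (see src/schema.rs and examples/agent-shape.example.toml).

[subject]
name = "your-tool"
binary = "your-tool"
description = "One-line description of the tool under test."

[fixture]
setup = "fixtures/setup.sh"     # idempotent; runs before every trial
workdir = "fixtures/sandbox"

[run]
n = 5                           # trials per cell
models = ["claude-sonnet"]      # agent models under test
max_turns = 20
timeout_secs = 300

[judge]
model = "haiku-4.5"             # the documented default judge
double_score = true             # score twice for IRR

[[tasks.tuning]]
id = "smoke-list"
prompt = "List everything the tool currently tracks."
success = "Output names every tracked item."

[commands]
top_level = ["list", "add", "remove"]
```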
- Rubric drift is the dominant source of measurement error. When a rubric misses real commands, the judge counts them as inventions and produces phantom regressions. `jig check --binary` mechanically catches the binary side; rubric prose still has to be hand-maintained. Land subcommand changes and rubric updates in the same commit.
- Judge variance: typical IRR delta is 0.05 to 0.30 per cell at n=5. Don't draw conclusions from sub-0.20 effects without n >= 20 or Cliff's delta significance testing.
- Fixture leakage: subject tools that read environment variables for session or run identity should strip those variables in their fixture scripts so trials start from a clean slate (see the sketch after this list).
- Holdout corpus: tuning-only studies overfit to the designer's tasks. The schema supports a `tasks.holdout` battery so independent authors can write tasks against the same surface without seeing the tuning set.
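
To illustrate the fixture-leakage point, a setup script might scrub identity variables before seeding state. This is a sketch with hypothetical variable and path names; substitute whatever your tool actually reads.

```sh
#!/usr/bin/env sh
# Hypothetical fixture setup script. YOUR_TOOL_SESSION_ID, YOUR_TOOL_RUN_ID,
# and the sandbox/seed paths are placeholders, not names jig defines.
set -eu

# Strip session/run identity so every trial starts from a clean slate.
unset YOUR_TOOL_SESSION_ID YOUR_TOOL_RUN_ID

# Idempotent: rebuild the working directory from seed state each trial.
rm -rf sandbox
mkdir -p sandbox
cp -R seed/. sandbox/
```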
```sh
make build   # release binary, copied to ./jig
make test    # build + run all tests
make lint    # cargo clippy --all-targets -- -D warnings
make fmt     # cargo fmt
make check   # build + smoke-test --help and check on the example
```

`nomograph-jig` is a library crate as well as a binary. The
`runner`, `judge`, `report`, `schema`, and `checkpoint` modules are
public so callers can drive the harness programmatically without
shelling out to the CLI.
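
A minimal sketch of programmatic use. The module names come from the list above, but every function and type below is a hypothetical stand-in for the crate's actual API; consult the crate docs for the real signatures.

```rust
// Sketch only: runner/report/schema are public modules named in this
// README, but each call below is a hypothetical stand-in, not the
// crate's real API.
use nomograph_jig::{report, runner, schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse and validate the config, as `jig check` would.
    let shape = schema::load("agent-shape.toml")?;

    // Run the battery in-process instead of shelling out to `jig run`.
    let results = runner::run_battery(&shape)?;

    // Emit the same Markdown a `jig render` call would produce.
    println!("{}", report::to_markdown(&results)?);
    Ok(())
}
```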
MIT. See LICENSE.
Part of Nomograph Labs.