Agent-shape testing harness. Runs runtime-in-the-loop task batteries against a tool's CLI to measure first-try command success, tokens per task, turns to completion, and invented-command count.
Each subject tool ships an `agent-shape.toml` declaring a fixture, a
battery of tasks (tuning + holdout), success criteria, and an LLM
judge rubric. jig spawns the agent runtime against the fixture,
records transcripts, scores them with an LLM-as-judge, and emits a
report. The runtime today is `claude -p`; the framework itself is
runtime-agnostic and ready for other agents (GPT, Gemini, local
models) once the runner accepts a configurable spawn command.
Agents reach for commands and arguments that tools do not always
provide. When an agent invents a non-existent command or falls back
to a generic shell tool, that is a signal about the surface, not the
agent. jig measures where that happens so the tool can be reshaped.
```sh
cargo install nomograph-jig
```

Or from source:

```sh
git clone https://gitlab.com/nomograph/jig.git
cd jig && make build
```

Requires Rust 1.88+ and `claude` on `$PATH` for any command that
spawns an agent (`run`, `rejudge`). `check`, `render`, and `compare`
are pure offline operations.
```sh
# 1. Drop a starter agent-shape.toml into your tool's repo.
cp /path/to/jig/templates/agent-shape.toml ./agent-shape.toml
# Fill in the REPLACE-ME markers and add tasks.

# 2. Validate against your binary's --help so the rubric and the CLI
# agree about which subcommands exist.
jig check agent-shape.toml --binary $(which your-tool)

# 3. Run a small smoke battery (writes a Markdown report to stdout).
jig run agent-shape.toml --tuning-only --n 3

# 4. Run a real baseline. Checkpoint so a killed run resumes.
jig run agent-shape.toml --n 10 \
  --output baseline.json --format json \
  --checkpoint baseline.checkpoint.jsonl

# 5. Render the JSON as Markdown without re-spending API tokens.
jig render baseline.json --output baseline.md

# 6. After a treatment, compare the two reports.
jig compare baseline.json treated.json --output delta.md
```

| Command | What it does |
|---|---|
| `jig run [path]` | Spawn the agent against the fixture, score every trial, emit a report. Supports `--tuning-only`, `--holdout-only`, `--n`, `--judge-model`, `--subject`, `--output`, `--format {json,markdown}`, `--checkpoint`. |
| `jig check [path] [--binary <bin>]` | Parse the TOML and (optionally) cross-reference `[commands].top_level` with `<binary> --help`. Reports drift in either direction. |
| `jig render <json>` | Re-emit a previously saved JSON report as Markdown. No API calls. |
| `jig compare <before.json> <after.json>` | Per-cell delta table (mean score, completion rate, tokens, turns, invented commands). No API calls. |
| `jig rejudge <toml> --from <ckpt> --to <ckpt>` | Re-score the trial transcripts in a checkpoint against an updated rubric. Costs judge tokens, not agent tokens. Supports resume. |
`jig --help` and `jig <subcommand> --help` are the authoritative
reference; this table tracks the surface as of v0.1.0.
The `agent-shape.toml` declares everything the harness needs:

- `[subject]`: tool name, binary, description, optional `version_pin` for retrospective runs against tagged versions.
- `[fixture]`: idempotent setup script, optional cleanup, and the working directory the agent operates in. Setup runs before every trial so state is isolated.
- `[run]`: trials per cell (`n`), agent models under test, turn cap, per-trial wall-clock timeout.
- `[judge]`: judge model (default Haiku 4.5), `double_score` for IRR, rubric prose, required JSON fields.
- `[tasks.tuning]` and `[tasks.holdout]`: task IDs, prompts, success criteria, and provenance (`author`, `created_at`, `sealed_against_tag`).
- `[commands].top_level` (optional): subcommands the rubric claims exist; `jig check --binary` cross-references this with the CLI.
`examples/agent-shape.example.toml` is a worked example;
`templates/agent-shape.toml` is the starter for new adopters. The
schema lives in `src/schema.rs`.
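
For orientation, a minimal config might look like the sketch below. The section names follow the list above, but every key name and value is a hypothetical placeholder; `src/schema.rs` and the shipped example are authoritative.

```toml
# Sketch only. Section names come from the schema list above; the key
# names and values are hypothetical placeholders, not the real schema
# (see src/schema.rs and examples/agent-shape.example.toml).

[subject]
name = "your-tool"
binary = "your-tool"
description = "One-line description of the tool under test."

[fixture]
setup = "fixtures/setup.sh"     # idempotent; runs before every trial
workdir = "fixtures/sandbox"

[run]
n = 5                           # trials per cell
models = ["claude-sonnet"]      # agent models under test
max_turns = 20
timeout_secs = 300

[judge]
model = "haiku-4.5"             # the documented default judge
double_score = true             # score twice for IRR

[[tasks.tuning]]
id = "smoke-list"
prompt = "List everything the tool currently tracks."
success = "Output names every tracked item."

[commands]
top_level = ["list", "add", "remove"]
```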
- Rubric drift is the dominant source of measurement error. When a rubric misses real commands, the judge counts them as inventions and produces phantom regressions. `jig check --binary` mechanically catches the binary side; rubric prose still has to be hand-maintained. Land subcommand changes and rubric updates in the same commit.
- Judge variance: typical IRR delta is 0.05 to 0.30 per cell at n=5. Don't draw conclusions from sub-0.20 effects without n >= 20 or Cliff's delta significance testing.
- Fixture leakage: subject tools that read environment variables for session or run identity should strip those variables in their fixture scripts so trials start from a clean slate (see the sketch after this list).
- Holdout corpus: tuning-only studies overfit to the designer's tasks. The schema supports a `tasks.holdout` battery so independent authors can write tasks against the same surface without seeing the tuning set.
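
To illustrate the fixture-leakage point, a setup script might scrub identity variables before seeding state. This is a sketch with hypothetical variable and path names; substitute whatever your tool actually reads.

```sh
#!/usr/bin/env sh
# Hypothetical fixture setup script. YOUR_TOOL_SESSION_ID, YOUR_TOOL_RUN_ID,
# and the sandbox/seed paths are placeholders, not names jig defines.
set -eu

# Strip session/run identity so every trial starts from a clean slate.
unset YOUR_TOOL_SESSION_ID YOUR_TOOL_RUN_ID

# Idempotent: rebuild the working directory from seed state each trial.
rm -rf sandbox
mkdir -p sandbox
cp -R seed/. sandbox/
```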
```sh
make build   # release binary, copied to ./jig
make test    # build + run all tests
make lint    # cargo clippy --all-targets -- -D warnings
make fmt     # cargo fmt
make check   # build + smoke-test --help and check on the example
```

`nomograph-jig` is a library crate as well as a binary. The
`runner`, `judge`, `report`, `schema`, and `checkpoint` modules are
public so callers can drive the harness programmatically without
shelling out to the CLI.
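
A minimal sketch of programmatic use. The module names come from the list above, but every function and type below is a hypothetical stand-in for the crate's actual API; consult the crate docs for the real signatures.

```rust
// Sketch only: runner/report/schema are public modules named in this
// README, but each call below is a hypothetical stand-in, not the
// crate's real API.
use nomograph_jig::{report, runner, schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse and validate the config, as `jig check` would.
    let shape = schema::load("agent-shape.toml")?;

    // Run the battery in-process instead of shelling out to `jig run`.
    let results = runner::run_battery(&shape)?;

    // Emit the same Markdown a `jig render` call would produce.
    println!("{}", report::to_markdown(&results)?);
    Ok(())
}
```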
MIT. See LICENSE.
Part of Nomograph Labs.