Description
The CLI today has `validate` (partial), `run` (stub), and `version` (stub). 0.2.0 ships a functional surface that wires all 0.2.0 modules together.

Four commands:

- `agentanvil validate <contract.yaml>` — runs the static analyzer (#11); exits non-zero on fatal diagnostics.
- `agentanvil run --contract X.yaml --agent-path ./agent/ [--backend {direct,agentloom,mock}] [--runner {subprocess,docker}] [--record PATH] [--seed N] [--out-dir PATH]` — end-to-end: analyze the agent, generate scenarios, run, evaluate, report.
- `agentanvil replay --recording PATH [--contract X.yaml] [--out-dir PATH]` — reproduce a recorded run with `MockBackend`.
- `agentanvil version` — prints the AgentAnvil version, git commit hash, and installed optional extras.
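For orientation, a typical session might look like this (paths and flag values are illustrative):

```console
$ agentanvil validate examples/contract.yaml
$ agentanvil run --contract examples/contract.yaml --agent-path ./agent/ \
    --backend direct --runner subprocess --seed 42 --out-dir ./agentanvil-results
$ agentanvil replay --recording ./runs/recording.json --out-dir ./agentanvil-replay
$ agentanvil version
```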
Proposal
1. `run` command skeleton:

```python
# src/agentanvil/cli/main.py
@app.command()
def run(
    contract: Path,
    agent_path: Path,
    backend: str = "direct",
    runner: str = "subprocess",
    record: Path | None = None,
    seed: int = 42,
    out_dir: Path = Path("./agentanvil-results"),
    max_scenarios: int = 10,
):
    # 1. Load + validate contract.
    c = AgentContract.from_yaml(contract.read_text())
    diagnostics = analyze(c)
    if diagnostics.has_fatal:
        _print_diagnostics(diagnostics)
        raise typer.Exit(1)

    # 2. Analyze agent.
    profile = analyze_agent(agent_path)

    # 3. Generate scenarios (seeded for reproducibility, split across categories).
    scenarios = generate(
        c, config=GeneratorConfig(seed=seed, max_per_category=max_scenarios // 3)
    )

    # 4. Resolve backend + runner.
    be = _resolve_backend(backend, record=record)
    rn = _resolve_runner(runner)

    # 5. Execute each scenario. Runner.run is async; this sync Typer command
    #    bridges into it with asyncio.run.
    records = []
    for scenario in scenarios:
        result = asyncio.run(
            rn.run(
                agent_path=agent_path,
                scenario_json=scenario.model_dump_json(),
                timeout_ms=c.constraints.max_latency_ms or 60000,
            )
        )
        trace = _parse_trace(result)
        score = _evaluate(c, trace, be)
        records.append(RunRecord(...))

    # 6. Write reports.
    out_dir.mkdir(parents=True, exist_ok=True)
    for rec in records:
        (out_dir / f"{rec.run_id}.json").write_text(render(rec, ReportFormat.JSON))
        (out_dir / f"{rec.run_id}.html").write_text(render(rec, ReportFormat.HTML))
```
2. `replay` command:

```python
@app.command()
def replay(
    recording: Path,
    contract: Path | None = None,
    out_dir: Path = Path("./agentanvil-replay"),
):
    # 1. Load recording.
    rec_env = Recording.from_file(recording)
    # 2. Build MockBackend over the loaded recording.
    mock = MockBackend(rec_env)
    # 3. Replay every entry through the same pipeline.
    # 4. Assert byte-for-byte equality with expected output (if present).
```
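Steps 3-4 could flesh out roughly as below, continuing the function body above (the `Recording` entry shape, i.e. `entries`, `result`, `expected`, and `run_id`, is an assumption):

```python
    # Sketch of steps 3-4; Recording's entry attributes are assumed, not final.
    c = AgentContract.from_yaml(contract.read_text()) if contract else rec_env.contract
    out_dir.mkdir(parents=True, exist_ok=True)
    mismatches = 0
    for entry in rec_env.entries:
        trace = _parse_trace(entry.result)
        score = _evaluate(c, trace, mock)  # mock serves the recorded LLM responses
        replayed = render(RunRecord(...), ReportFormat.JSON)
        (out_dir / f"{entry.run_id}.json").write_text(replayed)
        if entry.expected is not None and replayed != entry.expected:
            typer.echo(f"replay mismatch for run {entry.run_id}", err=True)
            mismatches += 1
    if mismatches:
        raise typer.Exit(1)
```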
3. `version` command:

```python
@app.command()
def version():
    import sys
    import importlib.metadata as md

    print(f"agentanvil {md.version('agentanvil')}")
    print(f"python {sys.version.split()[0]}")
    for extra in ("agentloom", "docker", "viz", "stats", "replication", "security", "cicd"):
        try:
            md.distribution(_extra_pkg(extra))
            print(f"  [{extra}] installed")
        except md.PackageNotFoundError:
            pass
```
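`_extra_pkg` maps an extra's name to a distribution whose presence signals that the extra is installed. A plausible sketch; the real table has to mirror `[project.optional-dependencies]` in `pyproject.toml`, so the mapping below is an assumption:

```python
# Hypothetical extra -> marker-distribution mapping; keep in sync with
# [project.optional-dependencies] in pyproject.toml.
_EXTRA_DISTS: dict[str, str] = {
    "agentloom": "agentloom",
    "docker": "docker",
    # ... remaining extras map to their marker distributions ...
}


def _extra_pkg(extra: str) -> str:
    # Fall back to the extra's own name when no override is listed.
    return _EXTRA_DISTS.get(extra, extra)
```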
Scope
- `src/agentanvil/cli/main.py` — fill out `run` and `replay`, improve `version`.
- `src/agentanvil/cli/_helpers.py` — new; backend/runner resolution.
- `tests/cli/test_run.py` — end-to-end with `MockBackend`.
- `tests/cli/test_replay.py`
- `tests/cli/test_validate.py`
- `tests/cli/test_version.py`
Regression tests
- `test_cli_validate_on_fixture_contract_exits_zero`
- `test_cli_validate_on_contradictory_contract_exits_nonzero`
- `test_cli_run_produces_json_and_html_reports`
- `test_cli_run_respects_seed_for_scenario_generation`
- `test_cli_run_honours_timeout_from_contract`
- `test_cli_replay_byte_for_byte_identical_to_record`
- `test_cli_replay_fails_clearly_on_missing_recording_entry`
- `test_cli_version_prints_installed_extras`
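For shape, the version test could drive the app through Typer's `CliRunner` (a sketch; it asserts only the printed prefix):

```python
# tests/cli/test_version.py (sketch)
from typer.testing import CliRunner

from agentanvil.cli.main import app

runner = CliRunner()


def test_cli_version_prints_installed_extras():
    result = runner.invoke(app, ["version"])
    assert result.exit_code == 0
    assert result.output.startswith("agentanvil ")
```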
Notes
- CLI uses Typer + Rich (already core deps).
- `--backend mock` accepts `--recording PATH` implicitly via the replay path.
- Depends on: #11 (add evaluator Layer 2 (single LLM-as-judge) with structured output), #12 (add reporter with JSON, HTML, Markdown outputs), #13 (add CLI), #14 (add quickstart LangChain example with end-to-end CI and 10-minute budget), #15 (add portability invariant and determinism CI jobs), #16 (scaffold mkdocs-material docs site), #17 (add first Tier A single-agent case study under examples/case-studies/), #18 (ship 0.2.0: single-agent MVP with LangChain quickstart under 10 minutes), #019 — all 0.2.0 modules converge here.
- Blocks: #021 (quickstart), #024 (first case study uses `agentanvil run`).