CLI Agent Spec

Your CLI tool works perfectly for humans. For AI agents, it silently hangs, corrupts data, leaks secrets, and exhausts context windows — and you would never know.

This is a specification for building CLI tools that AI agents can call reliably: 66 documented failure modes, 135 requirements to eliminate them, and machine-readable schemas an agent can consume directly.

No existing CLI framework covers more than 58% of these challenges.

What's going wrong right now

AI agents call CLI tools constantly — to deploy infrastructure, query APIs, manage files, run pipelines. Most tools were never designed for this. Here is what agents actually encounter:

# Agent calls a list command. The tool pages output and waits for keypress.
# The agent never receives a response. The pipeline stalls. Forever.
$ git log   # on a TTY, opens less and waits for input

# Agent deploys to staging. The command times out at 30s, returns exit 1.
# exit 1 means "error" — but does it mean "nothing happened" or "half-deployed"?
# The agent retries. Now it's deployed twice.
$ deploy --env staging   # exit 1 — but why? safe to retry?

# Agent reads a list of users. One username contains an emoji.
# The JSON serializer crashes on non-ASCII. The agent gets no output, no error.
$ tool users list   # silent failure on emoji in username

# Agent passes a flag after the subcommand — natural LLM ordering.
# The parser silently treats --output as a positional argument value.
# The agent receives plain text it can't parse. Exit code: 0.
$ tool list users --output json   # parsed as: list "users" "--output" "json"

These are not edge cases. They are the default behavior of most CLI tools today — including tools from major companies. The cost falls on the agent: wasted tokens, stalled pipelines, data corruption from blind retries, cascading failures with no root cause.
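The first failure above (a pager waiting for a keypress) can be avoided on the tool side by paging only when stdout is an interactive terminal. The following is a minimal sketch, not part of the spec: the function name `should_page` and the `NO_PAGER` opt-out variable are assumptions made for illustration.

```python
import os
import sys


def should_page(stream=None, environ=None) -> bool:
    """Return True only when paging cannot block a non-interactive caller."""
    stream = sys.stdout if stream is None else stream
    environ = os.environ if environ is None else environ
    # NO_PAGER is a hypothetical opt-out variable, assumed for this sketch.
    if environ.get("NO_PAGER"):
        return False
    # Pipes and captured output are not TTYs, so an agent that captures
    # stdout is never handed a pager that waits for a keypress.
    return bool(getattr(stream, "isatty", lambda: False)())
```

A command would call `should_page()` before spawning a pager and write directly to stdout otherwise.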

What this spec defines

66 failure modes, each documented with severity, frequency, detectability, token cost, time cost, and context cost from the agent's perspective. They are grouped into 7 parts: ecosystem/runtime, execution, security, output, environment, errors, and observability.

135 requirements across 3 tiers:

| Tier | Count | Who implements it |
| --- | --- | --- |
| F: Framework-Automatic | 67 | The framework enforces it; command authors get it for free |
| C: Command Contract | 27 | Command authors declare it at registration |
| O: Opt-In | 41 | Applications enable it explicitly |

4 JSON schemas — machine-readable type definitions for exit codes, response envelopes, tool manifests, and error details. Generate typed structs for your language directly from the schemas.

A comparison matrix — 12 existing frameworks (argparse, Click, Cobra, Clap, Typer, Commander.js, and more) scored against all 66 challenges. No framework exceeds 58%.

The three contracts that matter most

Exit codes: 14 named codes (0–13) with machine-readable guarantees per code: retryable: true/false, side_effects: "none" | "partial" | "complete". An agent receiving exit 11 (CONFLICT) knows the operation is safe to retry. An agent receiving exit 6 (PARTIAL_FAILURE) knows it must inspect state before retrying. See exit-code.json.
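To make the retry decision concrete, here is a minimal agent-side sketch. Only codes 6 and 11 and the `retryable` / `side_effects` fields come from the text above. The entry for code 0, the specific field values assigned to each code, and the `next_action` helper are assumptions for illustration; the authoritative table is exit-code.json.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ExitCodeInfo:
    name: str
    retryable: bool
    side_effects: str  # "none" | "partial" | "complete"


# Assumed values for illustration; consult exit-code.json for the real ones.
EXIT_CODES: Dict[int, ExitCodeInfo] = {
    0: ExitCodeInfo("SUCCESS", retryable=False, side_effects="complete"),
    6: ExitCodeInfo("PARTIAL_FAILURE", retryable=False, side_effects="partial"),
    11: ExitCodeInfo("CONFLICT", retryable=True, side_effects="none"),
}


def next_action(exit_code: int) -> str:
    """Map an exit code to 'done', 'retry', or 'inspect'."""
    info = EXIT_CODES.get(exit_code)
    if info is None:
        return "inspect"  # unknown code: never retry blindly
    if exit_code == 0:
        return "done"
    # Retry only when the code guarantees nothing was changed.
    if info.retryable and info.side_effects == "none":
        return "retry"
    return "inspect"
```

The point of the design is that the agent branches on declared guarantees instead of guessing what a bare `exit 1` meant.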

Response envelope — every command wraps its output in { ok, data, error, warnings, meta }. The same keys are always present. Agents never parse free-text to determine success or failure. See response-envelope.json.
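A minimal sketch of producing such an envelope follows. The five top-level keys are taken from the text above; the helper name `make_envelope`, the shape of the `error` object, and the choice of an empty list and empty object as defaults are assumptions, since response-envelope.json is not reproduced here.

```python
import json
from typing import Any, Dict, List, Optional


def make_envelope(
    data: Any = None,
    error: Optional[Dict[str, Any]] = None,
    warnings: Optional[List[str]] = None,
    meta: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize a result so that all five keys are always present."""
    envelope = {
        "ok": error is None,
        "data": data,
        "error": error,  # hypothetical shape, e.g. {"code": 6, "message": "..."}
        "warnings": list(warnings or []),
        "meta": dict(meta or {}),
    }
    # ensure_ascii=True escapes non-ASCII characters (such as emoji), so the
    # output survives stdout encodings that cannot represent them.
    return json.dumps(envelope, ensure_ascii=True)
```

For example, `make_envelope(data={"users": ["zoë🙂"]})` yields pure-ASCII JSON that round-trips to the original username, which addresses the emoji failure described earlier.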

Tool manifest — tool manifest --output json returns the complete command tree: every subcommand, flag, type, description, exit code map, and example. One call replaces O(N) --help iterations and eliminates trial-and-error argument discovery. See manifest-response.json.
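As an illustration of why one manifest call can replace repeated `--help` calls, the sketch below flattens a command tree into a lookup of command paths and their flags. The manifest shape used here (`name`, `flags`, `commands`) is hypothetical and chosen only for the example; the real field names are defined in manifest-response.json.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical manifest shape, assumed for illustration only.
MANIFEST: Dict[str, Any] = {
    "name": "tool",
    "commands": [
        {
            "name": "list",
            "flags": [{"name": "--output", "type": "string"}],
            "commands": [{"name": "users", "flags": [], "commands": []}],
        },
        {
            "name": "manifest",
            "flags": [{"name": "--output", "type": "string"}],
            "commands": [],
        },
    ],
}


def flatten_commands(
    node: Dict[str, Any], prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (command path, flag names) for every node in the tree."""
    path = prefix + (node["name"],)
    yield " ".join(path), [flag["name"] for flag in node.get("flags", [])]
    for child in node.get("commands", []):
        yield from flatten_commands(child, path)


# One parse of the manifest gives the agent the whole command surface.
COMMAND_INDEX = dict(flatten_commands(MANIFEST))
```

With an index like this, an agent can check that `--output` is a declared flag of `tool list` before building the invocation, instead of discovering it by trial and error.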

What's in this repo

| Path | Contents |
| --- | --- |
| `challenges/` | 66 failure modes, each with problem, impact, solutions, 0–3 evaluation rubric, and agent workaround |
| `requirements/` | 135 requirements with acceptance criteria, wire format, and examples |
| `schemas/` | JSON Schema draft-07 definitions for all 4 types |
| `IMPLEMENTING.md` | Implementation guide: wave-based order, goal-based paths, invariants, codegen |
| `comparison-matrix.md` | 66 challenges × 12 frameworks coverage table |
| `research/` | Per-framework analysis and competitive landscape (MCP, OpenAPI, function calling) |
| `skills/` | Agent skills for evaluating CLIs and guiding implementation |

Start here

I want to understand the problem → challenges/index.md — browse by severity. Start with §10 (interactive blocking), §43 (output size), §50 (stdin deadlock), §62 (editor trap).

I want to implement this in my framework → IMPLEMENTING.md — wave-based implementation order, or pick a goal-based path:

Fewer agent retries — 15 requirements
Less context consumed — 14 requirements
Less token spend — 12 requirements

I want to evaluate my existing CLI → use the agent skills below, or read challenges/checklist.md for a self-assessment.

I want to add a challenge or requirement → AGENTS.md

Agent skills

Three installable skills for Agent Skills-compatible agents (Claude Code, Cursor, Gemini CLI, Copilot, and others):

| Skill | Purpose |
| --- | --- |
| `cli-agent-onboard` | Profile a CLI tool once: detects runtime, binary, flags, timeout method |
| `cli-agent-evaluate` | Score a CLI against a single challenge (0–3), with applicable agent workaround |
| `cli-agent-implement` | Guide implementing the spec in a CLI framework, tier by tier |

# Install (run inside your agent)
npx skills install romamo/cli-agent-spec/skills/cli-agent-onboard
npx skills install romamo/cli-agent-spec/skills/cli-agent-evaluate
npx skills install romamo/cli-agent-spec/skills/cli-agent-implement

Contributing

The spec is a living document. New challenges are added when a failure mode is confirmed against real tooling. New requirements follow from new challenges.

Before contributing, read AGENTS.md for conventions: file format, required sections, naming rules, and how to run /validate-links to verify cross-references after any edit.

CLI Agent Spec v1.5 — 66 challenges · 135 requirements · 4 schemas · 12 frameworks evaluated
