@@ -20,6 +20,7 @@ Relevant leaves:
- **[test-environment-fixtures-and-ci-fail-closed-policy.md](test-environment-fixtures-and-ci-fail-closed-policy.md)** — How test posture suppresses side effects, how fixture replay works, and why missing recordings fail closed in CI.
- **[test-lane-coverage-map.md](test-lane-coverage-map.md)** — Which subsystem contracts are guarded by fast regression, integration, end-to-end, conformance, and compatibility lanes, without overclaiming the hidden runner layout.
- **[e2e-harness-reality-boundaries.md](e2e-harness-reality-boundaries.md)** — Which end-to-end harnesses may shorten setup but still need to preserve real permission, transport, auth-proxy, and credential-cache paths.
- **[released-cli-e2e-test-set.md](released-cli-e2e-test-set.md)** — Public-runtime end-to-end oracles gathered by exercising a shipped Claude CLI build against a real local workspace, plus the parity-critical cases a rebuild must not skip.
- **[test-seams-reset-hooks-and-injected-dependencies.md](test-seams-reset-hooks-and-injected-dependencies.md)** — The narrow seams the product uses to keep hard behaviors testable without turning the whole runtime into a debug harness.
- **[native-test-derived-asset-provenance-and-acceptance-rules.md](native-test-derived-asset-provenance-and-acceptance-rules.md)** — How native test knowledge should be normalized into clean-room contract assets and how those assets should be linked back to their owning domains.
- **[evidence-levels-and-missing-artifacts.md](evidence-levels-and-missing-artifacts.md)** — What this source snapshot proves, what it only strongly suggests, and which missing artifacts still block exact runner-level reproduction.
@@ -0,0 +1,200 @@
---
title: "Released CLI E2E Test Set"
owners: [bingran-you]
soft_links:
- /integrations/clients/structured-io-and-headless-session-loop.md
- /platform-services/workspace-trust-dialog-and-persistence.md
- /platform-services/session-cost-accounting-and-restoration.md
- /runtime-orchestration/sessions/resume-path.md
- /ui-and-experience/startup-and-onboarding/startup-welcome-dashboard-and-feed-rotation.md
- /ui-and-experience/dialogs-and-approvals/permission-prompt-shell-and-worker-states.md
- /ui-and-experience/dialogs-and-approvals/structured-diff-rendering-and-highlight-fallback.md
- /tools-and-permissions/permissions/e2e-permission-testing-contracts.md
- /product-surface/interaction-modes.md
---

# Released CLI E2E Test Set

Source-derived contracts are still the primary clean-room evidence for this tree, but they are not enough on their own for end-to-end rebuild work. A released CLI can be exercised directly, and its public runtime behavior becomes a second kind of oracle: not hidden implementation, but what a real user actually experiences.

This leaf captures that public-runtime oracle set from a local run of the shipped `claude` CLI on April 9, 2026. The observed build reported version `2.1.89`, authenticated successfully through a Foundry-backed account, and was exercised in both headless and interactive modes on a local macOS terminal.

## Why this leaf exists

A clean-room rebuild can easily pass its own local tests while still feeling unlike the real product at the edges that matter most:

- startup and trust gating
- headless envelope shape
- stream transport behavior
- permission prompts and remembered approvals
- session durability and cwd-based resume
- the rhythm of a real coding turn that reads, edits, reruns, and summarizes

Those are all externally visible contracts. They can and should become explicit E2E test targets.

## Evidence boundary

This leaf records only public behavior that was directly observed from the shipped CLI:

- public command-line flags and subcommands
- public terminal UI flows
- files the CLI itself wrote into the local user state directory
- session recovery behavior visible through subsequent CLI invocations

It should not become a transcript dump. Keep raw logs local, and normalize them here into assertions, scenario shapes, and failure modes.

## Mandatory scenario families

### 1. Discovery and health smoke lane

A rebuild should have a fast lane that exercises the released binary surface before any deeper coding workflow:

- `--help` and `--version` must succeed and expose the current command families
- auth health should be queryable without opening the TUI, including both machine-readable and human-readable status
- agent discovery should be externally visible from a top-level command, not only from inside an interactive session
- auto-mode or policy classification should expose an inspectable effective config, not just hidden defaults

The oracle is not one exact text block. The oracle is that these are real, scriptable health surfaces with stable exit behavior.
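
A smoke lane of this kind can be sketched as a small helper that runs a command with stdin closed and records only its exit behavior. This is a minimal sketch, not the rebuild's actual harness: `smoke_check` is a hypothetical name, and the example invocation uses the Python interpreter as a stand-in so the snippet runs without the CLI installed. A real lane would pass the released binary with `--help` or `--version`.

```python
import subprocess
import sys


def smoke_check(argv, timeout=30):
    """Run argv with stdin closed and summarize its exit behavior."""
    proc = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,  # never let the command wait on a terminal
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "argv": list(argv),
        "exit_code": proc.returncode,
        "has_output": bool((proc.stdout + proc.stderr).strip()),
    }


# Stand-in invocation; a real lane would pass e.g. ["claude", "--version"].
report = smoke_check([sys.executable, "--version"])
```

Asserting on `exit_code` and `has_output` rather than on exact text keeps the lane aligned with the rule above: the oracle is stable, scriptable exit behavior, not one text block.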

### 2. Headless `--print` lane

The released CLI exposed several parity-critical headless behaviors:

- a minimal one-shot prompt path in `--print` mode
- a cheaper `--bare` posture that suppresses much of the normal startup enrichment
- a budget cap path where a low `--max-budget-usd` can fail before any useful assistant text arrives
- a JSON envelope path whose result contains both metadata and a human-readable `result`, not only raw assistant text
- JSON-schema validation that reports structured output separately from the human result text
- cwd-local continue behavior, where `-c` or `--continue` can answer questions about the previous turn without manually passing a session ID

Equivalent tests should explicitly cover both success and failure envelopes. A rebuild that only checks plain-text success misses one of the most important public automation surfaces.
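
One way to cover both envelope kinds is a shape checker that returns a list of problems instead of raising. The sketch below assumes only what was observed: the envelope is a JSON object that carries a human-readable `result` next to some metadata. The function name `check_envelope` and the `session_id` key in the usage lines are hypothetical. A failure-envelope test would pass a different `required_keys` tuple once the real failure shape has been recorded.

```python
import json


def check_envelope(raw, required_keys=("result",)):
    """Return a list of problems found in one headless JSON envelope."""
    try:
        envelope = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(envelope, dict):
        return ["envelope is not a JSON object"]

    problems = [f"missing key: {key}" for key in required_keys if key not in envelope]
    # The human-readable text must be a string when present.
    if "result" in envelope and not isinstance(envelope["result"], str):
        problems.append("result is not a string")
    # An envelope that carries nothing but the text has no metadata at all.
    if set(envelope) <= {"result"}:
        problems.append("no metadata alongside result")
    return problems


# "session_id" is a made-up metadata key used only for illustration.
ok = check_envelope('{"result": "done", "session_id": "abc"}')
bare_text = check_envelope('{"result": "done"}')
```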

### 3. Stream-JSON transport lane

The released CLI's stream mode exhibited several externally visible rules:

- `--print` plus `--output-format=stream-json` requires `--verbose`
- the stream starts with a system init event before assistant output
- a valid user input event can be supplied over stdin as JSON
- `--replay-user-messages` re-emits the incoming user event on stdout
- partial assistant events and a final result event are separate concepts
- the current release may emit extra assistant-side content blocks in verbose stream mode beyond the final plain-text answer

The important rebuild rule is not to hardcode today's exact event inventory. It is to make the wire contract explicit and testable:

- init handshake
- per-event framing
- replay behavior
- partial-versus-final separation
- predictable error handling for malformed input
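
The framing and ordering parts of that contract can be sketched as a checker over captured stdout lines. Everything named here is an assumption for illustration: the `type` field, the `"system"` and `"result"` values, and the rule that a single-turn run ends with exactly one final event. They are parameters so a test can be adjusted to whatever the recorded stream actually contains.

```python
import json


def check_stream(lines, init_type="system", final_type="result"):
    """Return contract violations for a sequence of NDJSON stream lines."""
    events = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            return [f"line {number}: not valid JSON"]
        if not isinstance(event, dict) or "type" not in event:
            return [f"line {number}: event has no type"]
        events.append(event)

    if not events:
        return ["stream is empty"]

    violations = []
    if events[0]["type"] != init_type:
        violations.append("first event is not the init event")
    finals = [i for i, e in enumerate(events) if e["type"] == final_type]
    if len(finals) != 1:
        violations.append(f"expected exactly one final event, found {len(finals)}")
    elif finals[0] != len(events) - 1:
        violations.append("final event is not last")
    return violations


# Synthetic stream with hypothetical event types, not a captured transcript.
sample = ['{"type": "system"}', '{"type": "assistant"}', '{"type": "result"}']
sample_violations = check_stream(sample)
```

The checker deliberately ignores which intermediate event types appear, which matches the rule above about not hardcoding today's event inventory.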

### 4. Interactive startup and trust lane

A fresh interactive session in a new local workspace did not drop straight into a plain prompt. It first asked the user whether the folder was trusted, then entered a richer startup dashboard with project identity, tips, and recent-session context.

Equivalent tests should protect:

- first-entry trust gating for an unapproved workspace
- persistence of that trust decision for later launches
- a startup dashboard rather than a bare REPL prompt
- a discoverable shortcut overlay
- predictable terminal-exit handling, including the observed double-`Ctrl-C` confirmation flow

If a rebuild only tests a plain line editor, it will miss the public startup contract users feel first.

### 5. Permissioned coding lane

The most important real-work oracle was a tiny bugfix session in a temporary git workspace:

- the assistant proposed a shell command to run tests and triggered a permission prompt
- the failing test output was summarized in the UI instead of dumping the entire log at full height
- file reads surfaced as small activity summaries
- an edit proposal rendered as a diff preview and required a separate approval
- remembered approval was scoped to the specific operation class, not globally to every later tool action
- after the edit, the assistant reran tests, observed success, and gave a short root-cause explanation

This lane matters more than many synthetic tool-loop tests because it captures the actual rhythm a coding user depends on:

- inspect
- execute
- approve
- edit
- verify
- summarize
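
The scoping rule for remembered approvals is easy to lose in a rebuild, so it helps to state it as a tiny model that a parity test can compare against. This is a hypothetical test double, not the product's permission system: the class name and the operation-class labels are invented for illustration.

```python
class ApprovalMemory:
    """Remembers approvals per operation class, never globally."""

    def __init__(self):
        self._remembered = set()

    def remember(self, operation_class):
        """Record that the user chose to remember approval for one class."""
        self._remembered.add(operation_class)

    def needs_prompt(self, operation_class):
        """A class that was never remembered must still prompt."""
        return operation_class not in self._remembered


memory = ApprovalMemory()
memory.remember("shell:run-tests")  # hypothetical class label
```

After remembering the shell approval, `memory.needs_prompt("edit:file")` still returns `True`, which mirrors the observation that the edit proposal required a separate approval.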

### 6. Durable artifact and resume lane

The released CLI wrote per-project state under `~/.claude/projects/...`, using a sanitized project-path key. The durable transcript for the interactive session was stored as JSONL and included at least these user-visible event families:

- user turns
- assistant tool requests
- tool results
- edit records with structured patch information
- final assistant text

Two subtle but important persistence facts also showed up:

- `--no-session-persistence` still created the project-scoped directory and memory folder, even though it did not create a normal session transcript
- cwd-based continue could recover the last session's context without an explicit session ID

A rebuild should therefore test both positive persistence and negative persistence. "No session persistence" is not the same thing as "zero filesystem side effects."
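
Both checks can be sketched with two small helpers: one counts transcript records by family, the other reports whether a project-scoped directory holds any transcript at all. The `type` field name, the `*.jsonl` layout directly inside the project directory, and the record contents below are assumptions for illustration. The example writes to a throwaway directory rather than real user state.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def transcript_families(path, key="type"):
    """Count JSONL records by the value of `key` (field name is an assumption)."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                counts[record.get(key, "<missing>")] += 1
    return counts


def has_session_transcript(project_dir):
    """True if the project-scoped directory holds any JSONL transcript."""
    return any(Path(project_dir).glob("*.jsonl"))


# A throwaway directory stands in for a project-scoped state folder.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp)
    empty_before = has_session_transcript(project_dir)
    (project_dir / "session.jsonl").write_text(
        '{"type": "user"}\n{"type": "assistant"}\n{"type": "user"}\n',
        encoding="utf-8",
    )
    present_after = has_session_transcript(project_dir)
    families = transcript_families(project_dir / "session.jsonl")
```

A negative-persistence test would assert that the directory exists while `has_session_transcript` stays false. A positive test would assert that every expected event family has a nonzero count.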

## Current clean-room gap check

The current Python rebuild already protects several useful local lanes:

- local interactive prompt loop and slash commands
- prompt history and explicit session resume by ID
- structured NDJSON request-response control flow
- scenario goldens for review, init, tool loops, permission probes, and compaction
- basic interactive approval prompting

That is a strong local foundation. It is not yet the same thing as released-CLI E2E parity.

### Capability gaps still visible

The shipped CLI behaviors above imply product areas that the current rebuild does not yet expose as first-class runtime surfaces:

- trust gate and trust persistence
- startup dashboard and richer terminal startup shell
- public auth, doctor, plugin, MCP, install, update, and auto-mode command families
- released-style headless `--print` flag matrix and envelopes
- public cwd-based `--continue` / `--resume` entrypoint routing
- richer approval dialogs with remembered decisions scoped by tool or command class

Those are implementation gaps, not merely missing assertions.

### Test gaps even where adjacent features already exist

Even in the areas the rebuild has started, the following cases remain under-protected relative to the released CLI:

- headless budget-cap failure envelopes
- successful schema validation in the public CLI envelope shape, not only in an internal structured server response
- `--output-format=stream-json` gating on `--verbose`
- replayed user-message echoes and partial assistant events in stream mode
- negative persistence semantics for `--no-session-persistence`
- cwd-based continue after a prior coding turn
- remembered approval scopes for shell commands versus edit approvals
- deny-path behavior for shell or edit approvals in the same real-work session
- durable session-artifact assertions that inspect the stored transcript shape, not only in-memory summaries
- non-TTY behavior for commands that actually require raw terminal capabilities

These should become explicit parity tests before claiming a rebuild feels end-to-end correct.

## Reconstruction rule

Use this leaf as the public-runtime complement to the source-derived verification leaves:

- source-derived leaves explain what the product architecture must preserve
- this leaf explains what a shipped build actually feels like when exercised end to end

A faithful rebuild should keep both. If they disagree, prefer the narrower claim and investigate. Public-runtime behavior is excellent for E2E oracles, but it does not by itself reveal why the product was built that way.

## Failure modes

- **headless false confidence**: the rebuild passes local JSON tests but its public envelope and error modes do not match a released CLI
- **transport simplification drift**: stream mode works only for a request-response lab client and not for real event-stream consumers
- **startup flattening**: the rebuild starts in a bare prompt and skips trust, dashboard, or session-restoration cues users rely on
- **permission amnesia**: approvals exist, but remembered scopes and deny behavior do not match real coding sessions
- **resume illusion**: explicit session IDs work, but cwd-local continuation and durable transcript recovery do not
- **negative-path blind spot**: low-budget, malformed-input, non-TTY, and no-persistence behaviors are untested even though real users hit them first