agent-team-foundation · bingran-you · Apr 9, 2026 · Apr 9, 2026
@@ -16,6 +16,7 @@ This subdomain captures cross-cutting knowledge about how the observed Claude Co
 Relevant leaves:
 
 - **[test-framework-overview.md](test-framework-overview.md)** — The layered shape of the current test system, including the visible tier model and the boundary between confirmed and inferred runner details.
+- **[real-cli-e2e-scenario-corpus.md](real-cli-e2e-scenario-corpus.md)** — A live-observed black-box scenario set for validating whether a rebuild behaves like a real Claude Code CLI across startup, headless runs, session continuity, structured I/O, and diagnostics.
 - **[test-runtime-mode-and-determinism.md](test-runtime-mode-and-determinism.md)** — How `NODE_ENV=test` behaves as a supported runtime posture, including in-memory config behavior, reduced side effects, and deterministic test-only branches.
 - **[test-environment-fixtures-and-ci-fail-closed-policy.md](test-environment-fixtures-and-ci-fail-closed-policy.md)** — How test posture suppresses side effects, how fixture replay works, and why missing recordings fail closed in CI.
 - **[test-lane-coverage-map.md](test-lane-coverage-map.md)** — Which subsystem contracts are guarded by fast regression, integration, end-to-end, conformance, and compatibility lanes, without overclaiming the hidden runner layout.

@@ -0,0 +1,229 @@
+---
+title: "Real CLI E2E Scenario Corpus"
+owners: [bingran-you]
+soft_links:
+  - /platform-services/interactive-startup-and-project-activation.md
+  - /platform-services/doctor-command-and-health-diagnostics.md
+  - /product-surface/interaction-modes.md
+  - /product-surface/session-utility-commands.md
+  - /runtime-orchestration/sessions/resume-path.md
+  - /integrations/clients/structured-io-and-headless-session-loop.md
+  - /tools-and-permissions/permissions/permission-mode-transitions-and-gates.md
+  - /tools-and-permissions/filesystem-and-shell/path-and-filesystem-safety.md
+---
+
+# Real CLI E2E Scenario Corpus
+
+This leaf captures a black-box test corpus derived from running a real local Claude Code CLI, not from reconstructing its internals. The goal is to give a rebuild effort concrete end-to-end handles: which behaviors to probe first, which outputs should be treated as stable protocol, and which observations are real but too version-sensitive to use as hard golden assertions.
+
+The corpus exists because the tree already described many subsystem contracts, but it still lacked one operator-facing document that answers a simpler question: if a faithful clone is "working," what should a real terminal session, a real headless run, and a real session-resume flow actually feel like from the outside?
+
+## Observation boundary
+
+Observed on macOS from a working local Claude Code CLI on April 9, 2026.
+
+Observed surfaces:
+
+- interactive `claude`
+- headless `claude -p`
+- `--output-format json`
+- `--json-schema`
+- `--input-format stream-json --output-format stream-json --verbose`
+- `-c`, `-r`, and `--fork-session`
+- `--no-session-persistence`
+- `claude doctor` in both non-TTY and TTY contexts
+
+Observed environment notes:
+
+- an initial version check returned `2.1.27`
+- interactive startup later auto-updated the installed CLI to `2.1.97`
+- exact version numbers are not the contract; the load-bearing fact is that install and update state can visibly change the product surface between runs
+
+Fixture shape used for these observations:
+
+- one small main workspace containing `README.md`, `todo.txt`, and editable scratch files
+- one sibling directory outside that workspace for path-boundary probing
+- one isolated fresh directory for `--no-session-persistence` checks
+
+## How to use this corpus
+
+Treat each scenario as a product contract, not as a screenshot test.
+
+Hard assertions should prefer:
+
+- exit behavior
+- whether a session is persisted or not
+- presence of `session_id`, `structured_output`, or typed stream events
+- whether the tool actually obtained live file contents
+- whether a follow-up command can recover prior context
+
+Soft assertions should avoid overfitting to:
+
+- ASCII art, welcome-card layout, or ANSI paint
+- exact prose wording of help or trust text
+- exact ordering of non-essential diagnostic rows
+- exact version numbers shown after auto-update
+
+## Core scenarios
+
+### R01. Interactive startup gates a new workspace behind trust
+
+- Entry: run `claude` in a directory that has not yet been trusted interactively.
+- Expect: the first interactive surface is a workspace-trust gate, not an immediate model turn.
+- Expect: the gate explicitly tells the user that Claude Code will be able to read, edit, and execute files in the folder.
+- Expect: accepting trust transitions into a persistent REPL-style shell rather than a one-shot response.
+- Failure signal: a fresh workspace drops directly into tool-capable execution with no trust boundary.
+- Why it matters: startup trust is part of the product, not just a local policy detail.
+
+### R02. Interactive startup lands in a session shell, not a raw request runner
+
+- Entry: accept the trust gate and let the interactive UI finish booting.
+- Expect: a persistent shell appears with a prompt area, shortcut hinting, and session-scoped welcome state.
+- Expect: the user is entering a conversation shell that can continue, not just sending a single request.
+- Failure signal: the rebuild treats `claude` and `claude -p` as the same surface with different formatting.
+- Why it matters: the interactive product is a session container first.
+
+### R03. Headless `-p` is the minimum viability oracle
+
+- Entry: run `claude -p "Reply with exactly READY"`.
+- Expect: the process exits successfully with plain text output and no interactive shell.
+- Failure signal: even the simplest one-shot prompt requires extra protocol framing or a TTY.
+- Why it matters: this is the cheapest smoke test for auth, model reachability, and non-interactive execution.
+
+### R04. `--output-format json` gives a result envelope, not a schema-stable answer body
+
+- Entry: run `claude -p --output-format json` on a prompt that reads a local file and answers structurally.
+- Expect: the final payload includes fields such as `result`, `session_id`, usage and cost metadata, and permission-denial reporting.
+- Expect: the `result` field is still a human-facing answer string and may contain Markdown or code fences.
+- Failure signal: a rebuild assumes `result` itself is the machine-safe contract.
+- Why it matters: human-readable output and machine-readable output are separate concerns.
+
+### R05. `--json-schema` creates a dedicated machine channel
+
+- Entry: run `claude -p --output-format json --json-schema ...`.
+- Expect: the final envelope still contains a prose `result`, but also carries a separate `structured_output` object that matches the schema.
+- Expect: schema enforcement changes completion semantics, not just pretty-print formatting.
+- Failure signal: a rebuild only rewrites the prose answer and never exposes a separately validated structured payload.
+- Why it matters: this is the stable contract for downstream automation.
+
+### R06. `-r <session_id>` restores remembered context
+
+- Entry: create a saved session with a unique token, then resume it by explicit session ID.
+- Expect: the resumed session can answer questions that depend on prior turns.
+- Failure signal: resume reopens a transcript shell but loses the working conversational state that the next turn depends on.
+- Why it matters: session restoration is more than log playback.
+
+### R07. `-c` continues the most recent saved session for the current directory
+
+- Entry: after a saved headless run in one workspace, run `claude -p -c ...` in that same workspace.
+- Expect: the follow-up prompt continues the latest saved conversation for that directory rather than starting cold.
+- Failure signal: `-c` ignores workspace-local session history or chooses an unrelated global session.
+- Why it matters: directory-scoped continuation is one of the fastest everyday loops.
+
+### R08. `--fork-session` keeps context but produces a new session identity
+
+- Entry: resume a known session with `-r <session_id> --fork-session`.
+- Expect: prior context is still available, but the returned `session_id` is new.
+- Failure signal: fork either mutates the original session in place or loses the prior conversation context.
+- Why it matters: branching a session is a different contract from simply reopening it.
+
+### R09. `--no-session-persistence` must break later continuation
+
+- Entry: in a brand-new directory, run one headless prompt with `--no-session-persistence`, then immediately run `claude -p -c ...`.
+- Expect: the second command behaves like no saved session exists for that directory.
+- Failure signal: ephemeral sessions still leak into the resume index or can be discovered by `-c`.
+- Why it matters: automation-grade ephemeral runs need a true non-persistent mode.
+
+### R10. Variadic tool flags need explicit option termination
+
+- Entry: call `claude -p` with `--allowedTools` or `--tools` and a prompt argument.
+- Expect: harnesses use `--` or `--flag=value` style when a variadic option is followed by the prompt.
+- Observed reality: omitting that termination caused the CLI to report that no prompt had been provided, because the variadic tool flag consumed the remaining argv.
+- Failure signal: the E2E harness itself feeds malformed argv and misreads the resulting error as a product failure.
+- Why it matters: a real clone should match the public CLI grammar, and the test harness should call it correctly.
+
+### R11. Tool allowlists must still permit real file reads when configured positively
+
+- Entry: run `claude -p --allowedTools=Read -- "Read README.md and tell me its title only."`
+- Expect: the answer reflects the actual file contents from the live workspace.
+- Failure signal: the allowlist is ignored or the narrowed tool pool loses the ability to perform the admitted read.
+- Why it matters: tool-shaping needs a positive-path oracle, not only denial tests.
+
+### R12. Disabled tools must prevent real file access even if the session still completes
+
+- Entry: run `claude -p --tools '' -- "Read README.md and tell me its title only."`
+- Minimum contract: the session may still produce a final answer, but it must not have real filesystem access.
+- Observed reality: the current build completed the session but emitted pseudo function-call markup and did not recover the actual file contents.
+- Failure signal: disabled-tool runs still read the file successfully.
+- Why it matters: capability loss and process failure are separate; a clone needs tests for both.
+
+### R13. Structured stream output requires `--verbose`
+
+- Entry: run `claude -p --input-format stream-json --output-format stream-json` without `--verbose`.
+- Expect: the CLI fails closed and explains that stream-json output requires verbose mode.
+- Failure signal: a rebuild silently downgrades to another output format or emits partial JSON without the documented gate.
+- Why it matters: protocol surfaces need explicit mode gating.
+
+### R14. Structured input begins with a live `system/init` frame
+
+- Entry: send a valid NDJSON user message through `--input-format stream-json --output-format stream-json --verbose`.
+- Expect: the first output frame is a `system/init` event containing the live `session_id`, tool inventory, model, permission mode, slash-command list, agents, and related bootstrap metadata.
+- Failure signal: the rebuild hides startup catalog state inside an undocumented side channel or only emits a final answer.
+- Why it matters: structured clients need an authoritative initialization record before the first turn completes.
+
+### R15. Structured input validates top-level message types
+
+- Entry: send a malformed stream message whose top-level `type` is not accepted.
+- Expect: the CLI rejects it with a protocol error instead of silently coercing it into transcript text.
+- Observed reality: `user_message` was rejected, and the accepted top-level types were reported as `user` or `control`.
+- Failure signal: protocol drift is accepted quietly and later breaks session semantics.
+- Why it matters: fail-closed protocol validation is part of the SDK-facing contract.
+
+### R16. Replay and partial streaming produce typed lifecycle events
+
+- Entry: run structured I/O with `--replay-user-messages --include-partial-messages`.
+- Expect: the stream contains lifecycle events such as `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, and `message_stop`.
+- Expect: replayed user messages are echoed back as typed user events with replay metadata rather than being folded into assistant text.
+- Expect: a final `result` envelope still arrives after the stream events.
+- Failure signal: clients must scrape terminal prose because lifecycle events never materialize.
+- Why it matters: replica SDK clients need a typed stream, not a screen parser.
+
+### R17. `claude doctor` is a TTY-sensitive operational surface
+
+- Entry: run `claude doctor` once without a TTY and once with a TTY.
+- Expect: non-TTY invocation can fail on terminal raw-mode requirements rather than pretending to be a plain text subcommand.
+- Expect: TTY invocation opens an operational diagnostics view with install, updater, and version-lock information, then dismisses via a continue prompt.
+- Failure signal: a rebuild treats `doctor` as a static text dump and loses its terminal-UI contract.
+- Why it matters: operational health is part of the product surface, not just a hidden admin API.
+
+## Drift-sensitive or investigative scenarios
+
+These scenarios are still worth tracking, but they should not become brittle golden tests until the target build intentionally fixes their semantics.
+
+### X01. Interactive startup can mutate the installed version between runs
+
+- Observed reality: `claude -v` returned `2.1.27` before live usage and `2.1.97` after interactive startup triggered installer migration and auto-update messaging.
+- Testing advice: assert that update state is surfaced and diagnosable, not that a specific version transition must happen.
+
+### X02. `--add-dir` is not yet a reliable denial oracle for headless reads
+
+- Observed reality: a sibling-directory file read succeeded in the current headless test even before `--add-dir` was supplied.
+- Testing advice: keep `--add-dir` in the corpus as an exploratory lane, but do not assume a missing flag must always produce a denial until the rebuild defines its exact filesystem boundary policy.
+
+### X03. Headless permission modes do not necessarily mirror interactive approval UX
+
+- Observed reality: same-workspace file edits succeeded in `default`, `bypassPermissions`, and `dontAsk` headless runs once edit tools were admitted.
+- Testing advice: keep separate E2E lanes for interactive approval prompts versus non-interactive automation, and do not assume permission modes differentiate every same-directory write in `-p`.
+
+## Recommended implementation order for a clone
+
+If the rebuild does not already exist, the fastest confidence-building order is:
+
+1. `R03`, `R04`, and `R05` for one-shot headless viability and machine-readable output.
+2. `R06`, `R07`, `R08`, and `R09` for real session persistence semantics.
+3. `R13`, `R14`, `R15`, and `R16` for typed SDK or headless transport.
+4. `R01`, `R02`, and `R17` for interactive shell reality and operational diagnostics.
+5. `R10`, `R11`, and `R12` for CLI grammar and tool-surface shaping.
+6. `X01`, `X02`, and `X03` as drift watchpoints once the core product is stable.
+
+That order gives the rebuild an end-to-end test ladder that starts with cheap black-box commands and only later depends on full-screen terminal UX or ambiguous permission policy edges.
@@ -2,6 +2,7 @@
 title: "Test Framework Overview"
 owners: [bingran-you]
 soft_links:
+  - /reconstruction-guardrails/verification-and-native-test-oracles/real-cli-e2e-scenario-corpus.md
   - /reconstruction-guardrails/verification-and-native-test-oracles/test-runtime-mode-and-determinism.md
   - /reconstruction-guardrails/verification-and-native-test-oracles/test-environment-fixtures-and-ci-fail-closed-policy.md
   - /reconstruction-guardrails/verification-and-native-test-oracles/test-lane-coverage-map.md
@@ -20,6 +21,8 @@ soft_links:
 
 The current Claude Code snapshot does not expose one self-contained `tests/` directory or runner manifest that answers everything. What it does expose is a layered testing architecture that spans runtime posture, fixtures, dedicated end-to-end harnesses, conformance-sensitive auth flows, and domain-owned contract oracles.
 
+This domain also keeps a live-observed black-box oracle set in [real-cli-e2e-scenario-corpus.md](real-cli-e2e-scenario-corpus.md). That corpus complements the source-snapshot-derived framework view here by recording what a real working CLI actually does when exercised through its public entrypoints.
+
 ## Confirmed layers
 
 The snapshot provides direct signals for all of these verification layer families, even though it does not expose every upstream runner entrypoint: