From 3ad9e6edd6177478a8c2fbd5c16e37f3c09f06d2 Mon Sep 17 00:00:00 2001
From: Bingran You <bingran.you@berkeley.edu>
Date: Thu, 9 Apr 2026 10:11:10 -0700
Subject: [PATCH 1/2] docs: add released Claude CLI E2E test set

---
 .../NODE.md                                   |   1 +
 .../released-cli-e2e-test-set.md              | 200 ++++++++++++++++++
 2 files changed, 201 insertions(+)
 create mode 100644 reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md

diff --git a/reconstruction-guardrails/verification-and-native-test-oracles/NODE.md b/reconstruction-guardrails/verification-and-native-test-oracles/NODE.md
index 9e1b46f..627597b 100644
--- a/reconstruction-guardrails/verification-and-native-test-oracles/NODE.md
+++ b/reconstruction-guardrails/verification-and-native-test-oracles/NODE.md
@@ -20,6 +20,7 @@ Relevant leaves:
 - **[test-environment-fixtures-and-ci-fail-closed-policy.md](test-environment-fixtures-and-ci-fail-closed-policy.md)** — How test posture suppresses side effects, how fixture replay works, and why missing recordings fail closed in CI.
 - **[test-lane-coverage-map.md](test-lane-coverage-map.md)** — Which subsystem contracts are guarded by fast regression, integration, end-to-end, conformance, and compatibility lanes, without overclaiming the hidden runner layout.
 - **[e2e-harness-reality-boundaries.md](e2e-harness-reality-boundaries.md)** — Which end-to-end harnesses may shorten setup but still need to preserve real permission, transport, auth-proxy, and credential-cache paths.
+- **[released-cli-e2e-test-set.md](released-cli-e2e-test-set.md)** — Public-runtime end-to-end oracles gathered by exercising a shipped Claude CLI build against a real local workspace, plus the parity-critical cases a rebuild must not skip.
 - **[test-seams-reset-hooks-and-injected-dependencies.md](test-seams-reset-hooks-and-injected-dependencies.md)** — The narrow seams the product uses to keep hard behaviors testable without turning the whole runtime into a debug harness.
 - **[native-test-derived-asset-provenance-and-acceptance-rules.md](native-test-derived-asset-provenance-and-acceptance-rules.md)** — How native test knowledge should be normalized into clean-room contract assets and how those assets should be linked back to their owning domains.
 - **[evidence-levels-and-missing-artifacts.md](evidence-levels-and-missing-artifacts.md)** — What this source snapshot proves, what it only strongly suggests, and which missing artifacts still block exact runner-level reproduction.
diff --git a/reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md b/reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md
new file mode 100644
index 0000000..c7e1285
--- /dev/null
+++ b/reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md
@@ -0,0 +1,200 @@
+---
+title: "Released CLI E2E Test Set"
+owners: [bingran-you]
+soft_links:
+  - /integrations/clients/structured-io-and-headless-session-loop.md
+  - /platform-services/workspace-trust-dialog-and-persistence.md
+  - /platform-services/session-cost-accounting-and-restoration.md
+  - /runtime-orchestration/sessions/resume-path.md
+  - /ui-and-experience/startup-and-onboarding/startup-welcome-dashboard-and-feed-rotation.md
+  - /ui-and-experience/dialogs-and-approvals/permission-prompt-shell-and-worker-states.md
+  - /ui-and-experience/dialogs-and-approvals/structured-diff-rendering-and-highlight-fallback.md
+  - /tools-and-permissions/permissions/e2e-permission-testing-contracts.md
+  - /product-surface/interaction-modes.md
+---
+
+# Released CLI E2E Test Set
+
+Source-derived contracts are still the primary clean-room evidence for this tree, but they are not enough on their own for end-to-end rebuild work. A released CLI can be exercised directly, and its public runtime behavior becomes a second kind of oracle: not hidden implementation, but what a real user actually experiences.
+
+This leaf captures that public-runtime oracle set from a local run of the shipped `claude` CLI on April 9, 2026. The observed build reported version `2.1.89`, authenticated successfully through a Foundry-backed account, and was exercised in both headless and interactive modes on a local macOS terminal.
+
+## Why this leaf exists
+
+A clean-room rebuild can easily pass its own local tests while still feeling unlike the real product at the edges that matter most:
+
+- startup and trust gating
+- headless envelope shape
+- stream transport behavior
+- permission prompts and remembered approvals
+- session durability and cwd-based resume
+- the rhythm of a real coding turn that reads, edits, reruns, and summarizes
+
+Those are all externally visible contracts. They can and should become explicit E2E test targets.
+
+## Evidence boundary
+
+This leaf records only public behavior that was directly observed from the shipped CLI:
+
+- public command-line flags and subcommands
+- public terminal UI flows
+- files the CLI itself wrote into the local user state directory
+- session recovery behavior visible through subsequent CLI invocations
+
+It should not become a transcript dump. Keep raw logs local, and normalize them here into assertions, scenario shapes, and failure modes.
+
+## Mandatory scenario families
+
+### 1. Discovery and health smoke lane
+
+A rebuild should have a fast lane that exercises the released binary surface before any deeper coding workflow:
+
+- `--help` and `--version` must succeed and expose the current command families
+- auth health should be queryable without opening the TUI, including both machine-readable and human-readable status
+- agent discovery should be externally visible from a top-level command, not only from inside an interactive session
+- auto-mode or policy classification should expose an inspectable effective config, not just hidden defaults
+
+The oracle is not one exact text block. The oracle is that these are real, scriptable health surfaces with stable exit behavior.
+
+### 2. Headless `--print` lane
+
+The released CLI exposed several parity-critical headless behaviors:
+
+- a minimal one-shot prompt path in `--print` mode
+- a cheaper `--bare` posture that suppresses much of the normal startup enrichment
+- a budget cap path where a low `--max-budget-usd` can fail before any useful assistant text arrives
+- a JSON envelope path whose result contains both metadata and a human-readable `result`, not only raw assistant text
+- JSON-schema validation that reports structured output separately from the human result text
+- cwd-local continue behavior, where `-c` or `--continue` can answer questions about the previous turn without manually passing a session ID
+
+Equivalent tests should explicitly cover both success and failure envelopes. A rebuild that only checks plain-text success misses one of the most important public automation surfaces.
+
+### 3. Stream-JSON transport lane
+
+The released CLI's stream mode proved several externally visible rules:
+
+- `--print` plus `--output-format=stream-json` requires `--verbose`
+- the stream starts with a system init event before assistant output
+- a valid user input event can be supplied over stdin as JSON
+- `--replay-user-messages` re-emits the incoming user event on stdout
+- partial assistant events and a final result event are separate concepts
+- the current release may emit extra assistant-side content blocks in verbose stream mode beyond the final plain-text answer
+
+The important rebuild rule is not to hardcode today's exact event inventory. It is to make the wire contract explicit and testable:
+
+- init handshake
+- per-event framing
+- replay behavior
+- partial-versus-final separation
+- predictable error handling for malformed input
+
+### 4. Interactive startup and trust lane
+
+A fresh interactive session in a new local workspace did not drop straight into a plain prompt. It first asked the user whether the folder was trusted, then entered a richer startup dashboard with project identity, tips, and recent-session context.
+
+Equivalent tests should protect:
+
+- first-entry trust gating for an unapproved workspace
+- persistence of that trust decision for later launches
+- a startup dashboard rather than a bare REPL prompt
+- a discoverable shortcut overlay
+- predictable terminal-exit handling, including the observed double-`Ctrl-C` confirmation flow
+
+If a rebuild only tests a plain line editor, it will miss the public startup contract users feel first.
+
+### 5. Permissioned coding lane
+
+The most important real-work oracle was a tiny bugfix session in a temporary git workspace:
+
+- the assistant proposed a shell command to run tests and triggered a permission prompt
+- the failing test output was summarized in the UI instead of dumping the entire log at full height
+- file reads surfaced as small activity summaries
+- an edit proposal rendered as a diff preview and required a separate approval
+- remembered approval was scoped to the specific operation class, not globally to every later tool action
+- after the edit, the assistant reran tests, observed success, and gave a short root-cause explanation
+
+This lane matters more than many synthetic tool-loop tests because it captures the actual rhythm a coding user depends on:
+
+- inspect
+- execute
+- approve
+- edit
+- verify
+- summarize
+
+### 6. Durable artifact and resume lane
+
+The released CLI wrote per-project state under `~/.claude/projects/...`, using a sanitized project-path key. The durable transcript for the interactive session was stored as JSONL and included at least these user-visible event families:
+
+- user turns
+- assistant tool requests
+- tool results
+- edit records with structured patch information
+- final assistant text
+
+Two subtle but important persistence facts also showed up:
+
+- `--no-session-persistence` still created the project-scoped directory and memory folder, even though it did not create a normal session transcript
+- cwd-based continue could recover the last session's context without an explicit session ID
+
+A rebuild should therefore test both positive persistence and negative persistence. "No session persistence" is not the same thing as "zero filesystem side effects."
+
+## Current clean-room gap check
+
+The current Python rebuild already protects several useful local lanes:
+
+- local interactive prompt loop and slash commands
+- prompt history and explicit session resume by ID
+- structured NDJSON request-response control flow
+- scenario goldens for review, init, tool loops, permission probes, and compaction
+- basic interactive approval prompting
+
+That is a strong local foundation. It is not yet the same thing as released-CLI E2E parity.
+
+### Capability gaps still visible
+
+The shipped CLI behaviors above imply product areas that the current rebuild does not yet expose as first-class runtime surfaces:
+
+- trust gate and trust persistence
+- startup dashboard and richer terminal startup shell
+- public auth, doctor, plugin, MCP, install, update, and auto-mode command families
+- released-style headless `--print` flag matrix and envelopes
+- public cwd-based `--continue` / `--resume` entrypoint routing
+- richer approval dialogs with remembered decisions scoped by tool or command class
+
+Those are implementation gaps, not merely missing assertions.
+
+### Test gaps even where adjacent features already exist
+
+Even in the areas the rebuild has started, the following cases remain under-protected relative to the released CLI:
+
+- headless budget-cap failure envelopes
+- successful schema validation in the public CLI envelope shape, not only in an internal structured server response
+- `--output-format=stream-json` gating on `--verbose`
+- replayed user-message echoes and partial assistant events in stream mode
+- negative persistence semantics for `--no-session-persistence`
+- cwd-based continue after a prior coding turn
+- remembered approval scopes for shell commands versus edit approvals
+- deny-path behavior for shell or edit approvals in the same real-work session
+- durable session-artifact assertions that inspect the stored transcript shape, not only in-memory summaries
+- non-TTY behavior for commands that actually require raw terminal capabilities
+
+These should become explicit parity tests before claiming a rebuild feels end-to-end correct.
+
+## Reconstruction rule
+
+Use this leaf as the public-runtime complement to the source-derived verification leaves:
+
+- source-derived leaves explain what the product architecture must preserve
+- this leaf explains what a shipped build actually feels like when exercised end to end
+
+A faithful rebuild should keep both. If they disagree, prefer the narrower claim and investigate. Public-runtime behavior is excellent for E2E oracles, but it does not by itself reveal why the product was built that way.
+
+## Failure modes
+
+- **headless false confidence**: the rebuild passes local JSON tests but its public envelope and error modes do not match a released CLI
+- **transport simplification drift**: stream mode works only for a request-response lab client and not for real event-stream consumers
+- **startup flattening**: the rebuild starts in a bare prompt and skips trust, dashboard, or session-restoration cues users rely on
+- **permission amnesia**: approvals exist, but remembered scopes and deny behavior do not match real coding sessions
+- **resume illusion**: explicit session IDs work, but cwd-local continuation and durable transcript recovery do not
+- **negative-path blind spot**: low-budget, malformed-input, non-TTY, and no-persistence behaviors are untested even though real users hit them first

From c980f6cd82b88bb9298b29cf71fffdeb94ca6902 Mon Sep 17 00:00:00 2001
From: Bingran You <bingran.you@berkeley.edu>
Date: Thu, 9 Apr 2026 13:20:07 -0700
Subject: [PATCH 2/2] docs: expand released Claude CLI native coverage matrix

---
 .../released-cli-e2e-test-set.md              | 380 +++++++++++-------
 1 file changed, 230 insertions(+), 150 deletions(-)

diff --git a/reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md b/reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md
index c7e1285..727ccad 100644
--- a/reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md
+++ b/reconstruction-guardrails/verification-and-native-test-oracles/released-cli-e2e-test-set.md
@@ -5,11 +5,15 @@ soft_links:
   - /integrations/clients/structured-io-and-headless-session-loop.md
   - /platform-services/workspace-trust-dialog-and-persistence.md
   - /platform-services/session-cost-accounting-and-restoration.md
+  - /platform-services/auth-config-and-policy.md
   - /runtime-orchestration/sessions/resume-path.md
   - /ui-and-experience/startup-and-onboarding/startup-welcome-dashboard-and-feed-rotation.md
   - /ui-and-experience/dialogs-and-approvals/permission-prompt-shell-and-worker-states.md
   - /ui-and-experience/dialogs-and-approvals/structured-diff-rendering-and-highlight-fallback.md
   - /tools-and-permissions/permissions/e2e-permission-testing-contracts.md
+  - /integrations/plugins/plugin-management-and-marketplace-flows.md
+  - /integrations/mcp/config-layering-policy-and-dedup.md
+  - /collaboration-and-agents/remote-control-entrypoints-and-startup-preferences.md
   - /product-surface/interaction-modes.md
 ---
 
@@ -17,184 +21,260 @@ soft_links:
 
 Source-derived contracts are still the primary clean-room evidence for this tree, but they are not enough on their own for end-to-end rebuild work. A released CLI can be exercised directly, and its public runtime behavior becomes a second kind of oracle: not hidden implementation, but what a real user actually experiences.
 
-This leaf captures that public-runtime oracle set from a local run of the shipped `claude` CLI on April 9, 2026. The observed build reported version `2.1.89`, authenticated successfully through a Foundry-backed account, and was exercised in both headless and interactive modes on a local macOS terminal.
+This leaf captures that public-runtime oracle set from a local macOS run of the shipped `claude` CLI on April 9, 2026.
 
-## Why this leaf exists
+## Snapshot and evidence boundary
 
-A clean-room rebuild can easily pass its own local tests while still feeling unlike the real product at the edges that matter most:
+The authoritative evidence in this leaf comes from the native Mach-O binary at `~/.local/bin/claude`, which reported version `2.1.96 (Claude Code)`.
 
-- startup and trust gating
-- headless envelope shape
-- stream transport behavior
-- permission prompts and remembered approvals
-- session durability and cwd-based resume
-- the rhythm of a real coding turn that reads, edits, reruns, and summarizes
+An earlier pass on this machine discovered an older `2.1.89` rewrite-backed shim in `~/.local/share/claude/versions/2.1.89`. That shim is not the right oracle for the shipped native CLI. Do not treat shim-era results as authoritative for released behavior.
 
-Those are all externally visible contracts. They can and should become explicit E2E test targets.
-
-## Evidence boundary
-
-This leaf records only public behavior that was directly observed from the shipped CLI:
+The test set below records only public behavior that was directly observed from the shipped CLI:
 
 - public command-line flags and subcommands
 - public terminal UI flows
-- files the CLI itself wrote into the local user state directory
-- session recovery behavior visible through subsequent CLI invocations
+- files the CLI itself wrote into local user state
+- session recovery behavior visible through later CLI invocations
+- browser-login and subscription boundaries visible from the CLI surface
 
 It should not become a transcript dump. Keep raw logs local, and normalize them here into assertions, scenario shapes, and failure modes.
 
-## Mandatory scenario families
-
-### 1. Discovery and health smoke lane
-
-A rebuild should have a fast lane that exercises the released binary surface before any deeper coding workflow:
-
-- `--help` and `--version` must succeed and expose the current command families
-- auth health should be queryable without opening the TUI, including both machine-readable and human-readable status
-- agent discovery should be externally visible from a top-level command, not only from inside an interactive session
-- auto-mode or policy classification should expose an inspectable effective config, not just hidden defaults
-
-The oracle is not one exact text block. The oracle is that these are real, scriptable health surfaces with stable exit behavior.
-
-### 2. Headless `--print` lane
-
-The released CLI exposed several parity-critical headless behaviors:
-
-- a minimal one-shot prompt path in `--print` mode
-- a cheaper `--bare` posture that suppresses much of the normal startup enrichment
-- a budget cap path where a low `--max-budget-usd` can fail before any useful assistant text arrives
-- a JSON envelope path whose result contains both metadata and a human-readable `result`, not only raw assistant text
-- JSON-schema validation that reports structured output separately from the human result text
-- cwd-local continue behavior, where `-c` or `--continue` can answer questions about the previous turn without manually passing a session ID
-
-Equivalent tests should explicitly cover both success and failure envelopes. A rebuild that only checks plain-text success misses one of the most important public automation surfaces.
-
-### 3. Stream-JSON transport lane
-
-The released CLI's stream mode proved several externally visible rules:
-
-- `--print` plus `--output-format=stream-json` requires `--verbose`
-- the stream starts with a system init event before assistant output
-- a valid user input event can be supplied over stdin as JSON
-- `--replay-user-messages` re-emits the incoming user event on stdout
-- partial assistant events and a final result event are separate concepts
-- the current release may emit extra assistant-side content blocks in verbose stream mode beyond the final plain-text answer
-
-The important rebuild rule is not to hardcode today's exact event inventory. It is to make the wire contract explicit and testable:
-
-- init handshake
-- per-event framing
-- replay behavior
-- partial-versus-final separation
-- predictable error handling for malformed input
-
-### 4. Interactive startup and trust lane
-
-A fresh interactive session in a new local workspace did not drop straight into a plain prompt. It first asked the user whether the folder was trusted, then entered a richer startup dashboard with project identity, tips, and recent-session context.
-
-Equivalent tests should protect:
-
-- first-entry trust gating for an unapproved workspace
-- persistence of that trust decision for later launches
-- a startup dashboard rather than a bare REPL prompt
-- a discoverable shortcut overlay
-- predictable terminal-exit handling, including the observed double-`Ctrl-C` confirmation flow
-
-If a rebuild only tests a plain line editor, it will miss the public startup contract users feel first.
-
-### 5. Permissioned coding lane
-
-The most important real-work oracle was a tiny bugfix session in a temporary git workspace:
-
-- the assistant proposed a shell command to run tests and triggered a permission prompt
-- the failing test output was summarized in the UI instead of dumping the entire log at full height
-- file reads surfaced as small activity summaries
-- an edit proposal rendered as a diff preview and required a separate approval
-- remembered approval was scoped to the specific operation class, not globally to every later tool action
-- after the edit, the assistant reran tests, observed success, and gave a short root-cause explanation
-
-This lane matters more than many synthetic tool-loop tests because it captures the actual rhythm a coding user depends on:
-
-- inspect
-- execute
-- approve
-- edit
-- verify
-- summarize
-
-### 6. Durable artifact and resume lane
-
-The released CLI wrote per-project state under `~/.claude/projects/...`, using a sanitized project-path key. The durable transcript for the interactive session was stored as JSONL and included at least these user-visible event families:
-
-- user turns
-- assistant tool requests
-- tool results
-- edit records with structured patch information
-- final assistant text
-
-Two subtle but important persistence facts also showed up:
-
-- `--no-session-persistence` still created the project-scoped directory and memory folder, even though it did not create a normal session transcript
-- cwd-based continue could recover the last session's context without an explicit session ID
-
-A rebuild should therefore test both positive persistence and negative persistence. "No session persistence" is not the same thing as "zero filesystem side effects."
-
-## Current clean-room gap check
-
-The current Python rebuild already protects several useful local lanes:
-
-- local interactive prompt loop and slash commands
-- prompt history and explicit session resume by ID
-- structured NDJSON request-response control flow
-- scenario goldens for review, init, tool loops, permission probes, and compaction
-- basic interactive approval prompting
+## Machine-specific provider boundary
+
+This machine supports provider-backed local Claude CLI use without a Claude account login, but the working provider state is not ambient: it depends on Claude settings that must be present in the active `HOME`.
+
+Observed machine-specific rules:
+
+- the working provider path was `authMethod: third_party` with `apiProvider: foundry`
+- `claude auth status --text` reported `API provider: Microsoft Foundry` and resource `knowhiz-service-openai-backup-2`
+- a fresh isolated `HOME` without copied Claude settings reported `loggedIn: false`, and headless prompt mode failed with `Not logged in · Please run /login`
+- copying only `~/.claude/settings.json` into the isolated `HOME` restored provider-backed prompt success
+
+The practical reconstruction rule is that provider-backed local testing on this machine must seed the Claude settings layer that carries Foundry configuration. A blank home directory is not enough.
+
+## Coverage legend
+
+- `PASS`: directly exercised and externally visible
+- `PASS (nuanced)`: exercised, but the observed contract has caveats that a rebuild must preserve
+- `PARTIAL`: the flag or command was accepted or partially observed, but its full effect needs external setup not present in this lane
+- `ACCOUNT-BOUND`: Azure or Foundry credentials alone were insufficient; the path clearly required first-party Anthropic account state
+- `NOT MEANINGFUL HERE`: the flag is real, but this machine or session posture could not surface a distinct runtime effect without a different precondition
+
+## Provider-backed local root-flag matrix
+
+The table below is the practical E2E matrix for the `claude --help` root surface as observed on native `2.1.96`.
+
+| Root surface | Status | Observed contract |
+| --- | --- | --- |
+| `--help` | `PASS` | Enumerated the public root flags and command families. |
+| `--version` | `PASS` | Reported `2.1.96 (Claude Code)`. |
+| `-p, --print` | `PASS` | Headless one-shot prompt path worked in both text and JSON envelopes. |
+| `--bare` | `PASS` | Reduced startup surface and still worked with Foundry-backed prompts. |
+| `--model` | `PASS` | `--model sonnet` resolved and returned successful prompts. |
+| `--output-format text` | `PASS` | Returned plain assistant text. |
+| `--output-format json` | `PASS` | Returned a metadata envelope with `type`, `subtype`, `result`, `session_id`, `total_cost_usd`, and usage metadata. |
+| `--json-schema` | `PASS` | Returned success with `structured_output` separated from the human-readable `result`. |
+| `--max-budget-usd` | `PASS` | Low budgets could fail before useful text; both the plain-text error path and JSON `error_max_budget_usd` envelope were observed. |
+| `--output-format stream-json` | `PASS` | In `--print` mode it required `--verbose`; the stream started with a `system/init` event and ended with a final `result` event. |
+| `--verbose` | `PASS` | Enabled streamed event output and surfaced additional assistant-side event blocks. |
+| `--input-format stream-json` | `PASS` | Accepted inbound NDJSON user events and drove a streaming response loop. |
+| `--replay-user-messages` | `PASS` | Re-emitted the inbound user event back to stdout as a replayed user message. |
+| `--include-partial-messages` | `PASS` | Surfaced fine-grained stream events such as `content_block_delta` before the final message. |
+| malformed `--input-format stream-json` input | `PASS` | Invalid JSON lines failed closed with a parse error instead of silent fallback. |
+| `--session-id` | `PASS` | Forced a deterministic session ID for later resume. |
+| `-r, --resume` | `PASS` | Resumed a prior headless session by explicit session ID and preserved earlier context. |
+| `--fork-session` | `PASS` | Created a new session ID while resuming previous context. |
+| `-c, --continue` | `PASS (nuanced)` | In native `2.1.96` print mode it did not reliably resume the prior headless print session on this machine; it created a fresh session in the tested workspaces. A rebuild should not assume older snapshot behavior here. |
+| `--no-session-persistence` | `PASS (nuanced)` | In native `2.1.96` on this machine, no per-project transcript directory for the no-persist workspace was created. This differs from older observed behavior and should be treated as a version-sensitive contract. |
+| `--add-dir` | `PASS` | Reading `../extra/context.txt` outside the workspace was denied without `--add-dir`, then succeeded with `--add-dir ../extra`. |
+| `--system-prompt` | `PASS` | A system-only keyword could be injected and retrieved later in the same one-shot session. |
+| `--append-system-prompt` | `PASS` | An appended system-only keyword could be injected and retrieved later in the same one-shot session. |
+| `--agents` plus `--agent` | `PASS` | A custom JSON-defined agent could be injected and selected for a headless turn. |
+| `--tools` | `PASS` | Tool surface could be constrained to `Read` or `Bash` and the session adapted accordingly. |
+| `--allowed-tools` | `PASS` | A narrowed allow rule such as `Bash(pwd:*)` still executed the intended command successfully. |
+| `--disallowed-tools` | `PASS (nuanced)` | Disallowing `Bash` did not return a hard CLI error; the assistant instead answered in text with a pseudo-command block. Rebuilds should test for this externally visible fallback behavior, not only an internal deny bit. |
+| `--permission-mode bypassPermissions` | `PASS` | Suppressed ordinary tool approval prompts in headless mode for Bash and MCP lanes. |
+| `--settings <path>` | `PASS` | File-backed settings injected environment visible to Bash. |
+| `--settings <inline-json>` | `PASS` | Inline JSON settings injected environment visible to Bash. |
+| `--setting-sources` | `PASS` | `user,project` loaded a project-only environment marker; `user` alone filtered it out. |
+| `--plugin-dir` | `PASS` | Session-only plugin loading worked; its slash command appeared in streamed init output and executed via `/local-plugin:plugin-test`. |
+| `--disable-slash-commands` | `PASS` | Disabled plugin-loaded slash commands; the same plugin command became `Unknown skill`. |
+| `--mcp-config` | `PASS` | Both file-backed and inline JSON MCP config loaded into the live headless tool surface. |
+| `--strict-mcp-config` | `PASS` | Restricted the session to only the explicitly supplied MCP config. |
+| `--debug-file` | `PASS` | Wrote startup and runtime debug logs to the requested file. |
+| `-w, --worktree` | `PASS` | Created a git worktree under `.claude/worktrees/<name>` and executed the turn from that worktree path. |
+| `--tmux` | `PASS (nuanced)` | In a headless non-terminal run it created the worktree, then failed with `open terminal failed: not a terminal`. The pre-terminal worktree side effect is part of the public behavior. |
+| `--ide` | `PARTIAL` | Accepted in headless prompt mode and did not block the turn, but no separate externally visible IDE-attachment effect surfaced in this lane. |
+| `--chrome` | `PARTIAL` | Accepted in headless prompt mode and did not block the turn, but no separate externally visible Chrome-attachment effect surfaced in this lane. |
+| `--no-chrome` | `PASS` | Parsed and ran successfully as the inverse startup toggle. |
+| `--effort low` | `PASS` | Parsed and ran successfully in the provider-backed headless lane. |
+| `-n, --name` | `PASS` | Parsed and ran successfully as a session display-name override. |
+| `--mcp-debug` | `PASS` | Deprecated alias still parsed and ran in the provider-backed headless lane. |
+| `--brief` | `PASS` | Parsed and ran in headless prompt mode, though this lane did not surface a distinct `SendUserMessage` interaction. |
+
+## Root flags that exist but were not fully meaningful in this provider-backed lane
+
+| Root surface | Status | Why it was not a complete local provider-backed oracle on this machine |
+| --- | --- | --- |
+| `--allow-dangerously-skip-permissions` | `NOT MEANINGFUL HERE` | The externally visible permission-bypass contract was already exercised through `--permission-mode bypassPermissions`. |
+| `--dangerously-skip-permissions` | `NOT MEANINGFUL HERE` | Same as above; no distinct E2E contract beyond the bypass posture already observed. |
+| `-d, --debug` | `NOT MEANINGFUL HERE` | `--debug-file` already exercised the public debug-log side effect without flooding the terminal transcript. |
+| `--fallback-model` | `NOT MEANINGFUL HERE` | Only matters under model overload or failure; that condition was not naturally present during this run. |
+| `--file` | `NOT MEANINGFUL HERE` | Depends on prior first-party file-upload state and `file_id` handles, which were not part of this Foundry-backed local lane. |
+| `--from-pr` | `NOT MEANINGFUL HERE` | Depends on PR-linked session metadata rather than a purely local provider-backed workspace. |
+| `--include-hook-events` | `NOT MEANINGFUL HERE` | Hook streaming is only meaningful when hooks are configured and firing. |
+| `--betas` | `NOT MEANINGFUL HERE` | The help text explicitly scopes this to API-key users; this machine's authoritative lane was Foundry third-party auth. |
+| `--remote-control-session-name-prefix` | `ACCOUNT-BOUND` | Only meaningful when Remote Control itself is available, which it was not under Azure-only credentials. |
+
+## Provider-backed local command-family matrix
+
+| Command family | Status | Observed contract |
+| --- | --- | --- |
+| `claude auth status --json` | `PASS` | Returned `loggedIn: true`, `authMethod: third_party`, `apiProvider: foundry`. |
+| `claude auth status --text` | `PASS` | Returned Microsoft Foundry human-readable status. |
+| `claude auth logout` | `PASS (nuanced)` | Logged out of Anthropic account state, but did not disable the working Foundry third-party provider path; `auth status` still returned `third_party/foundry`. |
+| `claude agents` | `PASS` | Listed built-in agents from a top-level command. |
+| `claude auto-mode config` | `PASS` | Returned effective JSON config. |
+| `claude auto-mode defaults` | `PASS` | Returned the default allow, soft-deny, and environment rule set. |
+| `claude auto-mode critique` | `PASS` | Returned the explicit empty-state message when no custom rules existed. |
+| `claude doctor` in non-TTY | `PASS (nuanced)` | Failed closed with Ink raw-mode errors, showing that this is an interactive UI command rather than a plain pipe-friendly one. |
+| `claude doctor` in TTY | `PASS` | Returned diagnostics including running version, stable version, latest version, PATH warning, and keychain warning. |
+| `claude install stable` | `PASS (nuanced)` | In an isolated `HOME`, installed a native build and launcher successfully, but the installed stable version was `2.1.89`, not the `2.1.96` of the actively used binary. |
+| `claude update` | `PASS (nuanced)` | In the isolated install root, reported `2.1.89` as up to date on the stable channel. |
+| `claude plugin validate` | `PASS` | Validated local plugin manifests and returned warnings without failing. |
+| `claude plugin marketplace add/list/update/remove` | `PASS` | Worked end to end against a local git-backed marketplace repo. |
+| `claude plugin install/list/disable/enable/uninstall` | `PASS` | Worked end to end against a local plugin from that marketplace. |
+| `claude plugin update` | `PASS (nuanced)` | Required the full plugin ID `plugin@marketplace`. After the update it reported that a restart was required, and `plugin list --json` still showed the old version, yet a new session executed the updated plugin command content. |
+| `claude mcp add/list/get/remove/reset-project-choices` | `PASS` | Worked for local stdio servers and project or user scope. |
+| `claude mcp add-json` | `PASS` | Added a stdio server from inline JSON. |
+| `claude mcp add-from-claude-desktop` in non-TTY | `PASS (nuanced)` | Failed with Ink raw-mode requirements, showing it is interactive UI rather than a plain pipe command. |
+| `claude mcp add-from-claude-desktop` in TTY | `PASS` | Imported a synthetic Claude Desktop server after an interactive checklist UI. |
+| `claude mcp serve` | `PASS` | Exposed Claude Code's own MCP server surface and returned a large tool catalog over JSON-RPC. |
+| MCP tool use inside headless session | `PASS (nuanced)` | MCP tools loaded correctly, but by default their use still triggered permission denials; the same tool call passed under `--permission-mode bypassPermissions`. |
+
+## Interactive startup and coding lane
+
+The interactive REPL still matters more than many synthetic headless loops because it exercises the onboarding, trust, permission-prompt, and diff-approval surfaces that a real coding user actually sees.
+
+### Fresh-start startup contract
+
+Observed on a fresh isolated `HOME`:
+
+- the first launch did not go straight to a prompt
+- it first showed theme selection onboarding
+- it then showed security notes
+- it then showed the workspace trust gate with a binary trust or exit choice
+- after trust, it entered the richer welcome dashboard with project identity, tips, and shortcut affordances
+
+Observed on the second launch in the same trusted workspace:
+
+- the trust gate was skipped
+- startup went straight to the welcome dashboard
+
+That makes trust persistence and first-run onboarding explicit E2E surfaces, not optional polish.
+
+### Real coding turn contract
+
+In a tiny temporary git workspace containing a broken Python function and test:
+
+- Claude first proposed `npm test`, which triggered a Bash permission prompt
+- the resulting `ENOENT` failure was summarized in a compact UI block instead of dumping full scrollback
+- it searched the workspace and pivoted to `python -m pytest`, which triggered another permission prompt
+- the failing Python assertion was summarized in the UI
+- it read `mathlib.py`
+- it proposed an edit and rendered a structured diff approval dialog
+- after approval, it reran tests and summarized the passing result
+- it ended with a one-sentence root-cause explanation
+
+One subtle behavior was especially important:
+
+- selecting `Yes, and don't ask again for: python:*` on the first Python test run did **not** suppress the later Python rerun prompt in the same session
+
+A rebuild should therefore test the remembered-approval scope exactly, not just whether approvals exist in principle.
+
+### Interactive transcript artifact shape
+
+The resulting interactive transcript was stored under `~/.claude/projects/<sanitized-path>/...jsonl`.
+
+The observed `2.1.96` transcript for the coding lane included these top-level record families:
+
+- `assistant`
+- `user`
+- `attachment`
+- `file-history-snapshot`
+- `permission-mode`
+- `last-prompt`
+
+Structured `toolUseResult` payloads inside `user` records captured at least these public result shapes:
+
+- Bash stdout and stderr for command runs
+- workspace file listings from search or glob-style actions
+- the full text of test failure output
+- structured file reads with `filePath`, `content`, and line metadata
+- edit records with `structuredPatch`, `oldString`, `newString`, and `userModified`
+
+That stored-shape contract matters for replay and parity tests even if a rebuild uses a different internal implementation.
+
+## Account-bound and first-party-only matrix
+
+This leaf was explicitly asked to separate provider-backed local flows from flows that Azure credentials alone do not unlock.
 
-That is a strong local foundation. It is not yet the same thing as released-CLI E2E parity.
+The matrix below records the observed boundary.
 
-### Capability gaps still visible
+| Surface | Status | Exact observed boundary |
+| --- | --- | --- |
+| ordinary headless prompts via Foundry | `PASS` | Worked with Azure-backed Foundry settings and no Claude account login. |
+| `claude auth status` | `PASS` | Correctly reflected the third-party provider path. |
+| `claude auth login` | `ACCOUNT-BOUND` | Opened a browser flow against `https://claude.com/...`; Azure credentials alone did not satisfy it. |
+| `claude auth login --console` | `ACCOUNT-BOUND` | Opened a browser flow against `https://platform.claude.com/...`; still required first-party Anthropic account login, not Azure-only state. |
+| `claude setup-token` | `ACCOUNT-BOUND` | Warned that environment or helper auth already existed, then still opened a browser sign-in flow and prompted for an OAuth code. Azure credentials alone were insufficient. |
+| `claude remote-control` | `ACCOUNT-BOUND` | Failed immediately with `You must be logged in to use Remote Control` and explicitly said the feature is only available with `claude.ai` subscriptions. |
+| `claude auth logout` | `PASS (nuanced)` | Only removed Anthropic account state; it did not log out the third-party Foundry provider path. |
 
-The shipped CLI behaviors above imply product areas that the current rebuild does not yet expose as first-class runtime surfaces:
+The reconstruction rule is straightforward:
 
-- trust gate and trust persistence
-- startup dashboard and richer terminal startup shell
-- public auth, doctor, plugin, MCP, install, update, and auto-mode command families
-- released-style headless `--print` flag matrix and envelopes
-- public cwd-based `--continue` / `--resume` entrypoint routing
-- richer approval dialogs with remembered decisions scoped by tool or command class
+- Azure or Foundry credentials are enough for local prompt, tool, plugin, MCP, and ordinary REPL flows on this machine
+- Azure or Foundry credentials are **not** enough for first-party browser-login flows, long-lived OAuth token setup, or Remote Control subscription flows
 
-Those are implementation gaps, not merely missing assertions.
+## Cases a reconstruction still misses if it stops at the earlier local test set
 
-### Test gaps even where adjacent features already exist
+The current clean-room rewrite and earlier oracle leaves already cover useful structured and local harness lanes. They still miss several released-CLI cases that this native `2.1.96` run made concrete.
 
-Even in the areas the rebuild has started, the following cases remain under-protected relative to the released CLI:
+These are the highest-value missing or version-sensitive cases to add:
 
-- headless budget-cap failure envelopes
-- successful schema validation in the public CLI envelope shape, not only in an internal structured server response
-- `--output-format=stream-json` gating on `--verbose`
-- replayed user-message echoes and partial assistant events in stream mode
-- negative persistence semantics for `--no-session-persistence`
-- cwd-based continue after a prior coding turn
-- remembered approval scopes for shell commands versus edit approvals
-- deny-path behavior for shell or edit approvals in the same real-work session
-- durable session-artifact assertions that inspect the stored transcript shape, not only in-memory summaries
-- non-TTY behavior for commands that actually require raw terminal capabilities
+- first-run theme onboarding before trust acceptance
+- trust persistence causing the second interactive launch to skip the trust gate
+- provider-backed local auth depending on Claude settings bootstrap, not only shell environment
+- native-versus-stable maintenance drift: active native binary `2.1.96`, isolated stable install `2.1.89`, doctor-reported latest `2.1.98`
+- non-TTY failure behavior for `doctor` and `mcp add-from-claude-desktop`
+- stream-json init, replay, partial deltas, and malformed-input failure paths
+- explicit JSON `error_max_budget_usd` envelopes
+- `--continue` behavior drift in native `2.1.96` headless print mode
+- `--no-session-persistence` drift in native `2.1.96` on this machine
+- `--disable-slash-commands` disabling plugin-loaded slash commands
+- `--worktree` side effects and `--tmux` failing only after the worktree already exists
+- MCP strict-config loading plus permission-denied tool calls
+- plugin update requiring restart, full `plugin@marketplace` addressing, and post-update version-report drift
+- remembered-approval scope not suppressing a later Python rerun prompt even after choosing `don't ask again for: python:*`
+- account-bound browser and subscription flows being distinct from provider-backed local prompts
 
-These should become explicit parity tests before claiming a rebuild feels end-to-end correct.
+If the rebuild test suite does not assert these, it is still missing released-CLI parity, even if its internal golden tests look healthy.
 
 ## Reconstruction rule
 
 Use this leaf as the public-runtime complement to the source-derived verification leaves:
 
 - source-derived leaves explain what the product architecture must preserve
-- this leaf explains what a shipped build actually feels like when exercised end to end
+- this leaf explains how a shipped build actually behaves when exercised end to end on a real machine
 
 A faithful rebuild should keep both. If they disagree, prefer the narrower claim and investigate. Public-runtime behavior is excellent for E2E oracles, but it does not by itself reveal why the product was built that way.
 
 ## Failure modes
 
 - **headless false confidence**: the rebuild passes local JSON tests but its public envelope and error modes do not match a released CLI
-- **transport simplification drift**: stream mode works only for a request-response lab client and not for real event-stream consumers
-- **startup flattening**: the rebuild starts in a bare prompt and skips trust, dashboard, or session-restoration cues users rely on
-- **permission amnesia**: approvals exist, but remembered scopes and deny behavior do not match real coding sessions
-- **resume illusion**: explicit session IDs work, but cwd-local continuation and durable transcript recovery do not
-- **negative-path blind spot**: low-budget, malformed-input, non-TTY, and no-persistence behaviors are untested even though real users hit them first
+- **provider bootstrap blind spot**: the rebuild assumes provider auth is ambient and misses the settings-layer bootstrap required on real machines
+- **startup flattening**: the rebuild starts in a bare prompt and skips theme onboarding, trust, or dashboard cues users actually see
+- **maintenance-lane drift**: install, update, and doctor behaviors are untested, so version-channel mismatches go unnoticed
+- **permission-memory drift**: approvals exist, but the remembered-scope behavior does not match a real coding session
+- **integration-toggle overclaim**: flags like `--tmux`, `--ide`, `--chrome`, `--plugin-dir`, or `--mcp-config` are parsed but their real side effects are not asserted
+- **account-bound confusion**: Azure-backed local success is mistakenly treated as proof that Remote Control, setup-token, or browser login flows are covered