garrytan · Mar 14, 2026
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -189,15 +189,15 @@ Three reasons:
 2. **CI can validate freshness.** `gen:skill-docs --dry-run` + `git diff --exit-code` catches stale docs before merge.
 3. **Git blame works.** You can see when a command was added and in which commit.
 
-### Test tiers
+### Template test tiers
 
 | Tier | What | Cost | Speed |
 |------|------|------|-------|
 | 1 — Static validation | Parse every `$B` command in SKILL.md, validate against registry | Free | <2s |
-| 2 — E2E via Agent SDK | Spawn real Claude session, run `/qa`, check for errors | ~$0.50 | ~60s |
-| 3 — LLM-as-judge | Haiku scores docs on clarity/completeness/actionability | ~$0.03 | ~10s |
+| 2 — E2E via `claude -p` | Spawn real Claude session, run each skill, check for errors | ~$3.85 | ~20min |
+| 3 — LLM-as-judge | Sonnet scores docs on clarity/completeness/actionability | ~$0.15 | ~30s |
 
-Tier 1 runs on every `bun test`. Tier 2 and 3 are gated behind env vars. The idea is: catch 95% of issues for free, use LLMs only for the judgment calls.
+Tier 1 runs on every `bun test`. Tiers 2+3 are gated behind `EVALS=1`. The idea is: catch 95% of issues for free, use LLMs only for judgment calls.
 
 ## Command dispatch
 
@@ -231,6 +231,88 @@ Playwright's native errors are rewritten through `wrapError()` to strip internal
 
 The server doesn't try to self-heal. If Chromium crashes (`browser.on('disconnected')`), the server exits immediately. The CLI detects the dead server on the next command and auto-restarts. This is simpler and more reliable than trying to reconnect to a half-dead browser process.
 
+## E2E test infrastructure
+
+### Session runner (`test/helpers/session-runner.ts`)
+
+E2E tests spawn `claude -p` as a completely independent subprocess — not via the Agent SDK, which can't nest inside Claude Code sessions. The runner:
+
+1. Writes the prompt to a temp file (avoids shell escaping issues)
+2. Spawns `sh -c 'cat prompt | claude -p --output-format stream-json --verbose'`
+3. Streams NDJSON from stdout for real-time progress
+4. Races against a configurable timeout
+5. Parses the full NDJSON transcript into structured results
+
+The `parseNDJSON()` function is pure — no I/O, no side effects — making it independently testable.
+
+### Observability data flow
+
+```
+  skill-e2e.test.ts
+        │
+        │ generates runId, passes testName + runId to each call
+        │
+  ┌─────┼──────────────────────────────┐
+  │     │                              │
+  │  runSkillTest()              evalCollector
+  │  (session-runner.ts)         (eval-store.ts)
+  │     │                              │
+  │  per tool call:              per addTest():
+  │  ┌──┼──────────┐              savePartial()
+  │  │  │          │                   │
+  │  ▼  ▼          ▼                   ▼
+  │ [HB] [PL]    [NJ]          _partial-e2e.json
+  │  │    │        │             (atomic overwrite)
+  │  │    │        │
+  │  ▼    ▼        ▼
+  │ e2e-  prog-  {name}
+  │ live  ress   .ndjson
+  │ .json .log
+  │
+  │  on failure:
+  │  {name}-failure.json
+  │
+  │  ALL files in ~/.gstack-dev/
+  │  Run dir: e2e-runs/{runId}/
+  │
+  │         eval-watch.ts
+  │              │
+  │        ┌─────┴─────┐
+  │     read HB     read partial
+  │        └─────┬─────┘
+  │              ▼
+  │        render dashboard
+  │        (stale >10min? warn)
+```
+
+**Split ownership:** session-runner owns the heartbeat (current test state), eval-store owns partial results (completed test state). The watcher reads both. Neither component knows about the other — they share data only through the filesystem.
+
+**Non-fatal everything:** All observability I/O is wrapped in try/catch. A write failure never causes a test to fail. The tests themselves are the source of truth; observability is best-effort.
+
+**Machine-readable diagnostics:** Each test result includes `exit_reason` (success, timeout, error_max_turns, error_api, exit_code_N), `timeout_at_turn`, and `last_tool_call`. This enables `jq` queries like:
+```bash
+jq '.tests[] | select(.exit_reason == "timeout") | .last_tool_call' ~/.gstack-dev/evals/_partial-e2e.json
+```
+
+### Eval persistence (`test/helpers/eval-store.ts`)
+
+The `EvalCollector` accumulates test results and writes them in two ways:
+
+1. **Incremental:** `savePartial()` writes `_partial-e2e.json` after each test (atomic: write `.tmp`, `fs.renameSync`). Survives kills.
+2. **Final:** `finalize()` writes a timestamped eval file (e.g. `e2e-20260314-143022.json`). The partial file is never cleaned up — it persists alongside the final file for observability.
+
+`eval:compare` diffs two eval runs. `eval:summary` aggregates stats across all runs in `~/.gstack-dev/evals/`.
+
+### Test tiers
+
+| Tier | What | Cost | Speed |
+|------|------|------|-------|
+| 1 — Static validation | Parse `$B` commands, validate against registry, observability unit tests | Free | <5s |
+| 2 — E2E via `claude -p` | Spawn real Claude session, run each skill, scan for errors | ~$3.85 | ~20min |
+| 3 — LLM-as-judge | Sonnet scores docs on clarity/completeness/actionability | ~$0.15 | ~30s |
+
+Tier 1 runs on every `bun test`. Tiers 2+3 are gated behind `EVALS=1`. The idea: catch 95% of issues for free, use LLMs only for judgment calls and integration testing.
+
 ## What's intentionally not here
 
 - **No WebSocket streaming.** HTTP request/response is simpler, debuggable with curl, and fast enough. Streaming would add complexity for marginal benefit.
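
The new ARCHITECTURE.md text describes `parseNDJSON()` as a pure function, but its body is not part of this diff. The following is a minimal sketch of what such a parser could look like, not the project's implementation. The `StreamEvent` and `ParsedTranscript` shapes, the `"result"` event type, and the `"unknown"` fallback are assumptions made for illustration; only `subtype`, `is_error`, and the `error_api` classification come from the diff itself.

```typescript
// Hypothetical event shape: only `subtype` and `is_error` are named in the diff.
type StreamEvent = {
  type?: string;
  subtype?: string;
  is_error?: boolean;
  [key: string]: unknown;
};

// Hypothetical result shape, for illustration only.
interface ParsedTranscript {
  events: StreamEvent[];
  skippedLines: number;
  exitReason: string;
}

// Pure: takes lines in, returns data out. No file or network access.
function parseNDJSON(lines: string[]): ParsedTranscript {
  const events: StreamEvent[] = [];
  let skippedLines = 0;

  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines are not counted as errors
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
        events.push(parsed as StreamEvent);
      } else {
        skippedLines++; // valid JSON, but not an event object
      }
    } catch {
      skippedLines++; // malformed line: count it and keep going
    }
  }

  // Assumption: the final outcome arrives as an event with type "result".
  const result = [...events].reverse().find((e) => e.type === "result");
  let exitReason = "unknown";
  if (result) {
    // A "success" subtype with is_error set is treated as an API error.
    if (result.is_error === true) exitReason = "error_api";
    else exitReason = String(result.subtype ?? "unknown");
  }
  return { events, skippedLines, exitReason };
}

const sample = parseNDJSON([
  '{"type":"result","subtype":"success","is_error":true}',
  "not json",
  "",
]);
console.log(sample.exitReason, sample.skippedLines);
```

Because the function never touches the filesystem, it can be unit-tested in Tier 1 with string fixtures alone.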

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,55 +1,38 @@
 # Changelog
 
-## 0.3.4 — 2026-03-13
+## 0.3.6 — 2026-03-14
 
 ### Added
-- **Daily update check** — all 9 skills now check for new versions once per day via `bin/gstack-update-check` (pure bash, <5ms cached). Prompts user via AskUserQuestion with option to upgrade or defer 24h.
-- **`/gstack-upgrade` skill** — standalone upgrade command that detects install type (global-git, local-git, vendored), upgrades, and shows a "What's New" summary from CHANGELOG
-- **"Just upgraded" confirmation** — after upgrading, the next skill invocation shows "Running gstack v{new} (just updated!)" via `~/.gstack/just-upgraded-from` marker
-- **`AskUserQuestion` added to 5 skills** — gstack (root), browse, qa, retro, setup-browser-cookies now have AskUserQuestion in allowed-tools for upgrade prompts
-- **`Bash` added to plan-eng-review** — enables the update check preamble to run in plan review sessions
-- `browse/test/gstack-update-check.test.ts` — 10 test cases covering all script branch paths with `GSTACK_REMOTE_URL` env var for test isolation
-- `TODOS.md` for tracking deferred work
-
-### Changed
-- **Version check is now one system** — removed SHA-based `checkVersion()` from `browse/src/find-browse.ts` (~120 lines deleted) and `browse/test/find-browse.test.ts` (~100 lines deleted). Replaced by `bin/gstack-update-check` bash script using semver VERSION comparison with 24h cache.
-- Simplified `qa/SKILL.md` and `setup-browser-cookies/SKILL.md` setup blocks — removed old `BROWSE_OUTPUT`/`META` parsing, now use simple `find-browse` call
-- Updated `browse/bin/find-browse` shim comments to reflect simplified role (binary locator only)
-
-### Removed
-- `checkVersion()`, `readCache()`, `writeCache()`, `fetchRemoteSHA()`, `resolveSkillDir()`, `CacheEntry` interface from `browse/src/find-browse.ts`
-- `META:UPDATE_AVAILABLE` protocol from find-browse output
-- Old META-based upgrade instructions from qa and setup-browser-cookies SKILL.md files
-- Legacy `/tmp/gstack-latest-version` cache file (cleaned up by `setup` script)
-
-## 0.3.5 — 2026-03-14
+- **E2E observability** — heartbeat file (`~/.gstack-dev/e2e-live.json`), per-run log directory (`~/.gstack-dev/e2e-runs/{runId}/`), progress.log, per-test NDJSON transcripts, persistent failure transcripts. All I/O non-fatal.
+- **`bun run eval:watch`** — live terminal dashboard reads heartbeat + partial eval file every 1s. Shows completed tests, current test with turn/tool info, stale detection (>10min), `--tail` for progress.log.
+- **Incremental eval saves** — `savePartial()` writes `_partial-e2e.json` after each test completes. Crash-resilient: partial results survive killed runs. Never cleaned up.
+- **Machine-readable diagnostics** — `exit_reason`, `timeout_at_turn`, `last_tool_call` fields in eval JSON. Enables `jq` queries for automated fix loops.
+- **API connectivity pre-check** — E2E suite throws immediately on ConnectionRefused before burning test budget.
+- **`is_error` detection** — `claude -p` can return `subtype: "success"` with `is_error: true` on API failures. Now correctly classified as `error_api`.
+- **Stream-json NDJSON parser** — `parseNDJSON()` pure function for real-time E2E progress from `claude -p --output-format stream-json --verbose`.
+- **Eval persistence** — results saved to `~/.gstack-dev/evals/` with auto-comparison against previous run.
+- **Eval CLI tools** — `eval:list`, `eval:compare`, `eval:summary` for inspecting eval history.
+- **All 9 skills converted to `.tmpl` templates** — the remaining five (plan-ceo-review, plan-eng-review, retro, review, ship) now use the `{{UPDATE_CHECK}}` placeholder. Single source of truth for the update check preamble.
+- **3-tier eval suite** — Tier 1: static validation (free), Tier 2: E2E via `claude -p` (~$3.85/run), Tier 3: LLM-as-judge (~$0.15/run). Gated by `EVALS=1`.
+- **Planted-bug outcome testing** — eval fixtures with known bugs, LLM judge scores detection.
+- 15 observability unit tests covering heartbeat schema, progress.log format, NDJSON naming, savePartial, finalize, watcher rendering, stale detection, non-fatal I/O.
+- E2E tests for plan-ceo-review, plan-eng-review, retro skills.
+- Update-check exit code regression tests.
+- `test/helpers/skill-parser.ts` — `getRemoteSlug()` for git remote detection.
 
 ### Fixed
-- **Browse binary discovery broken for agents** — replaced `find-browse` indirection with explicit `browse/dist/browse` path in SKILL.md setup blocks. Agents were guessing `bin/browse` (wrong) instead of running `find-browse` to discover `browse/dist/browse` (correct).
-- **Update check exit code 1 misleading agents** — `[ -n "$_UPD" ] && echo "$_UPD"` returned exit code 1 when no update available, causing agents to think gstack was broken. Added `|| true`.
-- **browse/SKILL.md missing setup block** — `/browse` used `$B` in every example but never defined it. Added `{{BROWSE_SETUP}}` placeholder.
+- **Browse binary discovery broken for agents** — replaced `find-browse` indirection with explicit `browse/dist/browse` path in SKILL.md setup blocks.
+- **Update check exit code 1 misleading agents** — added `|| true` to prevent non-zero exit when no update available.
+- **browse/SKILL.md missing setup block** — added `{{BROWSE_SETUP}}` placeholder.
+- **plan-ceo-review timeout** — init git repo in test dir, skip codebase exploration, bump timeout to 420s.
+- Planted-bug eval reliability — simplified prompts, lowered detection baselines, resilient to max_turns flakes.
 
 ### Changed
-- Enriched 14 command descriptions with specific arg formats, valid values, error behavior, and return types
-- Fixed `header` usage from `<name> <value>` to `<name>:<value>` (matching actual implementation)
-- Added `cookie` usage syntax: `cookie <name>=<value>`
-- **Template system expanded** — added `{{UPDATE_CHECK}}` and `{{BROWSE_SETUP}}` placeholders to `gen-skill-docs.ts`. Converted `qa/SKILL.md` and `setup-browser-cookies/SKILL.md` to `.tmpl` templates. All 4 browse-using skills now generate from a single source of truth.
-- Setup block now checks workspace-local path first (for development), then falls back to global `~/.claude/skills/gstack/browse/dist/browse`
-
-### Added
-- 3 new e2e test cases for SKILL.md setup flow: happy path, NEEDS_SETUP, non-git-repo
-- LLM eval for setup block clarity (actionability + clarity >= 4)
-- `no such file or directory.*browse` error pattern in session-runner
-- TODO: convert remaining 5 non-browse skills to .tmpl files
-- Enriched 4 snapshot flag descriptions with defaults, output paths, and behavior details
-- Snapshot flags section now shows long flag names (`-i / --interactive`) alongside short
-- Added ref numbering explanation and output format example to snapshot docs
-- Replaced hand-maintained server.ts help text with auto-generated `generateHelpText()` from COMMAND_DESCRIPTIONS
-- Upgraded LLM eval judge from Haiku to Sonnet 4.6 for more stable scoring
-
-### Added
-- Usage string consistency test: cross-checks `Usage:` patterns in implementation against COMMAND_DESCRIPTIONS
-- Pipe guard test: ensures no command description contains `|` (would break markdown tables)
+- **Template system expanded** — `{{UPDATE_CHECK}}` and `{{BROWSE_SETUP}}` placeholders in `gen-skill-docs.ts`. All browse-using skills generate from single source of truth.
+- Enriched 14 command descriptions with specific arg formats, valid values, error behavior, and return types.
+- Setup block checks workspace-local path first (for development), falls back to global install.
+- LLM eval judge upgraded from Haiku to Sonnet 4.6.
+- Help text is now generated by `generateHelpText()` from COMMAND_DESCRIPTIONS (replaces the hand-maintained help text in server.ts).
 
 ## 0.3.3 — 2026-03-13
 

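The changelog entry for incremental eval saves relies on the write-to-`.tmp`-then-rename pattern mentioned in ARCHITECTURE.md. Below is a minimal sketch of that pattern, assuming Node-compatible `fs` APIs under Bun. `writeJsonAtomic` is a hypothetical helper name, not the project's `savePartial()`, and the temp directory stands in for `~/.gstack-dev/`.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper illustrating the pattern; not the real savePartial().
// Returns false instead of throwing, so a failed write stays non-fatal.
function writeJsonAtomic(targetPath: string, data: unknown): boolean {
  const tmpPath = `${targetPath}.tmp`;
  try {
    fs.mkdirSync(path.dirname(targetPath), { recursive: true });
    // Write the full payload to a sibling temp file first.
    fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2));
    // On POSIX, a rename within one filesystem replaces the target atomically,
    // so a reader sees either the old file or the new one, never a partial write.
    fs.renameSync(tmpPath, targetPath);
    return true;
  } catch {
    return false;
  }
}

// Usage: overwrite the partial file after each completed test.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "evals-"));
const partialPath = path.join(dir, "_partial-e2e.json");
writeJsonAtomic(partialPath, { tests: [{ name: "qa", exit_reason: "success" }] });
writeJsonAtomic(partialPath, {
  tests: [
    { name: "qa", exit_reason: "success" },
    { name: "retro", exit_reason: "timeout" },
  ],
});
```

If the process is killed between the two calls, the file on disk still holds the first, complete snapshot, which is what lets a watcher read partial results safely.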
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -4,16 +4,23 @@
 
 ```bash
 bun install          # install dependencies
-bun test             # run tests (browse + snapshot + skill validation)
-bun run test:eval    # run LLM-as-judge evals (needs ANTHROPIC_API_KEY)
-bun run test:e2e     # run E2E skill tests (needs SKILL_E2E=1, ~$0.50/run)
+bun test             # run free tests (browse + snapshot + skill validation)
+bun run test:evals   # run paid evals: LLM judge + E2E (~$4/run)
+bun run test:e2e     # run E2E tests only (~$3.85/run)
 bun run dev <cmd>    # run CLI in dev mode, e.g. bun run dev goto https://example.com
 bun run build        # gen docs + compile binaries
 bun run gen:skill-docs  # regenerate SKILL.md files from templates
 bun run skill:check  # health dashboard for all skills
 bun run dev:skill    # watch mode: auto-regen + validate on change
+bun run eval:list    # list all eval runs from ~/.gstack-dev/evals/
+bun run eval:compare # compare two eval runs (auto-picks most recent)
+bun run eval:summary # aggregate stats across all eval runs
 ```
 
+`test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time
+(tool-by-tool via `--output-format stream-json --verbose`). Results are persisted
+to `~/.gstack-dev/evals/` with auto-comparison against the previous run.
+
 ## Project structure
 
 ```
@@ -29,11 +36,12 @@ gstack/
 │   ├── skill-check.ts     # Health dashboard
 │   └── dev-skill.ts       # Watch mode
 ├── test/            # Skill validation + eval tests
-│   ├── helpers/     # skill-parser.ts, session-runner.ts
-│   ├── skill-validation.test.ts  # Tier 1: static command validation
-│   ├── gen-skill-docs.test.ts    # Tier 1: generator + quality evals
-│   ├── skill-e2e.test.ts         # Tier 2: Agent SDK E2E
-│   └── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge
+│   ├── helpers/     # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts
+│   ├── fixtures/    # Ground truth JSON, planted-bug fixtures, eval baselines
+│   ├── skill-validation.test.ts  # Tier 1: static validation (free, <1s)
+│   ├── gen-skill-docs.test.ts    # Tier 1: generator quality (free, <1s)
+│   ├── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge (~$0.15/run)
+│   └── skill-e2e.test.ts         # Tier 2: E2E via claude -p (~$3.85/run)
 ├── ship/            # Ship workflow skill
 ├── review/          # PR review skill
 ├── plan-ceo-review/ # /plan-ceo-review skill