Merged
33 commits
02f0ca6
chore: regenerate SKILL.md from template
garrytan Mar 14, 2026
ff5cbbb
feat: add remote slug helper and auto-gitignore for .gstack/
garrytan Mar 14, 2026
e04ad1b
feat: QA test plan tiers with per-page risk scoring
garrytan Mar 14, 2026
e377ba2
feat: dual greptile-history paths (per-project + global)
garrytan Mar 14, 2026
5155fe3
Merge remote-tracking branch 'origin/main' into v0.3.5-qa-upgrades
garrytan Mar 14, 2026
76803d7
feat: 3-tier eval suite with planted-bug outcome testing (EVALS=1)
garrytan Mar 14, 2026
b5b2a15
fix: pass all LLM evals — severity defs, rubric edge cases, EVALS=1 flag
garrytan Mar 14, 2026
942df42
simplify: one command for evals — bun run test:evals
garrytan Mar 14, 2026
c35e933
fix: rewrite session-runner to claude -p subprocess, lower flaky base…
garrytan Mar 14, 2026
3d750d8
Merge remote-tracking branch 'origin/main' into v0.3.6-qa-upgrades
garrytan Mar 14, 2026
e7347c2
feat: stream-json NDJSON parser for real-time E2E progress
garrytan Mar 14, 2026
84f52f3
feat: eval persistence with auto-compare against previous run
garrytan Mar 14, 2026
ed802d0
feat: eval CLI tools + docs cleanup
garrytan Mar 14, 2026
a67dae5
fix: update check preamble exits 1 when up to date — convert all skil…
garrytan Mar 14, 2026
4063104
fix: remove false-positive Exit code 1 pattern, fix NEEDS_SETUP test,…
garrytan Mar 14, 2026
2e75c33
fix: lower planted-bug detection baselines and LLM judge thresholds f…
garrytan Mar 14, 2026
4a56b88
fix: make planted-bug evals resilient to max_turns and browse error f…
garrytan Mar 14, 2026
cddf8ee
fix: simplify planted-bug eval prompts for reliable 25-turn completion
garrytan Mar 14, 2026
c6c3294
fix: 100% E2E pass — isolate test dirs, restart server, relax FP thre…
garrytan Mar 14, 2026
2d88f5f
test: add update-check exit code regression tests
garrytan Mar 14, 2026
f1ee3d9
feat: template-ify all skills + E2E tests for plan-ceo-review, plan-e…
garrytan Mar 14, 2026
7d5036d
fix: increase timeouts for plan-review and retro E2E tests
garrytan Mar 14, 2026
eb9a919
fix: plan-ceo-review timeout — init git repo, skip codebase explorati…
garrytan Mar 14, 2026
f9cfabe
feat: add E2E observability — heartbeat, progress.log, NDJSON persist…
garrytan Mar 14, 2026
510a8d8
feat: wire runId + testName + diagnostics through all E2E tests
garrytan Mar 14, 2026
029a7c2
feat: eval-watch dashboard + observability unit tests (15 tests, 11 c…
garrytan Mar 14, 2026
336dbaa
fix: detect is_error from claude -p result line (ConnectionRefused wa…
garrytan Mar 14, 2026
5aae3ce
fix: never clean up observability artifacts — partial file persists a…
garrytan Mar 14, 2026
9f5aa32
fix: fail fast on API connectivity — pre-check before E2E suite
garrytan Mar 14, 2026
4ace0c2
chore: bump version and changelog (v0.3.6)
garrytan Mar 14, 2026
43fbe16
docs: update README, CONTRIBUTING, ARCHITECTURE for v0.3.6
garrytan Mar 14, 2026
4e31acb
fix: auto-clear stale heartbeat when process is dead
garrytan Mar 14, 2026
baf8acd
fix: update check ignores stale UP_TO_DATE cache after version change
garrytan Mar 14, 2026
90 changes: 86 additions & 4 deletions ARCHITECTURE.md
@@ -189,15 +189,15 @@ Three reasons:
2. **CI can validate freshness.** `gen:skill-docs --dry-run` + `git diff --exit-code` catches stale docs before merge.
3. **Git blame works.** You can see when a command was added and in which commit.

### Test tiers
### Template test tiers

| Tier | What | Cost | Speed |
|------|------|------|-------|
| 1 — Static validation | Parse every `$B` command in SKILL.md, validate against registry | Free | <2s |
| 2 — E2E via Agent SDK | Spawn real Claude session, run `/qa`, check for errors | ~$0.50 | ~60s |
| 3 — LLM-as-judge | Haiku scores docs on clarity/completeness/actionability | ~$0.03 | ~10s |
| 2 — E2E via `claude -p` | Spawn real Claude session, run each skill, check for errors | ~$3.85 | ~20min |
| 3 — LLM-as-judge | Sonnet scores docs on clarity/completeness/actionability | ~$0.15 | ~30s |

Tier 1 runs on every `bun test`. Tier 2 and 3 are gated behind env vars. The idea is: catch 95% of issues for free, use LLMs only for the judgment calls.
Tier 1 runs on every `bun test`. Tiers 2+3 are gated behind `EVALS=1`. The idea is: catch 95% of issues for free, use LLMs only for judgment calls.

## Command dispatch

@@ -231,6 +231,88 @@ Playwright's native errors are rewritten through `wrapError()` to strip internal

The server doesn't try to self-heal. If Chromium crashes (`browser.on('disconnected')`), the server exits immediately. The CLI detects the dead server on the next command and auto-restarts. This is simpler and more reliable than trying to reconnect to a half-dead browser process.

## E2E test infrastructure

### Session runner (`test/helpers/session-runner.ts`)

E2E tests spawn `claude -p` as a completely independent subprocess — not via the Agent SDK, which can't nest inside Claude Code sessions. The runner:

1. Writes the prompt to a temp file (avoids shell escaping issues)
2. Spawns `sh -c 'cat prompt | claude -p --output-format stream-json --verbose'`
3. Streams NDJSON from stdout for real-time progress
4. Races against a configurable timeout
5. Parses the full NDJSON transcript into structured results
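
Step 4 above can be sketched as a `Promise.race` between the subprocess finishing and a timer. This is a minimal sketch, not the runner's actual code: `raceWithTimeout` and `TimedResult` are hypothetical names, and a real runner would presumably also have to kill the subprocess when the timer wins.

```typescript
// Hypothetical helper: the name and result shape are illustrative assumptions,
// not taken from session-runner.ts.
type TimedResult<T> =
  | { kind: "done"; value: T }
  | { kind: "timeout"; afterMs: number };

function raceWithTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<TimedResult<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<TimedResult<T>>((resolve) => {
    timer = setTimeout(() => resolve({ kind: "timeout", afterMs: timeoutMs }), timeoutMs);
  });

  const done = work.then((value): TimedResult<T> => ({ kind: "done", value }));

  // Always clear the timer so it cannot keep the process alive after the
  // work has already finished (or failed).
  const clear = (): void => {
    if (timer !== undefined) clearTimeout(timer);
  };

  return Promise.race([done, timeout]).then(
    (result) => {
      clear();
      return result;
    },
    (err) => {
      clear();
      throw err;
    },
  );
}
```

The race only stops the caller from waiting; it does not cancel the losing promise, which is why subprocess cleanup has to happen separately.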

The `parseNDJSON()` function is pure — no I/O, no side effects — making it independently testable.
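
To illustrate why a pure parser is easy to test, here is a rough sketch of what such a function might look like. It is not the project's `parseNDJSON()`: the function name `parseNdjsonSketch`, the `ParsedTranscript` shape, and the event fields it reads (`type`, `message.content`, `tool_use`, `name`) are assumptions about the stream-json output.

```typescript
// Sketch only: the event fields below are assumed, not a documented schema.
interface StreamEvent {
  type?: string;
  subtype?: string;
  is_error?: boolean;
  message?: { content?: Array<{ type?: string; name?: string }> };
}

interface ParsedTranscript {
  events: StreamEvent[];
  toolCalls: string[]; // tool names, in call order
  result: StreamEvent | null; // last event with type "result", if any
  skippedLines: number; // non-empty lines that were not JSON objects
}

function parseNdjsonSketch(raw: string): ParsedTranscript {
  const out: ParsedTranscript = { events: [], toolCalls: [], result: null, skippedLines: 0 };

  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;

    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      out.skippedLines++; // tolerate partial or non-JSON lines
      continue;
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      out.skippedLines++;
      continue;
    }

    const event = parsed as StreamEvent;
    out.events.push(event);

    const content = event.message?.content;
    if (event.type === "assistant" && Array.isArray(content)) {
      for (const block of content) {
        if (block?.type === "tool_use" && typeof block.name === "string") {
          out.toolCalls.push(block.name);
        }
      }
    }
    if (event.type === "result") out.result = event;
  }
  return out;
}
```

Because the function takes a string and returns a value, a unit test can feed it a recorded transcript without spawning a subprocess.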

### Observability data flow

```
skill-e2e.test.ts
│ generates runId, passes testName + runId to each call
┌─────┼──────────────────────────────┐
│ │ │
│ runSkillTest() evalCollector
│ (session-runner.ts) (eval-store.ts)
│ │ │
│ per tool call: per addTest():
│ ┌──┼──────────┐ savePartial()
│ │ │ │ │
│ ▼ ▼ ▼ ▼
│ [HB] [PL] [NJ] _partial-e2e.json
│ │ │ │ (atomic overwrite)
│ │ │ │
│ ▼ ▼ ▼
│ e2e- prog- {name}
│ live ress .ndjson
│ .json .log
│ on failure:
│ {name}-failure.json
│ ALL files in ~/.gstack-dev/
│ Run dir: e2e-runs/{runId}/
│ eval-watch.ts
│ │
│ ┌─────┴─────┐
│ read HB read partial
│ └─────┬─────┘
│ ▼
│ render dashboard
│ (stale >10min? warn)
```

**Split ownership:** session-runner owns the heartbeat (current test state), eval-store owns partial results (completed test state). The watcher reads both. Neither component knows about the other — they share data only through the filesystem.
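
The watcher's stale warning from the diagram could be computed from the heartbeat alone. A minimal sketch, assuming the heartbeat carries an ISO-8601 timestamp; the field name `updatedAt` and the function `isStale` are hypothetical, and only the 10-minute threshold comes from the diagram.

```typescript
// Hypothetical heartbeat shape: `updatedAt` is an assumed field name.
interface Heartbeat {
  updatedAt: string; // ISO-8601 timestamp of the last heartbeat write
}

const STALE_AFTER_MS = 10 * 60 * 1000; // the ">10min" threshold from the diagram

function isStale(hb: Heartbeat, now: Date = new Date()): boolean {
  const last = Date.parse(hb.updatedAt);
  // An unparseable timestamp is treated as stale rather than throwing,
  // in keeping with best-effort observability.
  if (Number.isNaN(last)) return true;
  return now.getTime() - last > STALE_AFTER_MS;
}
```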

**Non-fatal everything:** All observability I/O is wrapped in try/catch. A write failure never causes a test to fail. The tests themselves are the source of truth; observability is best-effort.
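
The non-fatal pattern amounts to one small wrapper around each write. The following is a sketch under assumptions: `appendLineNonFatal` is a hypothetical helper, and the path used in the usage lines is a throwaway temp location, not the project's real directory.

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper: wraps one observability write so it can never throw.
// Returns true on success so callers may count failures if they care.
function appendLineNonFatal(path: string, line: string): boolean {
  try {
    mkdirSync(dirname(path), { recursive: true });
    appendFileSync(path, line + "\n");
    return true;
  } catch {
    // Deliberately swallowed: observability is best-effort.
    return false;
  }
}

// Usage: the caller ignores or logs the boolean; the test itself is unaffected.
const sketchLogPath = join(tmpdir(), "gstack-sketch", "progress.log");
const wroteLine = appendLineNonFatal(sketchLogPath, "turn 1: Bash");
```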

**Machine-readable diagnostics:** Each test result includes `exit_reason` (success, timeout, error_max_turns, error_api, exit_code_N), `timeout_at_turn`, and `last_tool_call`. This enables `jq` queries like:
```bash
jq '.tests[] | select(.exit_reason == "timeout") | .last_tool_call' ~/.gstack-dev/evals/_partial-e2e.json
```
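
One way to derive `exit_reason` is sketched below. Only the reason values come from the text above; the function name `classifyExit`, its inputs, and the precedence order are assumptions made for illustration.

```typescript
// Sketch under assumptions: inputs and precedence are illustrative.
interface ResultLine {
  subtype?: string;
  is_error?: boolean;
}

function classifyExit(timedOut: boolean, result: ResultLine | null, exitCode: number): string {
  if (timedOut) return "timeout";
  if (result !== null) {
    if (result.subtype === "error_max_turns") return "error_max_turns";
    // A result line can claim subtype "success" while is_error is true,
    // so the error flag is checked before trusting the subtype.
    if (result.is_error === true) return "error_api";
    if (result.subtype === "success") return "success";
  }
  return `exit_code_${exitCode}`;
}
```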

### Eval persistence (`test/helpers/eval-store.ts`)

The `EvalCollector` accumulates test results and writes them in two ways:

1. **Incremental:** `savePartial()` writes `_partial-e2e.json` after each test (atomic: write `.tmp`, `fs.renameSync`). Survives kills.
2. **Final:** `finalize()` writes a timestamped eval file (e.g. `e2e-20260314-143022.json`). The partial file is never cleaned up — it persists alongside the final file for observability.
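
The write-then-rename pattern behind the incremental save can be sketched as follows. This is not the `EvalCollector` code: `writeJsonAtomic` is a hypothetical name, and the usage lines write to a throwaway temp directory rather than `~/.gstack-dev/`.

```typescript
import { mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper illustrating the atomic overwrite.
function writeJsonAtomic(path: string, data: unknown): void {
  mkdirSync(dirname(path), { recursive: true });
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(data, null, 2));
  // On POSIX filesystems rename replaces the target in one step, so a reader
  // sees either the old file or the new one, never a half-written one.
  renameSync(tmp, path);
}

// Usage: overwrite the same file after each completed test.
const sketchPartialPath = join(tmpdir(), "gstack-sketch", "_partial-e2e.json");
writeJsonAtomic(sketchPartialPath, { tests: [{ name: "a" }] });
writeJsonAtomic(sketchPartialPath, { tests: [{ name: "a" }, { name: "b" }] });
const savedPartial = JSON.parse(readFileSync(sketchPartialPath, "utf8"));
```

If the process is killed between the two calls, the file still holds the complete first snapshot, which is the property "Survives kills" refers to.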

`eval:compare` diffs two eval runs. `eval:summary` aggregates stats across all runs in `~/.gstack-dev/evals/`.
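
As a rough illustration of what diffing two runs involves, the sketch below compares per-test pass/fail status. The eval file schema is not shown in this document, so the `name` and `passed` fields, the `RunDiff` shape, and the function `diffRuns` are all hypothetical; only the top-level `tests` array is suggested by the `jq` query above.

```typescript
// Hypothetical eval shape: `name` and `passed` are assumed fields.
interface EvalTest {
  name: string;
  passed: boolean;
}
interface EvalRun {
  tests: EvalTest[];
}
interface RunDiff {
  regressed: string[]; // passed before, fails now
  fixed: string[]; // failed before, passes now
  added: string[]; // only in the current run
  removed: string[]; // only in the previous run
}

function diffRuns(previous: EvalRun, current: EvalRun): RunDiff {
  const before = new Map<string, boolean>();
  for (const t of previous.tests) before.set(t.name, t.passed);

  const diff: RunDiff = { regressed: [], fixed: [], added: [], removed: [] };
  const seen = new Set<string>();

  for (const t of current.tests) {
    seen.add(t.name);
    const was = before.get(t.name);
    if (was === undefined) diff.added.push(t.name);
    else if (was && !t.passed) diff.regressed.push(t.name);
    else if (!was && t.passed) diff.fixed.push(t.name);
  }
  for (const t of previous.tests) {
    if (!seen.has(t.name)) diff.removed.push(t.name);
  }
  return diff;
}
```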

### Test tiers

| Tier | What | Cost | Speed |
|------|------|------|-------|
| 1 — Static validation | Parse `$B` commands, validate against registry, observability unit tests | Free | <5s |
| 2 — E2E via `claude -p` | Spawn real Claude session, run each skill, scan for errors | ~$3.85 | ~20min |
| 3 — LLM-as-judge | Sonnet scores docs on clarity/completeness/actionability | ~$0.15 | ~30s |

Tier 1 runs on every `bun test`. Tiers 2+3 are gated behind `EVALS=1`. The idea: catch 95% of issues for free, use LLMs only for judgment calls and integration testing.
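
The gating rule is simple enough to express as a pure function. This is a sketch: only the `EVALS=1` gate comes from the text, while `enabledTiers` and its return shape are hypothetical.

```typescript
// Hypothetical helper: decides which tiers run for a given environment.
type Tier = 1 | 2 | 3;

function enabledTiers(env: Record<string, string | undefined>): Tier[] {
  // Tier 1 is free and always runs; tiers 2 and 3 cost money and need opt-in.
  return env["EVALS"] === "1" ? [1, 2, 3] : [1];
}
```

Taking the environment as a parameter, rather than reading a global, keeps the decision testable without mutating process state.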

## What's intentionally not here

- **No WebSocket streaming.** HTTP request/response is simpler, debuggable with curl, and fast enough. Streaming would add complexity for marginal benefit.
71 changes: 27 additions & 44 deletions CHANGELOG.md
@@ -1,55 +1,38 @@
# Changelog

## 0.3.4 — 2026-03-13
## 0.3.6 — 2026-03-14

### Added
- **Daily update check** — all 9 skills now check for new versions once per day via `bin/gstack-update-check` (pure bash, <5ms cached). Prompts user via AskUserQuestion with option to upgrade or defer 24h.
- **`/gstack-upgrade` skill** — standalone upgrade command that detects install type (global-git, local-git, vendored), upgrades, and shows a "What's New" summary from CHANGELOG
- **"Just upgraded" confirmation** — after upgrading, the next skill invocation shows "Running gstack v{new} (just updated!)" via `~/.gstack/just-upgraded-from` marker
- **`AskUserQuestion` added to 5 skills** — gstack (root), browse, qa, retro, setup-browser-cookies now have AskUserQuestion in allowed-tools for upgrade prompts
- **`Bash` added to plan-eng-review** — enables the update check preamble to run in plan review sessions
- `browse/test/gstack-update-check.test.ts` — 10 test cases covering all script branch paths with `GSTACK_REMOTE_URL` env var for test isolation
- `TODOS.md` for tracking deferred work

### Changed
- **Version check is now one system** — removed SHA-based `checkVersion()` from `browse/src/find-browse.ts` (~120 lines deleted) and `browse/test/find-browse.test.ts` (~100 lines deleted). Replaced by `bin/gstack-update-check` bash script using semver VERSION comparison with 24h cache.
- Simplified `qa/SKILL.md` and `setup-browser-cookies/SKILL.md` setup blocks — removed old `BROWSE_OUTPUT`/`META` parsing, now use simple `find-browse` call
- Updated `browse/bin/find-browse` shim comments to reflect simplified role (binary locator only)

### Removed
- `checkVersion()`, `readCache()`, `writeCache()`, `fetchRemoteSHA()`, `resolveSkillDir()`, `CacheEntry` interface from `browse/src/find-browse.ts`
- `META:UPDATE_AVAILABLE` protocol from find-browse output
- Old META-based upgrade instructions from qa and setup-browser-cookies SKILL.md files
- Legacy `/tmp/gstack-latest-version` cache file (cleaned up by `setup` script)

## 0.3.5 — 2026-03-14
- **E2E observability** — heartbeat file (`~/.gstack-dev/e2e-live.json`), per-run log directory (`~/.gstack-dev/e2e-runs/{runId}/`), progress.log, per-test NDJSON transcripts, persistent failure transcripts. All I/O non-fatal.
- **`bun run eval:watch`** — live terminal dashboard reads heartbeat + partial eval file every 1s. Shows completed tests, current test with turn/tool info, stale detection (>10min), `--tail` for progress.log.
- **Incremental eval saves** — `savePartial()` writes `_partial-e2e.json` after each test completes. Crash-resilient: partial results survive killed runs. Never cleaned up.
- **Machine-readable diagnostics** — `exit_reason`, `timeout_at_turn`, `last_tool_call` fields in eval JSON. Enables `jq` queries for automated fix loops.
- **API connectivity pre-check** — E2E suite throws immediately on ConnectionRefused before burning test budget.
- **`is_error` detection** — `claude -p` can return `subtype: "success"` with `is_error: true` on API failures. Now correctly classified as `error_api`.
- **Stream-json NDJSON parser** — `parseNDJSON()` pure function for real-time E2E progress from `claude -p --output-format stream-json --verbose`.
- **Eval persistence** — results saved to `~/.gstack-dev/evals/` with auto-comparison against previous run.
- **Eval CLI tools** — `eval:list`, `eval:compare`, `eval:summary` for inspecting eval history.
- **All 9 skills converted to `.tmpl` templates** — plan-ceo-review, plan-eng-review, retro, review, ship now use `{{UPDATE_CHECK}}` placeholder. Single source of truth for update check preamble.
- **3-tier eval suite** — Tier 1: static validation (free), Tier 2: E2E via `claude -p` (~$3.85/run), Tier 3: LLM-as-judge (~$0.15/run). Gated by `EVALS=1`.
- **Planted-bug outcome testing** — eval fixtures with known bugs, LLM judge scores detection.
- 15 observability unit tests covering heartbeat schema, progress.log format, NDJSON naming, savePartial, finalize, watcher rendering, stale detection, non-fatal I/O.
- E2E tests for plan-ceo-review, plan-eng-review, retro skills.
- Update-check exit code regression tests.
- `test/helpers/skill-parser.ts` — `getRemoteSlug()` for git remote detection.

### Fixed
- **Browse binary discovery broken for agents** — replaced `find-browse` indirection with explicit `browse/dist/browse` path in SKILL.md setup blocks. Agents were guessing `bin/browse` (wrong) instead of running `find-browse` to discover `browse/dist/browse` (correct).
- **Update check exit code 1 misleading agents** — `[ -n "$_UPD" ] && echo "$_UPD"` returned exit code 1 when no update available, causing agents to think gstack was broken. Added `|| true`.
- **browse/SKILL.md missing setup block** — `/browse` used `$B` in every example but never defined it. Added `{{BROWSE_SETUP}}` placeholder.
- **Browse binary discovery broken for agents** — replaced `find-browse` indirection with explicit `browse/dist/browse` path in SKILL.md setup blocks.
- **Update check exit code 1 misleading agents** — added `|| true` to prevent non-zero exit when no update available.
- **browse/SKILL.md missing setup block** — added `{{BROWSE_SETUP}}` placeholder.
- **plan-ceo-review timeout** — init git repo in test dir, skip codebase exploration, bump timeout to 420s.
- Planted-bug eval reliability — simplified prompts, lowered detection baselines, resilient to max_turns flakes.

### Changed
- Enriched 14 command descriptions with specific arg formats, valid values, error behavior, and return types
- Fixed `header` usage from `<name> <value>` to `<name>:<value>` (matching actual implementation)
- Added `cookie` usage syntax: `cookie <name>=<value>`
- **Template system expanded** — added `{{UPDATE_CHECK}}` and `{{BROWSE_SETUP}}` placeholders to `gen-skill-docs.ts`. Converted `qa/SKILL.md` and `setup-browser-cookies/SKILL.md` to `.tmpl` templates. All 4 browse-using skills now generate from a single source of truth.
- Setup block now checks workspace-local path first (for development), then falls back to global `~/.claude/skills/gstack/browse/dist/browse`

### Added
- 3 new e2e test cases for SKILL.md setup flow: happy path, NEEDS_SETUP, non-git-repo
- LLM eval for setup block clarity (actionability + clarity >= 4)
- `no such file or directory.*browse` error pattern in session-runner
- TODO: convert remaining 5 non-browse skills to .tmpl files
- Enriched 4 snapshot flag descriptions with defaults, output paths, and behavior details
- Snapshot flags section now shows long flag names (`-i / --interactive`) alongside short
- Added ref numbering explanation and output format example to snapshot docs
- Replaced hand-maintained server.ts help text with auto-generated `generateHelpText()` from COMMAND_DESCRIPTIONS
- Upgraded LLM eval judge from Haiku to Sonnet 4.6 for more stable scoring

### Added
- Usage string consistency test: cross-checks `Usage:` patterns in implementation against COMMAND_DESCRIPTIONS
- Pipe guard test: ensures no command description contains `|` (would break markdown tables)
- **Template system expanded** — `{{UPDATE_CHECK}}` and `{{BROWSE_SETUP}}` placeholders in `gen-skill-docs.ts`. All browse-using skills generate from single source of truth.
- Enriched 14 command descriptions with specific arg formats, valid values, error behavior, and return types.
- Setup block checks workspace-local path first (for development), falls back to global install.
- LLM eval judge upgraded from Haiku to Sonnet 4.6.
- `generateHelpText()` auto-generated from COMMAND_DESCRIPTIONS (replaces hand-maintained help text).

## 0.3.3 — 2026-03-13

24 changes: 16 additions & 8 deletions CLAUDE.md
@@ -4,16 +4,23 @@

```bash
bun install # install dependencies
bun test # run tests (browse + snapshot + skill validation)
bun run test:eval # run LLM-as-judge evals (needs ANTHROPIC_API_KEY)
bun run test:e2e # run E2E skill tests (needs SKILL_E2E=1, ~$0.50/run)
bun test # run free tests (browse + snapshot + skill validation)
bun run test:evals # run paid evals: LLM judge + E2E (~$4/run)
bun run test:e2e # run E2E tests only (~$3.85/run)
bun run dev <cmd> # run CLI in dev mode, e.g. bun run dev goto https://example.com
bun run build # gen docs + compile binaries
bun run gen:skill-docs # regenerate SKILL.md files from templates
bun run skill:check # health dashboard for all skills
bun run dev:skill # watch mode: auto-regen + validate on change
bun run eval:list # list all eval runs from ~/.gstack-dev/evals/
bun run eval:compare # compare two eval runs (auto-picks most recent)
bun run eval:summary # aggregate stats across all eval runs
```

`test:evals` requires `ANTHROPIC_API_KEY`. E2E tests stream progress in real-time
(tool-by-tool via `--output-format stream-json --verbose`). Results are persisted
to `~/.gstack-dev/evals/` with auto-comparison against the previous run.

## Project structure

```
@@ -29,11 +36,12 @@ gstack/
│ ├── skill-check.ts # Health dashboard
│ └── dev-skill.ts # Watch mode
├── test/ # Skill validation + eval tests
│ ├── helpers/ # skill-parser.ts, session-runner.ts
│ ├── skill-validation.test.ts # Tier 1: static command validation
│ ├── gen-skill-docs.test.ts # Tier 1: generator + quality evals
│ ├── skill-e2e.test.ts # Tier 2: Agent SDK E2E
│ └── skill-llm-eval.test.ts # Tier 3: LLM-as-judge
│ ├── helpers/ # skill-parser.ts, session-runner.ts, llm-judge.ts, eval-store.ts
│ ├── fixtures/ # Ground truth JSON, planted-bug fixtures, eval baselines
│ ├── skill-validation.test.ts # Tier 1: static validation (free, <1s)
│ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s)
│ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run)
│ └── skill-e2e.test.ts # Tier 2: E2E via claude -p (~$3.85/run)
├── ship/ # Ship workflow skill
├── review/ # PR review skill
├── plan-ceo-review/ # /plan-ceo-review skill