Two crash signatures on v0.33.130 (2026-04-13) — renderer death + host hang

Two crash signatures observed on a single long-running session (v0.33.130 portable, launched 15:31 PDT 2026-04-13, reported while drafting the graceful-crash-handling spec). Both the host and the sidecar are **still alive in `tasklist` at report time**: log output has been frozen for hours, but PIDs 14732 (host), 22804 + 3880 (sidecar), and 15196 (launcher) are all present.

Full post-mortem: [`docs/analysis/crashes-2026-04-13.md`](https://github.com/agentmuxai/agentmux/blob/main/docs/analysis/crashes-2026-04-13.md)
Related spec: [`SPEC_GRACEFUL_CRASH_HANDLING_2026_04_13.md`](https://github.com/agentmuxai/agentmux/blob/main/specs/SPEC_GRACEFUL_CRASH_HANDLING_2026_04_13.md)
In-flight PR: #375 (catches crash #1 directly)

## Timeline

| PDT | UTC | Event |
|---|---|---|
| 15:31:33 | 22:31:33 | Launcher spawns v0.33.130 host; sidecar starts |
| 15:31:34 | 22:31:34 | Main WS conn (`b32df86e`) connects |
| **16:50:24** | **23:50:24** | **Crash #1.** Main WS disconnects. No reconnect. |
| 16:59 → 23:36 | 23:59 → 06:36 (04-14) | Host memory heartbeats keep ticking every 20s — **renderer is dead but host is alive** |
| 23:31:33 | 06:31:33 | Sidecar's last hourly archiver sweep |
| **23:36:14** | **06:36:14** | **Crash #2.** Host memory heartbeat stops. No further output. |
| (after) | n/a | Launcher never writes `exited with code N` for v0.33.130; it still thinks the host is alive |

## Crash #1 — CEF renderer subprocess died (the "white screen")

**Signature:**
- Sidecar log: `"WebSocket client disconnected","conn_id":"b32df86e-9ff1-4751-9e74-0648d41573f1"` at 23:50:24 UTC
- Host log keeps emitting memory heartbeats **for hours after the WS drop** — proves only the renderer subprocess died
- No WER dump (CEF subprocess crashes go to CEF's Crashpad, not WER)
- System memory at the time: **35–38% load, ~20 GB free** — **not an OS-level OOM**

**Root cause (likely):** Chromium renderer-internal failure: V8 heap OOM inside the renderer's own address space, a JS panic, or a Blink-side bug. A 78-minute session isn't particularly long, but long agent sessions accumulate DOM + JS objects inside the capped per-process heap even with `content-visibility: auto`.

**Why no recovery today:** `agentmux-cef/src/client.rs` has no `CefRequestHandler`, so `on_render_process_terminated` never fires on our side. The window stays white with the last paint frozen, and the user's only way out is to close it via Task Manager.

**Fix in flight:** PR #375 (step 1 of SPEC_GRACEFUL_CRASH_HANDLING). Adds `AgentMuxRequestHandler`, logs the termination as `target="crash", kind="renderer_terminated"`, and loads a self-contained recovery HTML page (Reload / Quit buttons) as a `data:` URL on the dead browser's main frame.
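For illustration only, a minimal sketch of how a self-contained recovery page could be packed into a `data:` URL using just the standard library. This is not the code in PR #375: `recovery_data_url`, its `reason` and `app_url` parameters, and the page markup are all hypothetical, and the Quit button is a placeholder that a real handler would wire to the host.

```rust
/// Escape text for safe inclusion in HTML body text or attribute values.
fn html_escape(input: &str) -> String {
    input
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Percent-encode every byte except RFC 3986 unreserved characters so the
/// HTML survives intact inside a `data:` URL.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len() * 3);
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build the recovery page as a `data:` URL (hypothetical helper).
/// `reason` labels the termination; `app_url` is where "Reload" navigates.
fn recovery_data_url(reason: &str, app_url: &str) -> String {
    let html = format!(
        "<!doctype html><html><body>\
         <h1>Renderer process terminated</h1>\
         <p>Reason: {}</p>\
         <a href=\"{}\">Reload</a> \
         <button onclick=\"window.close()\">Quit</button>\
         </body></html>",
        html_escape(reason),
        html_escape(app_url)
    );
    format!("data:text/html;charset=utf-8,{}", percent_encode(&html))
}
```

Percent-encoding rather than base64 keeps the sketch dependency-free; either works for a page this small.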

## Crash #2 — Host process hung (still alive, zero log output)

**Signature:**
- Host memory heartbeat stops emitting at 06:36:14 UTC 04-14
- Sidecar hourly session-archiver sweep stops 5 minutes earlier at 06:31:33 UTC
- **Both processes still resident in memory per `tasklist`** (host PID 14732 at 103 MB, sidecar PIDs 22804 + 3880)
- Launcher log has **no `exited with code N`** for v0.33.130; it's still waiting
- No WER dump, no Crashpad dump, no exception trace
- System memory at the last heartbeat: **31% load, 22 GB free** — **not an OOM**

**Interpretation:** This is a **hang**, not a crash. All threads in the host process (UI, heartbeat, tracing, IPC) stopped making forward progress. The heartbeat thread is an infinite loop independent of UI input, so its stopping points to one of the following:

1. Global lock contention: a downstream sink of `tracing::info!` wedged, back-pressuring every caller including `mem-heartbeat`.
2. OS process suspension (Task Manager → Suspend, or a debugger attachment).
3. A kernel-mode driver bug that paused the process (rare but happens with GPU/display drivers, which CEF exercises).

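The sidecar going silent **about 5 minutes before the host** argues against a simultaneous OS suspend and suggests a slow cascade: first sidecar, then host. One caveat: the archiver sweep is hourly, so the log only shows that the sidecar stopped somewhere between 06:31:33 and 07:31:33 UTC. The ordering is suggestive, not proven.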
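Hypothesis 1 can be modelled with a bounded queue whose consumer has stopped draining. This is a toy model, not the host's logging pipeline: the queue capacity, message format, and function name are assumptions, and whether the host's `tracing` setup actually uses a bounded or blocking writer has not been confirmed.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Simulates a heartbeat emitter writing into a bounded log queue whose
/// consumer (the "sink") never drains it. Returns how many heartbeats were
/// accepted before the queue filled up.
fn heartbeats_accepted_before_stall(queue_capacity: usize, attempts: usize) -> usize {
    // The receiver is kept alive but never read, like a wedged sink.
    let (tx, _rx_never_drained) = sync_channel::<String>(queue_capacity);
    let mut accepted = 0;
    for tick in 0..attempts {
        match tx.try_send(format!("mem-heartbeat tick={tick}")) {
            Ok(()) => accepted += 1,
            // A blocking `send` would park the calling thread here
            // indefinitely, which is what a silent heartbeat looks like.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

In this model, once the queue is full every thread that logs through it stalls at its next log call, which would match UI, heartbeat, and IPC threads all going quiet without the process exiting.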

## Gap in the current spec

`SPEC_GRACEFUL_CRASH_HANDLING_2026_04_13.md` addresses crashes that **terminate** a process: it catches render-process termination, unhandled JS, and sidecar exit. It does **not** catch crash #2 (host keeps its PID but stops executing). The launcher waits on the host PID, not on its liveness.

**Proposed follow-up: Step 7 — host-hang watchdog.** A dedicated thread in the sidecar (the only process we can be sure stays alive when the host hangs) that wakes once per minute and checks the host log's mtime. If it hasn't moved in N minutes while the host PID is still alive, log a hang warning and optionally kill the host PID so the launcher can re-spawn it.

**Caveat:** user-idle (minimized window, locked screen) shouldn't trigger the watchdog. The host's memory heartbeat runs independent of UI input every 20s, so it's a reliable liveness signal — if that stops for >2 minutes, the process is wedged, not idle.
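A minimal sketch of the watchdog decision, assuming the check is driven by the host log's mtime as proposed. `HostState`, `classify`, and `log_silence` are hypothetical names; how the sidecar learns whether the host PID is alive, and what it does on a `Hung` verdict, are left out.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Verdict for one watchdog pass.
#[derive(Debug, PartialEq)]
enum HostState {
    Healthy,
    Hung { silent_for: Duration },
    Exited,
}

/// Pure decision: a live PID whose log has been silent past the threshold
/// is treated as hung; a dead PID is the launcher's problem, not ours.
fn classify(pid_alive: bool, silent_for: Duration, threshold: Duration) -> HostState {
    if !pid_alive {
        HostState::Exited
    } else if silent_for > threshold {
        HostState::Hung { silent_for }
    } else {
        HostState::Healthy
    }
}

/// How long ago the host log was last written, according to its mtime.
fn log_silence(log_path: &Path, now: SystemTime) -> io::Result<Duration> {
    let modified = fs::metadata(log_path)?.modified()?;
    // A clock adjustment can put mtime in the future; treat that as
    // "just written" instead of failing the pass.
    Ok(now.duration_since(modified).unwrap_or(Duration::ZERO))
}
```

With a 20s heartbeat, a 2-minute threshold tolerates several missed ticks before declaring a hang. The host logs are date-suffixed, so a real watchdog would also need to follow the newest file across the daily rollover.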

## Action items

- [x] Draft post-mortem (`docs/analysis/crashes-2026-04-13.md`)
- [ ] Merge #375 (crash #1 fix)
- [ ] Implement Step 2 of the spec: frontend `<ErrorBoundary>` + `window.onerror` / `onunhandledrejection`, which catches JS-layer failures that don't terminate the renderer
- [ ] Add Step 7 to the spec: host-hang watchdog (covers crash #2)
- [ ] Investigate renderer V8 heap budget — separate line of work. Long agent sessions accumulate DOM/JS inside the renderer's own capped heap. Would reduce the *frequency* of crash #1, not just the symptoms.

## Artifacts

All under `~/.agentmux/logs/`:

- `agentmux-host-v0.33.130.log.2026-04-13` (419 KB, 1202 lines, last at 23:59:53 UTC)
- `agentmux-host-v0.33.130.log.2026-04-14` (657 KB, 2314 lines, last at 06:36:14 UTC)
- `agentmuxsrv-v0.33.130.log.2026-04-13` (128 KB)
- `agentmuxsrv-v0.33.130.log.2026-04-14` (1.7 KB, hourly sweeps, stops at 06:31:33 UTC)
- `agentmux-launcher.log`: last entry `v0.33.130 spawning CEF host with 0 args` at epoch 1776119493 (15:31:33 PDT); **no exit line**
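The launcher's epoch timestamp can be cross-checked against the timeline with plain arithmetic. A small sketch, assuming a fixed UTC-7 offset for PDT; `time_of_day` is a hypothetical helper, not something in the codebase.

```rust
/// Split a Unix timestamp into (hour, minute, second) of the day, shifted
/// by a fixed UTC offset in hours (0 for UTC, -7 for PDT).
fn time_of_day(epoch_secs: i64, utc_offset_hours: i64) -> (i64, i64, i64) {
    let local = epoch_secs + utc_offset_hours * 3600;
    // rem_euclid keeps the result non-negative even for negative inputs.
    let secs_of_day = local.rem_euclid(86_400);
    (secs_of_day / 3600, (secs_of_day % 3600) / 60, secs_of_day % 60)
}
```

For epoch 1776119493 this gives 22:31:33 with offset 0 and 15:31:33 with offset -7, matching the first timeline row.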

No WER crash dumps for this session. No CEF Crashpad dumps for this session.

---

Labels: `bug`, `crash`, `postmortem`
