Two crash signatures observed on a single long-running session (v0.33.130 portable, launched 15:31 PDT 2026-04-13, reported while drafting the graceful-crash-handling spec). Both the host and the sidecar are still alive in `tasklist` at report time: log output has been frozen for hours, but PIDs 14732 (host), 22804 + 3880 (sidecar), and 15196 (launcher) are all present.
Full post-mortem: `docs/analysis/crashes-2026-04-13.md`
Related spec: `SPEC_GRACEFUL_CRASH_HANDLING_2026_04_13.md`
In-flight PR: #375 (catches crash #1 directly)
Timeline
| PDT | UTC | Event |
|---|---|---|
| 15:31:33 | 22:31:33 | Launcher spawns v0.33.130 host; sidecar starts |
| 15:31:34 | 22:31:34 | Main WS conn (`b32df86e`) connects |
| 16:50:24 | 23:50:24 | Crash #1. Main WS disconnects. No reconnect. |
| 16:59 → 23:36 | 23:59 → 06:36 (04-14) | Host memory heartbeats keep ticking every 20s (renderer is dead but host is alive) |
| 23:31:33 | 06:31:33 | Sidecar's last hourly archiver sweep |
| 23:36:14 | 06:36:14 | Crash #2. Host memory heartbeat stops. No further output. |
| (after) | n/a | Launcher never writes `exited with code N` for v0.33.130; it still thinks the host is alive |
Crash #1 — CEF renderer subprocess died (the "white screen")
Signature:
- Sidecar log: `"WebSocket client disconnected","conn_id":"b32df86e-9ff1-4751-9e74-0648d41573f1"` at 23:50:24 UTC
- Host log keeps emitting memory heartbeats for hours after the WS drop — proves only the renderer subprocess died
- No WER dump (CEF subprocess crashes go to CEF's Crashpad, not WER)
- System memory at the time: 35–38% load, ~20 GB free — not an OS-level OOM
Root cause (likely): a Chromium renderer-internal failure, such as a V8 heap OOM inside the renderer's own address space, a JS panic, or a Blink-side bug. A 78-minute session isn't particularly long, but agent sessions accumulate DOM and JS objects inside the capped per-process heap even with `content-visibility: auto`.
Why no recovery today: `agentmux-cef/src/client.rs` has no `CefRequestHandler`, so `on_render_process_terminated` never fires on our side. The window stays white with its last paint frozen, and the user has no way to recover except closing it via Task Manager.
Fix in flight: PR #375 (step 1 of SPEC_GRACEFUL_CRASH_HANDLING). Adds `AgentMuxRequestHandler`, logs the termination as `target="crash", kind="renderer_terminated"`, and loads a self-contained recovery HTML page (Reload / Quit buttons) as a `data:` URL on the dead browser's main frame.
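For illustration only, here is a minimal sketch of how a self-contained HTML page can be packed into a `data:` URL. It is not taken from PR #375; `percent_encode` and `recovery_data_url` are hypothetical names, and the real recovery page and its encoding may differ (base64 is another common choice).

```rust
/// Percent-encode every byte except the RFC 3986 unreserved characters.
/// Hypothetical helper for this sketch, not code from PR #375.
pub fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len() * 3);
    for b in input.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            // Everything else (including '<', '>', ' ', '#') becomes %XX.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Wrap an HTML document in a `data:` URL that a browser frame can load
/// without touching the network or the file system.
pub fn recovery_data_url(html: &str) -> String {
    format!("data:text/html;charset=utf-8,{}", percent_encode(html))
}
```

Encoding `#` matters in particular: left raw, it would start a URL fragment and truncate the page.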
Crash #2 — Host process hung (still alive, zero log output)
Signature:
- Host memory heartbeat stops emitting at 06:36:14 UTC 04-14
- Sidecar hourly session-archiver sweep stops 5 minutes earlier at 06:31:33 UTC
- Both processes still resident in memory per `tasklist` (host PID 14732 at 103 MB, sidecar PIDs 22804 + 3880)
- Launcher log has no `exited with code N` for v0.33.130 — it's still waiting
- No WER dump, no Crashpad dump, no exception trace
- System memory at the last heartbeat: 31% load, 22 GB free — not an OOM
Interpretation: This is a hang, not a crash. All threads in the host process (UI, heartbeat, tracing, IPC) stopped making forward progress. The heartbeat thread is an infinite loop independent of UI input, so its stopping points to one of:
- Global lock contention — a downstream sink of `tracing::info!` wedged, back-pressuring every caller including `mem-heartbeat`.
- OS process suspension (Task Manager → Suspend, or a debugger attachment).
- A kernel-mode driver bug that paused the process (rare but happens with GPU/display drivers, which CEF exercises).
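The sidecar going silent about 5 minutes before the host argues against a simultaneous OS suspend and suggests a slow cascade: first the sidecar, then the host. Because the archiver sweep only runs hourly, 06:31:33 is a lower bound on when the sidecar stopped, not an exact time.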
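The back-pressure hypothesis can be illustrated with a bounded queue whose consumer has stalled. This is a standard-library sketch of the mechanism only; it does not reflect how the host's `tracing` sink is actually configured, and `emit_nonblocking` is a hypothetical name.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Hand a log line to a bounded sink without ever blocking the caller.
/// Returns false if the line had to be dropped.
pub fn emit_nonblocking(sink: &SyncSender<String>, line: String) -> bool {
    match sink.try_send(line) {
        Ok(()) => true,
        // Queue is full because the consumer is stalled. A blocking `send`
        // would park the calling thread at this point.
        Err(TrySendError::Full(_)) => false,
        // The consumer has gone away entirely.
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

With a blocking `send` on the same channel, every producer (a heartbeat thread included) would wait for as long as the consumer stays wedged, which would look from the outside like all threads going silent together.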
Gap in the current spec
`SPEC_GRACEFUL_CRASH_HANDLING_2026_04_13.md` addresses crashes that terminate a process — it catches render-process termination, unhandled JS, sidecar exit. It does not catch crash #2 (host keeps its PID but stops executing). The launcher waits on the host PID, not on its liveness.
Proposed follow-up: Step 7 — host-hang watchdog. A dedicated thread in the sidecar (the only process we can be sure stays alive when the host hangs) that wakes once per minute and checks the host log's mtime. If it hasn't moved in N minutes while the host PID is still alive, log a hang warning and optionally kill the host PID so the launcher can re-spawn it.
Caveat: user-idle (minimized window, locked screen) shouldn't trigger the watchdog. The host's memory heartbeat runs independent of UI input every 20s, so it's a reliable liveness signal — if that stops for >2 minutes, the process is wedged, not idle.
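A rough sketch of the watchdog's decision logic, assuming the log-mtime approach described above. `HostState`, `classify`, and `poll_once` are hypothetical names, and the PID check is passed in as a flag because it is platform-specific and not shown here.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Outcome of one watchdog poll (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum HostState {
    /// Host PID is gone; the launcher's normal exit handling applies.
    Exited,
    /// The log was written recently enough.
    Alive,
    /// PID is present but the log has been silent longer than the threshold.
    Hung { silent_for: Duration },
}

/// Pure decision function, kept separate from I/O so it can be unit-tested.
pub fn classify(
    pid_alive: bool,
    last_log_write: SystemTime,
    now: SystemTime,
    threshold: Duration,
) -> HostState {
    if !pid_alive {
        return HostState::Exited;
    }
    // `duration_since` fails if the mtime is in the future (clock skew);
    // treat that case as zero silence rather than as a hang.
    let silent_for = now.duration_since(last_log_write).unwrap_or(Duration::ZERO);
    if silent_for > threshold {
        HostState::Hung { silent_for }
    } else {
        HostState::Alive
    }
}

/// One poll: read the host log's mtime and classify it against the clock.
pub fn poll_once(log: &Path, pid_alive: bool, threshold: Duration) -> io::Result<HostState> {
    let mtime = fs::metadata(log)?.modified()?;
    Ok(classify(pid_alive, mtime, SystemTime::now(), threshold))
}
```

With the 20s heartbeat, a threshold of `Duration::from_secs(120)` matches the ">2 minutes" rule. The host log file names carry a date suffix, so a real watchdog would presumably need to follow the newest file across the daily rollover.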
Action items
Artifacts
All under `~/.agentmux/logs/`:
- `agentmux-host-v0.33.130.log.2026-04-13` (419 KB, 1202 lines, last at 23:59:53 UTC)
- `agentmux-host-v0.33.130.log.2026-04-14` (657 KB, 2314 lines, last at 06:36:14 UTC)
- `agentmuxsrv-v0.33.130.log.2026-04-13` (128 KB)
- `agentmuxsrv-v0.33.130.log.2026-04-14` (1.7 KB, hourly sweeps, stops at 06:31:33 UTC)
- `agentmux-launcher.log` — last entry `v0.33.130 spawning CEF host with 0 args` at epoch 1776119493 (15:31:33 PDT); no exit line
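The launcher's epoch timestamp can be cross-checked against the timeline with plain arithmetic. This is a small sketch; `time_of_day` is a hypothetical helper, and the fixed UTC-7 offset for PDT is supplied by hand rather than looked up from a timezone database.

```rust
/// Format the time of day of a Unix timestamp at a fixed UTC offset in hours.
pub fn time_of_day(epoch_secs: i64, utc_offset_hours: i64) -> String {
    let shifted = epoch_secs + utc_offset_hours * 3600;
    // rem_euclid keeps the result in 0..86400 even for negative inputs.
    let secs = shifted.rem_euclid(86_400);
    format!("{:02}:{:02}:{:02}", secs / 3600, (secs % 3600) / 60, secs % 60)
}
```

`time_of_day(1776119493, 0)` gives `22:31:33` and `time_of_day(1776119493, -7)` gives `15:31:33`, matching the first row of the timeline.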
No WER crash dumps for this session. No CEF Crashpad dumps for this session.
Labels: `bug`, `crash`, `postmortem`