Symptom
When the laptop lid is closed for a while, or the Chrome tab/window is backgrounded for an extended period, returning to the Web UI leaves it in a broken state during an in-progress response:
- Status still shows running (or stays stuck mid-stream).
- The user's most recent message (sent before the suspend) is missing from the chat history.
- The currently-running tool / streaming text is not restored.
- Only a manual page refresh recovers the state. Subsequent WS events arrive normally.
Repro
- Send a long-running prompt that takes ~30s+ (e.g. a multi-step orchestration or a heavy boss reply).
- Close the laptop lid (or switch Chrome to a different app and leave it idle for several minutes).
- Wait until the OS suspends or Chrome heavily throttles the tab.
- Resume — focus the Web UI tab again.
Expected: the in-progress response and its tool log resume from the snapshot; latest user message bubble is visible.
Actual: chat is frozen on pre-suspend state, latest user bubble missing, tool not progressing, requires F5.
Suspected root cause (client side)
public/js/ws.ts registers visibility/focus/pageshow restore hooks that call syncOrchestrateSnapshot(reason) without hydrateRun: true, so:
- hydrateActiveRun(snap.activeRun) is not invoked on re-focus → in-progress agent bubble & tool log are not rebuilt.
- loadMessages() is not called → the latest user message (which was queued/streamed in just before suspend) is not re-fetched from /api/messages.
Both paths only run inside state.ws.onopen (i.e. only after a real WebSocket reconnect). If the OS/browser keeps the WS object in OPEN state during the suspend (no onclose fires), the restore hook silently does the light sync only.
Additionally, there is no WS ping/pong heartbeat: neither the WebSocketServer in server.ts nor src/core/bus.ts has any ping/pong/isAlive/terminate logic. After a long suspend the TCP socket can be silently dead while the client still believes the WS is open, so reconnect-driven recovery never fires.
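For reference, the kind of server-side liveness bookkeeping that is currently missing could look roughly like the sketch below. It is not taken from the codebase: PingableSocket and HeartbeatRegistry are hypothetical names, and the interface only assumes a socket object exposing ping() and terminate() (the `ws` package's WebSocket has methods by those names). The idea is the usual one: mark a socket alive when a pong arrives, and on each sweep terminate sockets that never answered the previous ping.

```typescript
// Hypothetical minimal socket surface, not the project's real type.
interface PingableSocket {
  ping(): void;
  terminate(): void;
}

interface TrackedSocket {
  socket: PingableSocket;
  alive: boolean;
}

class HeartbeatRegistry {
  private tracked: TrackedSocket[] = [];

  // Call when a client connects.
  track(socket: PingableSocket): void {
    this.tracked.push({ socket, alive: true });
  }

  // Call from the socket's pong handler.
  markAlive(socket: PingableSocket): void {
    for (const entry of this.tracked) {
      if (entry.socket === socket) entry.alive = true;
    }
  }

  // Call when a socket closes normally.
  untrack(socket: PingableSocket): void {
    this.tracked = this.tracked.filter((entry) => entry.socket !== socket);
  }

  // Run on an interval. Sockets that did not answer the previous ping are
  // terminated; the others are flagged as pending and pinged again.
  // Returns the number of sockets terminated in this sweep.
  sweep(): number {
    const survivors: TrackedSocket[] = [];
    let terminated = 0;
    for (const entry of this.tracked) {
      if (!entry.alive) {
        entry.socket.terminate();
        terminated++;
        continue;
      }
      entry.alive = false;
      entry.socket.ping();
      survivors.push(entry);
    }
    this.tracked = survivors;
    return terminated;
  }
}
```

Wiring this up would mean calling sweep() from a setInterval next to the WebSocketServer; the interval length is a tuning choice that the issue does not specify.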
Server-side data is already there
/api/orchestrate/snapshot already returns activeRun populated by getLiveRun(scope) in src/agent/live-run-state.ts — text, toolLog, cli, running flag — so a client-side fix can rehydrate without protocol changes.
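A rough sketch of the client-side decision this enables is below. The types are assumptions: only the names activeRun, text, toolLog and cli come from this issue, while the field types, the name of the running flag, and the helper runToHydrate are hypothetical and should be checked against src/agent/live-run-state.ts.

```typescript
// Assumed shape of the live run; field types are guesses for illustration.
interface ActiveRunSketch {
  running: boolean;   // assumed name for the "running flag"
  text: string;       // streamed text so far
  toolLog: unknown[]; // entry shape is not specified in this issue
  cli: string;
}

// Assumed shape of the /api/orchestrate/snapshot payload (other fields omitted).
interface SnapshotSketch {
  activeRun: ActiveRunSketch | null;
}

// Returns the run that should be passed to hydration, or null when there is
// nothing in progress to rebuild.
function runToHydrate(snap: SnapshotSketch): ActiveRunSketch | null {
  const run = snap.activeRun;
  return run !== null && run.running ? run : null;
}
```

A restore hook could pass a non-null result to hydrateActiveRun and skip hydration otherwise.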
Proposed direction (to be confirmed in devlog _plan)
- On visibilitychange/focus/pageshow/resume, additionally call loadMessages() and hydrateActiveRun (i.e. pass hydrateRun: true and reload history) when the page was hidden long enough.
- Add WS keepalive: server-side ping interval + client pong handler with stale-socket termination, so a dead WS triggers onclose and the existing reconnect path runs.
- Detect long visibility gap (Date.now() - lastVisibleAt > N) and force-reset WS via state.ws.close() to deterministically take the reconnect path.
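The gap detection in the first and third bullets could be sketched as below. This is a minimal illustration under stated assumptions, not the planned implementation: VisibilityGapTracker, planRestore, RestorePolicy and both thresholds are hypothetical names and values, and the clock is injected so the logic can be exercised without a browser.

```typescript
type RestoreAction = "light-sync" | "full-restore" | "force-reconnect";

// Hypothetical thresholds; the issue leaves N open.
interface RestorePolicy {
  fullRestoreAfterMs: number;
  forceReconnectAfterMs: number;
}

// Maps how long the page was hidden to the recovery path to take.
function planRestore(hiddenForMs: number, policy: RestorePolicy): RestoreAction {
  if (hiddenForMs >= policy.forceReconnectAfterMs) return "force-reconnect";
  if (hiddenForMs >= policy.fullRestoreAfterMs) return "full-restore";
  return "light-sync";
}

class VisibilityGapTracker {
  // Timestamp of the moment the page stopped being visible; null while visible.
  private lastVisibleAt: number | null = null;
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Call when the page becomes hidden.
  markHidden(): void {
    if (this.lastVisibleAt === null) this.lastVisibleAt = this.now();
  }

  // Call on visibilitychange/focus/pageshow. Returns the hidden duration in
  // milliseconds and resets, so duplicate restore events report a gap of 0.
  markVisible(): number {
    if (this.lastVisibleAt === null) return 0;
    const gap = this.now() - this.lastVisibleAt;
    this.lastVisibleAt = null;
    return gap;
  }
}
```

In the restore hooks, "full-restore" would correspond to calling loadMessages() plus a snapshot sync with hydrateRun: true, and "force-reconnect" to state.ws.close(). Resetting on the first restore event keeps focus and pageshow firing back to back from triggering the heavy path twice.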
Out of scope
- Tool-call / orchestration semantics on the server.
- Boss/employee dispatch behavior.
- Manager dashboard refresh (already separate).
Devlog plan to follow under devlog/_plan/260428_web_ui_resume_recovery/.