Observed
After force-quitting AgentMux on macOS, agentmuxsrv-rs (v0.32.43) stayed alive and consumed 338% CPU (nearly 4 cores maxed) for 33+ minutes. System was at 44% sys / 35% idle — entirely caused by the orphaned backend.
The kqueue parent watcher (added in PR #144) was present in the binary but did not kill the process.
Environment
- macOS, Apple M2, 8 cores
- Backend launched from DMG mount (/Volumes/AgentMux/)
- Frontend was force-quit (Cmd-Q or force quit from Activity Monitor)
Why the backend didn't die
The kqueue parent watcher calls std::process::exit(0) when the parent PID exits. Possible failure modes:
- Parent PID was already 1 (launchd) at startup: kqueue watches launchd, which never exits, and the PPID-polling fallback also sees PID 1, so orphaning is never detected (a startup guard for this case is sketched after this list)
- std::process::exit(0) deadlocked: if another thread holds a mutex when exit is called, the exit handlers can hang. Given the lock contention identified below, this is plausible
- kqueue event didn't fire — edge case in how macOS handles process reparenting from DMG-launched apps
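For the first failure mode, a cheap guard is to check the parent PID before the watcher even starts. A minimal sketch, assuming a Unix target; `watch_parent_or_exit` is a hypothetical name, not the actual watcher entry point:

```rust
use std::os::unix::process::parent_id;
use std::process;

// Hypothetical startup guard, not the actual agentmuxsrv-rs watcher.
fn watch_parent_or_exit() {
    // parent_id() wraps getppid(), which cannot fail on Unix.
    if parent_id() == 1 {
        // Already reparented to launchd: the frontend died before the
        // watcher could attach, so watching PID 1 would never fire.
        eprintln!("parent exited before watcher start; shutting down");
        process::exit(0);
    }
    // ...otherwise register the kqueue EVFILT_PROC watcher on the real parent.
}
```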
Why 338% CPU
Code review identified multiple subsystems that continue running at full speed with zero connected clients:
Critical: Sysinfo loop publishes to zero subscribers
agentmuxsrv-rs/src/backend/sysinfo.rs:128-217
Runs once per second regardless of connected clients:
- Refreshes CPU, memory, disk, network metrics
- Serializes to JSON
- Acquires broker mutex, persists events (clone + Vec append + periodic realloc)
- Scans all subscriptions even with zero subscribers
- Enumerates all block PIDs and refreshes per-process metrics
- No early exit when zero clients are connected (a guard is sketched below)
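A minimal sketch of that early exit, assuming the broker exposes some subscriber count; `subscriber_count()` and the `Broker` shape here are illustrative, not the actual agentmuxsrv-rs API:

```rust
use std::sync::Arc;
use std::time::Duration;

// Illustrative broker handle; subscriber_count() is an assumed accessor.
struct Broker;
impl Broker {
    fn subscriber_count(&self) -> usize { 0 }
}

async fn sysinfo_loop(broker: Arc<Broker>) {
    let mut tick = tokio::time::interval(Duration::from_secs(1));
    loop {
        tick.tick().await;
        // With zero subscribers there is nobody to publish to: skip the
        // refresh, JSON serialization, persist, and per-PID scans entirely.
        if broker.subscriber_count() == 0 {
            continue;
        }
        // ...refresh CPU/memory/disk/network, serialize, publish...
    }
}
```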
High: EventBus polling every 20ms with mutex
agentmuxsrv-rs/src/backend/eventbus.rs:89-97
wait_for_connection() spins every 20ms, acquiring a mutex and scanning a HashMap: 50 lock acquisitions per second per waiting task. A Notify-based replacement is sketched below.
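A sketch of the tokio::sync::Notify replacement proposed in the fixes section; the EventBus fields and register path are assumptions about the code's shape. Note the enable() step, which tokio documents as the way to avoid losing a notify_waiters() that lands between the map check and the await:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::Notify;

struct Connection;

// Illustrative shape; field names are assumptions.
struct EventBus {
    connections: Mutex<HashMap<String, Connection>>,
    connected: Notify,
}

impl EventBus {
    // Sleeps until a registration happens instead of waking every 20ms.
    async fn wait_for_connection(&self, id: &str) {
        loop {
            let notified = self.connected.notified();
            tokio::pin!(notified);
            // Register interest *before* checking, so a notify_waiters()
            // firing between the check and the await is not lost.
            notified.as_mut().enable();
            if self.connections.lock().unwrap().contains_key(id) {
                return;
            }
            notified.await;
        }
    }

    fn register(&self, id: String, conn: Connection) {
        self.connections.lock().unwrap().insert(id, conn);
        self.connected.notify_waiters();
    }
}
```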
Medium: RPC Router polling every 30ms with mutex
agentmuxsrv-rs/src/backend/rpc/router.rs:185-198
Same busy-polling pattern as EventBus; the Notify replacement sketched above applies here as well.
Medium: WebSocket handler drains channels to nowhere
agentmuxsrv-rs/src/server/websocket.rs:133-250
After WebSocket disconnects, the tokio::select! loop may continue draining event channels. Events are processed but never sent.
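A sketch of the clean-exit shape, assuming the handler multiplexes an inbound socket stream against a per-client event channel; the channel types and names are illustrative:

```rust
use tokio::sync::mpsc;

// Illustrative handler; real code would hold the actual WebSocket stream.
async fn client_loop(
    mut ws_rx: mpsc::Receiver<Vec<u8>>, // frames from the socket
    mut events: mpsc::Receiver<String>, // events destined for this client
) {
    loop {
        tokio::select! {
            frame = ws_rx.recv() => match frame {
                Some(f) => { /* handle inbound frame */ let _ = f; }
                // Socket closed: break instead of continuing to drain
                // `events` for a peer that can never receive them.
                None => break,
            },
            ev = events.recv() => match ev {
                Some(e) => { /* forward to the socket */ let _ = e; }
                None => break, // all producers gone; nothing left to send
            },
        }
    }
    // Dropping `events` here is what actually unsubscribes this client.
}
```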
Medium: Broker persist memory churn
agentmuxsrv-rs/src/backend/wps.rs:253-282
Every published event is cloned multiple times. Every 10,240 appends, the entire Vec is reallocated.
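One way to cut the churn, sketched under the assumption that persisted events can be shared rather than owned: store Arc<Event> so persist-plus-fanout is a refcount bump per copy, and bound the log so it stops growing (and reallocating) at all. The Broker shape here is illustrative:

```rust
use std::collections::VecDeque;
use std::sync::Arc;

struct Event { payload: String }

// Illustrative broker; the real persist path lives in wps.rs.
struct Broker {
    log: VecDeque<Arc<Event>>,
    cap: usize,
}

impl Broker {
    fn publish(&mut self, ev: Event) {
        let ev = Arc::new(ev);
        // Bounded ring: once at capacity, drop the oldest event instead of
        // growing the buffer forever and periodically reallocating it.
        if self.log.len() == self.cap {
            self.log.pop_front();
        }
        self.log.push_back(Arc::clone(&ev));
        // ...hand Arc::clone(&ev) to each subscriber queue; no deep clones.
    }
}
```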
Medium: Subagent watcher tight try_recv loop
agentmuxsrv-rs/src/backend/subagent_watcher.rs:206-231
Debounce drain uses while let Ok(p) = rx.try_recv() — tight loop with no yield.
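A debounce drain that yields to the runtime instead of spinning, assuming a tokio mpsc channel of paths; `debounced_recv` and the 50ms quiet window are illustrative:

```rust
use std::path::PathBuf;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

// Block on the first event, then absorb follow-ups until the channel has
// been quiet for the debounce window. Every wait is a real await point,
// not a busy try_recv() loop.
async fn debounced_recv(rx: &mut mpsc::Receiver<PathBuf>) -> Vec<PathBuf> {
    let mut batch = Vec::new();
    if let Some(first) = rx.recv().await {
        batch.push(first);
        while let Ok(Some(next)) = timeout(Duration::from_millis(50), rx.recv()).await {
            batch.push(next);
        }
    }
    batch
}
```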
Caveat
These hotspots were identified from code review only, not from profiling the actual running process. The real bottleneck could be something else. Next time this reproduces, run:
```sh
# Sample the running backend for 5 seconds with the macOS built-in profiler
sample <PID> 5 -file /tmp/agentmux-cpu-profile.txt
```
This gives a real stack trace showing where CPU is actually spent.
Suggested fixes (pending profiling confirmation)
- Sysinfo loop: Skip collection when zero WebSocket clients are connected
- EventBus/RPC Router: Replace 20ms/30ms polling with tokio::sync::Notify (sketched above)
- Broker publish: Early return when no subscribers exist
- WebSocket handler: Exit select loop cleanly when connection dies
These make the backend idle quietly when no clients are connected, rather than shutting down (which risks killing sessions during transient disconnects).
Related
specs/SPEC_BACKEND_CPU_HOTSPOTS.md