Observed
After force-quitting AgentMux on macOS, agentmuxsrv-rs (v0.32.43) stayed alive and consumed 338% CPU (nearly 4 cores maxed) for 33+ minutes. System was at 44% sys / 35% idle — entirely caused by the orphaned backend.
The kqueue parent watcher (added in PR #144) was present in the binary but did not kill the process.
Environment
- macOS, Apple M2, 8 cores
- Backend launched from DMG mount (/Volumes/AgentMux/)
- Frontend was force-quit (Cmd-Q or force quit from Activity Monitor)
Why the backend didn't die
The kqueue parent watcher calls std::process::exit(0) when the parent PID exits. Possible failure modes:
- Parent PID was already 1 (launchd) at startup: kqueue watches launchd, which never exits, and the PPID-polling fallback also sees PID 1, so orphaning is never detected (a startup guard for this case is sketched after this list)
- std::process::exit(0) deadlocked: if another thread holds a mutex when exit is called, the exit handlers can hang. Given the lock contention identified below, this is plausible
- kqueue event didn't fire — edge case in how macOS handles process reparenting from DMG-launched apps
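For the first failure mode, a cheap guard is to check the parent PID before the watcher even starts. A minimal sketch, assuming a Unix target; `watch_parent_or_exit` is a hypothetical name, not the actual watcher entry point:

```rust
use std::os::unix::process::parent_id;
use std::process;

// Hypothetical startup guard, not the actual agentmuxsrv-rs watcher.
fn watch_parent_or_exit() {
    // parent_id() wraps getppid(), which cannot fail on Unix.
    if parent_id() == 1 {
        // Already reparented to launchd: the frontend died before the
        // watcher could attach, so watching PID 1 would never fire.
        eprintln!("parent exited before watcher start; shutting down");
        process::exit(0);
    }
    // ...otherwise register the kqueue EVFILT_PROC watcher on the real parent.
}
```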
Why 338% CPU
Code review identified multiple subsystems that continue running at full speed with zero connected clients:
Critical: Sysinfo loop publishes to zero subscribers
agentmuxsrv-rs/src/backend/sysinfo.rs:128-217
Runs once per second regardless of connected clients:
- Refreshes CPU, memory, disk, network metrics
- Serializes to JSON
- Acquires broker mutex, persists events (clone + Vec append + periodic realloc)
- Scans all subscriptions even with zero subscribers
- Enumerates all block PIDs and refreshes per-process metrics
- No early exit when zero clients are connected (a guard is sketched below)
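A minimal sketch of that early exit, assuming the broker exposes some subscriber count; `subscriber_count()` and the `Broker` shape here are illustrative, not the actual agentmuxsrv-rs API:

```rust
use std::sync::Arc;
use std::time::Duration;

// Illustrative broker handle; subscriber_count() is an assumed accessor.
struct Broker;
impl Broker {
    fn subscriber_count(&self) -> usize { 0 }
}

async fn sysinfo_loop(broker: Arc<Broker>) {
    let mut tick = tokio::time::interval(Duration::from_secs(1));
    loop {
        tick.tick().await;
        // With zero subscribers there is nobody to publish to: skip the
        // refresh, JSON serialization, persist, and per-PID scans entirely.
        if broker.subscriber_count() == 0 {
            continue;
        }
        // ...refresh CPU/memory/disk/network, serialize, publish...
    }
}
```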
High: EventBus polling every 20ms with mutex
agentmuxsrv-rs/src/backend/eventbus.rs:89-97
wait_for_connection() spins every 20ms, acquiring a mutex and scanning a HashMap: 50 lock acquisitions per second per waiting task. A Notify-based replacement is sketched below.
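A sketch of the tokio::sync::Notify replacement proposed in the fixes section; the EventBus fields and register path are assumptions about the code's shape. Note the enable() step, which tokio documents as the way to avoid losing a notify_waiters() that lands between the map check and the await:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::Notify;

struct Connection;

// Illustrative shape; field names are assumptions.
struct EventBus {
    connections: Mutex<HashMap<String, Connection>>,
    connected: Notify,
}

impl EventBus {
    // Sleeps until a registration happens instead of waking every 20ms.
    async fn wait_for_connection(&self, id: &str) {
        loop {
            let notified = self.connected.notified();
            tokio::pin!(notified);
            // Register interest *before* checking, so a notify_waiters()
            // firing between the check and the await is not lost.
            notified.as_mut().enable();
            if self.connections.lock().unwrap().contains_key(id) {
                return;
            }
            notified.await;
        }
    }

    fn register(&self, id: String, conn: Connection) {
        self.connections.lock().unwrap().insert(id, conn);
        self.connected.notify_waiters();
    }
}
```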
Medium: RPC Router polling every 30ms with mutex
agentmuxsrv-rs/src/backend/rpc/router.rs:185-198
Same busy-polling pattern as EventBus; the Notify replacement sketched above applies here as well.
Medium: WebSocket handler drains channels to nowhere
agentmuxsrv-rs/src/server/websocket.rs:133-250
After WebSocket disconnects, the tokio::select! loop may continue draining event channels. Events are processed but never sent.
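A sketch of the clean-exit shape, assuming the handler multiplexes an inbound socket stream against a per-client event channel; the channel types and names are illustrative:

```rust
use tokio::sync::mpsc;

// Illustrative handler; real code would hold the actual WebSocket stream.
async fn client_loop(
    mut ws_rx: mpsc::Receiver<Vec<u8>>, // frames from the socket
    mut events: mpsc::Receiver<String>, // events destined for this client
) {
    loop {
        tokio::select! {
            frame = ws_rx.recv() => match frame {
                Some(f) => { /* handle inbound frame */ let _ = f; }
                // Socket closed: break instead of continuing to drain
                // `events` for a peer that can never receive them.
                None => break,
            },
            ev = events.recv() => match ev {
                Some(e) => { /* forward to the socket */ let _ = e; }
                None => break, // all producers gone; nothing left to send
            },
        }
    }
    // Dropping `events` here is what actually unsubscribes this client.
}
```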
Medium: Broker persist memory churn
agentmuxsrv-rs/src/backend/wps.rs:253-282
Every published event is cloned multiple times. Every 10,240 appends, the entire Vec is reallocated.
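One way to cut the churn, sketched under the assumption that persisted events can be shared rather than owned: store Arc<Event> so persist-plus-fanout is a refcount bump per copy, and bound the log so it stops growing (and reallocating) at all. The Broker shape here is illustrative:

```rust
use std::collections::VecDeque;
use std::sync::Arc;

struct Event { payload: String }

// Illustrative broker; the real persist path lives in wps.rs.
struct Broker {
    log: VecDeque<Arc<Event>>,
    cap: usize,
}

impl Broker {
    fn publish(&mut self, ev: Event) {
        let ev = Arc::new(ev);
        // Bounded ring: once at capacity, drop the oldest event instead of
        // growing the buffer forever and periodically reallocating it.
        if self.log.len() == self.cap {
            self.log.pop_front();
        }
        self.log.push_back(Arc::clone(&ev));
        // ...hand Arc::clone(&ev) to each subscriber queue; no deep clones.
    }
}
```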
Medium: Subagent watcher tight try_recv loop
agentmuxsrv-rs/src/backend/subagent_watcher.rs:206-231
Debounce drain uses while let Ok(p) = rx.try_recv() — tight loop with no yield.
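A debounce drain that yields to the runtime instead of spinning, assuming a tokio mpsc channel of paths; `debounced_recv` and the 50ms quiet window are illustrative:

```rust
use std::path::PathBuf;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

// Block on the first event, then absorb follow-ups until the channel has
// been quiet for the debounce window. Every wait is a real await point,
// not a busy try_recv() loop.
async fn debounced_recv(rx: &mut mpsc::Receiver<PathBuf>) -> Vec<PathBuf> {
    let mut batch = Vec::new();
    if let Some(first) = rx.recv().await {
        batch.push(first);
        while let Ok(Some(next)) = timeout(Duration::from_millis(50), rx.recv()).await {
            batch.push(next);
        }
    }
    batch
}
```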
Caveat
These hotspots were identified from code review only, not from profiling the actual running process. The real bottleneck could be something else. Next time this reproduces, run:
```sh
# Sample the running backend for 5 seconds with the macOS built-in profiler
sample <PID> 5 -file /tmp/agentmux-cpu-profile.txt
```
This gives a real stack trace showing where CPU is actually spent.
Suggested fixes (pending profiling confirmation)
- Sysinfo loop: Skip collection when zero WebSocket clients are connected
- EventBus/RPC Router: Replace 20ms/30ms polling with tokio::sync::Notify (sketched above)
- Broker publish: Early return when no subscribers exist
- WebSocket handler: Exit select loop cleanly when connection dies
These make the backend idle quietly when no clients are connected, rather than shutting down (which risks killing sessions during transient disconnects).
Related
specs/SPEC_BACKEND_CPU_HOTSPOTS.md