
bug: backend burns 338% CPU after frontend force-quit #183

@a5af

Description


Observed

After force-quitting AgentMux on macOS, agentmuxsrv-rs (v0.32.43) stayed alive and consumed 338% CPU (nearly 4 cores maxed) for 33+ minutes. The system sat at 44% sys / 35% idle, entirely attributable to the orphaned backend.

The kqueue parent watcher (added in PR #144) was present in the binary but did not kill the process.

Environment

  • macOS, Apple M2, 8 cores
  • Backend launched from DMG mount (/Volumes/AgentMux/)
  • Frontend was force-quit (Cmd-Q or force quit from Activity Monitor)

Why the backend didn't die

The kqueue parent watcher calls std::process::exit(0) when the parent PID exits. Possible failure modes:

  1. Parent PID was already 1 (launchd) at startup — kqueue watches launchd, which never exits, and the fallback PPID polling also sees PID 1 → no detection
  2. std::process::exit(0) deadlocked — if another thread holds a mutex when exit runs, the atexit handlers can hang. Given the lock contention identified below, this is plausible
  3. kqueue event didn't fire — an edge case in how macOS reparents processes from DMG-launched apps
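To test failure mode 1 the next time this reproduces, the live backend's current parent PID can be checked without a debugger. A minimal std-only sketch (it shells out to ps since std has no getppid; function name is illustrative, not from the codebase):

```rust
use std::process::{self, Command};

// Ask ps for this process's parent PID. If the backend reports PPID 1,
// it was reparented to launchd, so both the kqueue watcher and the
// PPID-polling fallback were armed against a process that never exits.
fn current_ppid() -> u32 {
    let out = Command::new("ps")
        .args(["-o", "ppid=", "-p", &process::id().to_string()])
        .output()
        .expect("failed to run ps");
    String::from_utf8_lossy(&out.stdout)
        .trim()
        .parse()
        .expect("unparseable ppid")
}

fn main() {
    println!("ppid = {}", current_ppid());
}
```

Run the same ps invocation against the orphaned backend's PID instead of our own to confirm or rule out hypothesis 1.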

Why 338% CPU

Code review identified multiple subsystems that continue running at full speed with zero connected clients:

Critical: Sysinfo loop publishes to zero subscribers

agentmuxsrv-rs/src/backend/sysinfo.rs:128-217

Runs once per second regardless of whether any client is connected:

  • Refreshes CPU, memory, disk, network metrics
  • Serializes to JSON
  • Acquires broker mutex, persists events (clone + Vec append + periodic realloc)
  • Scans all subscriptions even with zero subscribers
  • Enumerates all block PIDs and refreshes per-process metrics
  • No early exit when zero clients are connected
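The missing early exit could be a one-line gate on a shared connection counter. A sketch, assuming a counter maintained by the WebSocket handler (hypothetical name, not the actual agentmuxsrv-rs type):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Gate each 1s sysinfo tick on the number of connected clients, so an
// orphaned backend does no refresh, serialize, or publish work at all.
fn should_collect(connected_clients: &AtomicUsize) -> bool {
    connected_clients.load(Ordering::Relaxed) > 0
}

fn main() {
    let clients = AtomicUsize::new(0);
    assert!(!should_collect(&clients)); // zero clients: skip the whole tick
    clients.fetch_add(1, Ordering::Relaxed);
    assert!(should_collect(&clients)); // a client connected: resume collection
}
```

With this in place the loop body reduces to a single atomic load per second when orphaned.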

High: EventBus polling every 20ms with mutex

agentmuxsrv-rs/src/backend/eventbus.rs:89-97

wait_for_connection() spins every 20ms, acquiring a mutex and scanning a HashMap. 50 lock acquisitions/sec per waiting task.
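The fix suggested below is tokio::sync::Notify; the same idea in a std-only sketch uses a Condvar so waiters block until a connection actually registers instead of re-acquiring the mutex 50 times a second (type and function names are illustrative):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Stand-in for the EventBus connection table: a flag plus a Condvar,
// instead of polling a mutex-guarded HashMap every 20ms.
struct ConnectionTable {
    ready: Mutex<bool>,
    cond: Condvar,
}

fn wait_for_connection(t: &ConnectionTable) {
    let mut ready = t.ready.lock().unwrap();
    // Sleeps with zero wakeups and zero lock churn until notified.
    while !*ready {
        ready = t.cond.wait(ready).unwrap();
    }
}

fn register_connection(t: &ConnectionTable) {
    *t.ready.lock().unwrap() = true;
    t.cond.notify_all();
}

fn main() {
    let table = Arc::new(ConnectionTable {
        ready: Mutex::new(false),
        cond: Condvar::new(),
    });
    let waiter = Arc::clone(&table);
    let h = thread::spawn(move || wait_for_connection(&waiter));
    register_connection(&table);
    h.join().unwrap();
}
```

In the async backend the equivalent is Notify::notified().await on the waiting side and notify_waiters() on registration.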

Medium: RPC Router polling every 30ms with mutex

agentmuxsrv-rs/src/backend/rpc/router.rs:185-198

Same busy-polling pattern as EventBus.

Medium: WebSocket handler drains channels to nowhere

agentmuxsrv-rs/src/server/websocket.rs:133-250

After WebSocket disconnects, the tokio::select! loop may continue draining event channels. Events are processed but never sent.
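The clean-exit fix is to stop draining once the peer is gone. A sketch modeling the socket side as a liveness check and the event side as a plain mpsc channel (both illustrative, not the actual tokio::select! code):

```rust
use std::sync::mpsc;

// Drain events only while the peer is alive; once the connection dies,
// break out instead of processing events that can never be sent.
fn drain_until_disconnect(
    events: mpsc::Receiver<String>,
    peer_alive: impl Fn() -> bool,
) -> usize {
    let mut forwarded = 0;
    for ev in events {
        if !peer_alive() {
            break; // connection dead: exit the loop, drop the receiver
        }
        let _ = ev; // would be serialized and written to the socket here
        forwarded += 1;
    }
    forwarded
}
```

Dropping the receiver on exit also lets senders observe the disconnect instead of filling a channel nobody reads.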

Medium: Broker persist memory churn

agentmuxsrv-rs/src/backend/wps.rs:253-282

Every published event is cloned multiple times. Every 10,240 appends, the entire Vec is reallocated.
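Two cheap mitigations, sketched together with illustrative types (not the actual wps.rs structures): return early when nobody is subscribed, and share the payload behind an Arc so fan-out bumps a refcount instead of deep-copying the event:

```rust
use std::sync::Arc;

struct Broker {
    subscribers: Vec<std::sync::mpsc::Sender<Arc<String>>>,
}

impl Broker {
    fn publish(&self, event: String) {
        if self.subscribers.is_empty() {
            return; // zero subscribers: no clone, no persist, no scan
        }
        let shared = Arc::new(event);
        for sub in &self.subscribers {
            // Arc::clone is a refcount increment, not a payload copy.
            let _ = sub.send(Arc::clone(&shared));
        }
    }
}
```

The periodic Vec reallocation is separate; if persistence must stay, a ring buffer or VecDeque with a fixed capacity would avoid the 10,240-append realloc cliff.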

Medium: Subagent watcher tight try_recv loop

agentmuxsrv-rs/src/backend/subagent_watcher.rs:206-231

Debounce drain uses while let Ok(p) = rx.try_recv(), a tight loop with no yield.
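A blocking debounce avoids the spin entirely: take the first event, then keep collecting until the channel has been quiet for the debounce window. A std-mpsc sketch (the real watcher presumably uses an async channel, so names and types are illustrative):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// Block for the first event, then batch everything that arrives within
// `quiet` of the previous event. No busy loop, no yield-free spinning.
fn debounced_drain(rx: &Receiver<String>, quiet: Duration) -> Vec<String> {
    let mut batch = Vec::new();
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch, // all senders gone
    }
    loop {
        match rx.recv_timeout(quiet) {
            Ok(p) => batch.push(p), // another event within the window
            Err(RecvTimeoutError::Timeout) => break, // quiet period elapsed
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

The async equivalent is awaiting recv() with a tokio timeout rather than looping on try_recv.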

Caveat

These hotspots were identified from code review only, not from profiling the actual running process. The real bottleneck could be something else. Next time this reproduces, run:

# Attach macOS profiler to the running backend
sample <PID> 5 -file /tmp/agentmux-cpu-profile.txt

This gives a real stack trace showing where CPU is actually spent.

Suggested fixes (pending profiling confirmation)

  1. Sysinfo loop: Skip collection when zero WebSocket clients are connected
  2. EventBus/RPC Router: Replace 20ms/30ms polling with tokio::sync::Notify
  3. Broker publish: Early return when no subscribers exist
  4. WebSocket handler: Exit select loop cleanly when connection dies

These make the backend idle quietly when no clients are connected, rather than shutting down (which risks killing sessions during transient disconnects).
