
fix(bridge): kill full process group on CLI timeout to prevent orphaned subprocesses #466

Open
nlenepveu wants to merge 2 commits into asheshgoplani:main from blackfuel-ai:fix/run-cli-orphaned-subprocess

Conversation

@nlenepveu
Contributor

Problem

When run_cli times out, it kills the agent-deck process, but any grandchild processes spawned by agent-deck (most notably tmux send-keys -l) survive as orphans. SIGKILL sent to a single PID reaches only that process; it does not propagate to the process's children.

These orphans continue feeding characters into the tmux pane's PTY input buffer. On a narrow PTY the buffer fills, and subsequent tmux send-keys calls (from the next heartbeat or user message) block indefinitely, freezing the conductor until the orphans are killed manually.

Observed failure mode: Bridge and systemd heartbeat both stuck in tmux send-keys -l for 2+ days, conductor unreachable, no Telegram/Slack responses.

Fix

  • start_new_session=True when launching agent-deck puts it in its own session and process group, isolated from the bridge
  • os.killpg(os.getpgid(proc.pid), signal.SIGKILL) on TimeoutExpired kills the entire group, including all grandchildren
  • Falls back to proc.kill() if the process group is already gone (race condition)
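
The three points above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the run_cli signature and return value are assumptions, and it targets POSIX only (os.killpg is not available on Windows).

```python
import os
import signal
import subprocess


def run_cli(cmd, timeout):
    """Run cmd; on timeout, SIGKILL its whole process group and re-raise.

    Hypothetical stand-in for the bridge's run_cli; the real signature,
    arguments, and return value may differ.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        # The child becomes leader of a new session and process group,
        # so its descendants share a group separate from the caller's.
        start_new_session=True,
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # Signal every process in the child's group, grandchildren included.
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except ProcessLookupError:
            # The group is already gone; fall back to the direct child.
            proc.kill()
        # Reap the child and drain the pipes so no zombie or open fd is left.
        proc.communicate()
        raise
    return proc.returncode, stdout, stderr
```

A descendant that moves itself into another session or process group (for example by calling setsid) would not be reached by the group kill, so this covers only descendants that stay in the group.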

Test plan

  • Manually verify: send a message that causes session send --wait to time out → confirm no tmux send-keys orphan survives (ps aux | grep send-keys)
  • Subsequent sends to the same pane complete normally after a timeout
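
As a possible automated complement to the manual check, the orphan behaviour can be reproduced without agent-deck or tmux. The sketch below is an assumption-laden stand-in: it uses sh and sleep in place of agent-deck and tmux send-keys, and detects a surviving grandchild by the fact that it keeps the inherited stdout pipe open. The helper name orphan_survives is hypothetical.

```python
import os
import signal
import subprocess


def orphan_survives(kill_group):
    """Return True if a grandchild outlives the kill (POSIX only).

    The shell backgrounds `sleep`, which inherits the stdout pipe. As long
    as the grandchild is alive, the pipe never reaches EOF.
    """
    proc = subprocess.Popen(
        ["sh", "-c", "sleep 30 & wait"],
        stdout=subprocess.PIPE,
        start_new_session=True,  # new session: the group id equals proc.pid
    )
    try:
        proc.communicate(timeout=0.5)  # simulate the CLI timeout
    except subprocess.TimeoutExpired:
        pass

    if kill_group:
        os.killpg(proc.pid, signal.SIGKILL)  # kills sh and sleep together
    else:
        proc.kill()  # kills sh only; sleep is left running

    try:
        proc.communicate(timeout=2)  # EOF arrives only once sleep is dead
        return False
    except subprocess.TimeoutExpired:
        # Clean up the surviving grandchild before reporting it.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.communicate()
        return True
```

Calling orphan_survives(kill_group=False) exercises the pre-fix behaviour, and orphan_survives(kill_group=True) exercises the group kill.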

…ed subprocesses

When run_cli times out and kills agent-deck, grandchild processes spawned
by agent-deck (e.g. tmux send-keys -l) survive as orphans because SIGKILL
sent to a single PID does not propagate to that process's children. These
orphans continue feeding characters into the tmux pane's PTY input buffer,
jamming subsequent sends indefinitely.

Fix:
- start_new_session=True on subprocess.run puts agent-deck in its own
  process group, isolated from the bridge's group
- os.killpg() on TimeoutExpired kills the entire process group, including
  all grandchildren
- Fallback to proc.kill() if the process group is already gone
…roup kill

subprocess.run() raises TimeoutExpired without exposing the child process:
the exception has no .proc attribute, and run() never returns the Popen
handle. Replace with Popen + communicate(timeout=) so we hold the handle
ourselves and can reliably kill the process group on timeout.