Fix: Enforce shell tool timeout and kill process group on expiry #9

Merged
AnthonyRonning merged 2 commits into AnthonyRonning:master from marksftw:fix/shell-tool-timeout-enforcement
Feb 9, 2026
@marksftw
Contributor

@marksftw marksftw commented Feb 8, 2026

Summary

Fixes #8

The shell tool's timeout parameter was parsed but never enforced. Commands were executed via std::process::Command::output(), a synchronous blocking call with no time limit. If the LLM launched a long-running or persistent process (e.g. a daemon like Syncthing), output() would block indefinitely, freezing the entire agent event loop. Sage would continue receiving Signal messages but could never process or respond to any of them.

Root Cause

In crates/sage-core/src/shell_tool.rs, the timeout value was parsed at lines 98-102 and logged, but the actual execution at lines 123-128 used std::process::Command::output(), which blocks until the child process exits -- with no timeout mechanism whatsoever.

Additionally, even if a timeout had been implemented, the standard Command API has no way to kill child process trees. A command like nohup ./daemon & spawns children that would survive killing only the top-level bash process.

What Changed

crates/sage-core/src/shell_tool.rs:

  1. Replaced std::process::Command (sync) with tokio::process::Command (async) -- the command is now spawned as an async child process instead of blocking the thread.

  2. Wrapped execution in tokio::time::timeout() -- the parsed timeout value is now actually enforced. Default remains 60s. The agent manages its own timeouts by setting the timeout parameter appropriately for each command.

  3. Spawned the child in its own process group (.process_group(0)) -- on timeout, kill(-pgid, SIGKILL) is sent to the entire process group, killing all descendants (background processes, daemons, etc.), not just the top-level shell. The killed child is reaped afterwards so no zombie process is left behind.

  4. Partial output returned on timeout -- stdout/stderr pipe handles are held separately from the child process. On timeout, after killing the process group, any buffered output is drained and returned as STDOUT (partial) / STDERR (partial) so the agent can see what the command produced before it was killed.

  5. Updated tool description -- instructs the agent to set the timeout appropriately for each command. No hard cap on timeout duration (24h safety rail for nonsensical values only).
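The timeout and process-group steps above can be sketched roughly as follows. This is not the code from the PR: to stay dependency-free it swaps in a std-only polling loop (try_wait plus a deadline) for tokio::time::timeout, and declares kill by hand instead of using the libc crate. The function name run_with_timeout, the use of sh -c, the 10 ms poll interval, and the hard-coded SIGKILL value (9, as on Linux) are all assumptions made for illustration.

```rust
use std::io;
use std::os::unix::process::CommandExt; // provides process_group (Rust 1.64+)
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Hand-written declaration of POSIX kill(2); the PR uses libc::kill instead.
unsafe extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const SIGKILL: i32 = 9; // value on Linux; assumed here rather than taken from libc

/// Hypothetical helper: runs `cmd` under `sh -c`.
/// Returns Ok(Some(status)) if the command exits in time,
/// or Ok(None) if its whole process group was killed on timeout.
fn run_with_timeout(cmd: &str, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(cmd)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .process_group(0) // new group whose pgid equals the child's pid
        .spawn()?;

    let pgid = child.id() as i32;
    let deadline = Instant::now() + timeout;

    loop {
        // Non-blocking check: has the top-level shell exited yet?
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // A negative pid targets every process in the group.
            let _ = unsafe { kill(-pgid, SIGKILL) };
            // Reap the shell so it does not linger as a zombie.
            child.wait()?;
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

Under these assumptions, run_with_timeout("sleep 999", Duration::from_secs(60)) would return Ok(None) after about 60 seconds instead of blocking the caller for the full 999 seconds. Descendants that move themselves into a different process group or session would not be reached by this signal.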

crates/sage-core/Cargo.toml:

  • Added libc = "0.2" dependency for libc::kill() to send signals to process groups.

How It Was Tested

Environment

  • Host: Linux (Docker 29.2.0)
  • Sage image: rebuilt from fix/shell-tool-timeout-enforcement branch
  • All three containers healthy (sage, sage-postgres, sage-signal-cli)

Test 1: Command completes within timeout

Input: Asked Sage via Signal to run sleep 120 (LLM chose timeout of 130s)
Expected: Command completes normally after 120s
Result: PASS -- command ran for full 120 seconds, exited with code 0, Sage responded immediately after completion

Test 2: Command exceeds default timeout

Input: Asked Sage via Signal to run sleep 999 with default timeout
Expected: Command killed after 60s (default timeout), Sage returns timeout error and resumes processing
Result: PASS -- shell tool logged Executing shell command: sleep 999 (timeout: 60s), command was killed at 60s, Sage received Command timed out after 60s error and responded to the user within seconds

Test 3: Agent not blocked during/after timeout

Input: Sent follow-up messages while sleep 999 was running
Expected: Sage processes messages after timeout fires (not permanently stuck)
Result: PASS -- Sage responded immediately after the 60s timeout, no zombie processes left behind

Before vs After

Scenario | Before (sync Command::output()) | After (async + process group kill)
--- | --- | ---
sleep 999 | Agent blocked for 999s | Killed at 60s, agent resumes with partial output
nohup ./daemon & | Agent blocked forever | Killed at timeout, agent resumes with partial output
Zombie processes | Accumulated on each stuck command | Process group killed cleanly
Long build (10+ min) | Would have been capped at 300s | Agent sets appropriate timeout, no artificial cap

Production Context

This bug was discovered when the LLM decided to launch Syncthing (a file sync daemon) via the shell tool. The nohup ./syncthing ... & command spawned a persistent process, and the parent bash shell never exited. Sage became completely unresponsive in Signal -- it received messages but could not process them. The only recovery was to manually exec into the container and kill the hung processes. The LLM would then immediately try to relaunch Syncthing, getting stuck again in an infinite loop. This happened 5+ times before this fix was implemented.

The shell tool parsed the timeout parameter but never used it --
commands ran via std::process::Command::output() which blocks
indefinitely. This caused the agent to freeze whenever the LLM
launched a long-running or persistent process (e.g. a daemon).

Switch to tokio::process::Command with tokio::time::timeout() for
actual enforcement. Spawn children in their own process group
(.process_group(0)) so kill(-pgid, SIGKILL) reaches all descendants
on timeout. Update the tool description to warn the LLM against
launching background daemons.

Fixes AnthonyRonning#8
@AnthonyRonning
Owner

I think the agent should manage its own timeouts, defaulting to 60s, and the command output should still show what was done before it was killed

Address review feedback:
- On timeout, drain stdout/stderr pipes before returning so the agent
  sees what the command produced before it was killed
- Remove 300s hard cap on timeout (now 24h safety rail); agent sets
  whatever timeout is appropriate, default remains 60s
- Clean up tool description: no more lecturing about daemons
- Reap zombie process after SIGKILL to avoid leaking PIDs
- Extract drain_pipe/drain_stderr/format_output helpers
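The timeout policy described in this commit (default of 60s, no 300s cap, a 24h safety rail) could look something like the sketch below. The function name resolve_timeout, the constant names, and the Option<u64> input are hypothetical; how the real tool parses the timeout parameter is not shown in this PR description.

```rust
use std::time::Duration;

const DEFAULT_TIMEOUT_SECS: u64 = 60; // default stated in the PR
const MAX_TIMEOUT_SECS: u64 = 24 * 60 * 60; // 24h safety rail stated in the PR

/// Hypothetical helper: turns the agent-supplied timeout (if any) into a Duration.
/// A missing value falls back to the default; anything above 24h is capped.
fn resolve_timeout(requested_secs: Option<u64>) -> Duration {
    let secs = requested_secs
        .unwrap_or(DEFAULT_TIMEOUT_SECS)
        .min(MAX_TIMEOUT_SECS);
    Duration::from_secs(secs)
}
```

With this shape, a 130s request for a 120s sleep passes through unchanged, while a nonsensical value such as u64::MAX is reduced to 24 hours.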
@marksftw
Contributor Author

marksftw commented Feb 8, 2026

Good call on both points.

Pushed a fix:

  1. Agent manages timeouts — Removed the 300s hard cap. The agent is free to set whatever timeout it needs (default stays 60s). There's a 24-hour ceiling as a sanity check against nonsensical values, but happy to remove that entirely if you'd prefer no cap at all.
  2. Partial output on timeout — Restructured the execution to hold the stdout/stderr pipe handles separately (.take() before waiting). On timeout, after killing the process group, we drain whatever was buffered in the pipes and return it as STDOUT (partial) / STDERR (partial) along with the timeout notice. The agent now sees exactly what happened before the kill.
  3. Cleaned up tool description — Removed the lecturing "do NOT launch daemons" language. The description now just tells the agent to set the timeout appropriately and that partial output is returned if it expires.
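As a rough illustration of the partial-output behaviour in point 2, the sketch below drains whatever is left in a pipe and labels it as partial when a timeout fired. The helper names echo the drain_pipe and format_output helpers mentioned in the commit message, but the signatures and the exact output layout here are assumptions, not the PR's code.

```rust
use std::io::Read;

/// Hypothetical helper: reads everything still buffered in a pipe handle
/// (for example the value returned by child.stdout.take()).
fn drain_pipe<R: Read>(pipe: Option<R>) -> String {
    let mut buf = Vec::new();
    if let Some(mut p) = pipe {
        // Ignore read errors: bytes read before the error stay in `buf`.
        let _ = p.read_to_end(&mut buf);
    }
    // Lossy conversion so non-UTF-8 output cannot make the drain fail.
    String::from_utf8_lossy(&buf).into_owned()
}

/// Hypothetical helper: builds the text handed back to the agent.
/// `timed_out_after_secs` is Some(n) when the command was killed after n seconds.
fn format_output(stdout: &str, stderr: &str, timed_out_after_secs: Option<u64>) -> String {
    let suffix = if timed_out_after_secs.is_some() { " (partial)" } else { "" };
    let mut out = String::new();
    if let Some(secs) = timed_out_after_secs {
        out.push_str(&format!("Command timed out after {}s\n", secs));
    }
    if !stdout.is_empty() {
        out.push_str(&format!("STDOUT{}:\n{}\n", suffix, stdout.trim_end()));
    }
    if !stderr.is_empty() {
        out.push_str(&format!("STDERR{}:\n{}\n", suffix, stderr.trim_end()));
    }
    out
}
```

Draining only after the process group has been killed matters here: read_to_end returns once every writer has closed its end of the pipe, so a surviving descendant that still holds the pipe open would keep the read blocked.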

@AnthonyRonning AnthonyRonning merged commit e8732a2 into AnthonyRonning:master Feb 9, 2026
7 checks passed
@marksftw marksftw deleted the fix/shell-tool-timeout-enforcement branch February 9, 2026 21:15
Development

Successfully merging this pull request may close these issues.

Shell tool timeout is parsed but never enforced, allowing long-running commands to block the agent indefinitely
