Fix: Enforce shell tool timeout and kill process group on expiry #9
Merged
AnthonyRonning merged 2 commits into AnthonyRonning:master from fix/shell-tool-timeout-enforcement on Feb 9, 2026
Conversation
The shell tool parsed the timeout parameter but never used it: commands ran via std::process::Command::output(), which blocks indefinitely. This caused the agent to freeze whenever the LLM launched a long-running or persistent process (e.g. a daemon). This change switches to tokio::process::Command with tokio::time::timeout() for actual enforcement, and spawns children in their own process group (.process_group(0)) so kill(-pgid, SIGKILL) reaches all descendants on timeout. The tool description is updated to warn the LLM against launching background daemons. Fixes AnthonyRonning#8
Owner: I think the agent should manage their own timeouts, defaulted to 60s, and the command output should still show what was done before it was killed.
Address review feedback:
- On timeout, drain stdout/stderr pipes before returning so the agent sees what the command produced before it was killed
- Remove the 300s hard cap on timeout (now a 24h safety rail); the agent sets whatever timeout is appropriate, and the default remains 60s
- Clean up the tool description: no more lecturing about daemons
- Reap the zombie process after SIGKILL to avoid leaking PIDs
- Extract drain_pipe/drain_stderr/format_output helpers
Contributor (Author): Good call on both points. Pushed a fix.
Summary
Fixes #8
The shell tool's timeout parameter was parsed but never enforced. Commands were executed via std::process::Command::output(), a synchronous blocking call with no time limit. If the LLM launched a long-running or persistent process (e.g. a daemon like Syncthing), output() would block indefinitely, freezing the entire agent event loop. Sage would continue receiving Signal messages but could never process or respond to any of them.
Root Cause
In crates/sage-core/src/shell_tool.rs, the timeout value was parsed at lines 98-102 and logged, but the actual execution at lines 123-128 used std::process::Command::output(), which blocks until the child process exits, with no timeout mechanism whatsoever.
Additionally, even if a timeout had been implemented, the standard Command API has no way to kill child process trees. A command like nohup ./daemon & spawns children that would survive killing only the top-level bash process.
What Changed
crates/sage-core/src/shell_tool.rs:
- Replaced std::process::Command (sync) with tokio::process::Command (async): the command is now spawned as an async child process instead of blocking the thread.
- Wrapped execution in tokio::time::timeout(): the parsed timeout value is now actually enforced. The default remains 60s; the agent manages its own timeouts by setting the timeout parameter appropriately for each command.
- Spawned children in their own process group (.process_group(0)): on timeout, kill(-pgid, SIGKILL) is sent to the entire process group, killing all descendants (background processes, daemons, etc.), not just the top-level shell. Zombie processes are reaped after the kill.
- Partial output returned on timeout: stdout/stderr pipe handles are held separately from the child process. On timeout, after killing the process group, any buffered output is drained and returned as STDOUT (partial) / STDERR (partial) so the agent can see what the command produced before it was killed.
- Updated the tool description: it instructs the agent to set the timeout appropriately for each command. There is no hard cap on timeout duration (a 24h safety rail for nonsensical values only).
crates/sage-core/Cargo.toml:
- Added the libc = "0.2" dependency for libc::kill() to send signals to process groups.
How It Was Tested
Environment: fix/shell-tool-timeout-enforcement branch
Test 1: Command completes within timeout
Input: Asked Sage via Signal to run sleep 120 (the LLM chose a timeout of 130s)
Expected: Command completes normally after 120s
Result: PASS. The command ran for the full 120 seconds, exited with code 0, and Sage responded immediately after completion.
Test 2: Command exceeds default timeout
Input: Asked Sage via Signal to run sleep 999 with the default timeout
Expected: Command killed after 60s (the default timeout); Sage returns a timeout error and resumes processing
Result: PASS. The shell tool logged Executing shell command: sleep 999 (timeout: 60s), the command was killed at 60s, and Sage received a Command timed out after 60s error and responded to the user within seconds.
Test 3: Agent not blocked during/after timeout
Input: Sent follow-up messages while sleep 999 was running
Expected: Sage processes messages after the timeout fires (not permanently stuck)
Result: PASS. Sage responded immediately after the 60s timeout, with no zombie processes left behind.
Before vs After
Before (blocking Command::output()): sleep 999 hung the agent until the process exited, and nohup ./daemon & left children that survived killing the shell.
After: the timeout is actually enforced, the entire process group is killed on expiry, and partial output is returned to the agent.
Production Context
This bug was discovered when the LLM decided to launch Syncthing (a file sync daemon) via the shell tool. The nohup ./syncthing ... & command spawned a persistent process, and the parent bash shell never exited. Sage became completely unresponsive in Signal: it received messages but could not process them. The only recovery was to manually exec into the container and kill the hung processes. The LLM would then immediately try to relaunch Syncthing, getting stuck again in an infinite loop. This happened 5+ times before this fix was implemented.