Fix: Enforce shell tool timeout and kill process group on expiry by marksftw · Pull Request #9 · AnthonyRonning/sage

marksftw · 2026-02-08T03:44:04Z

Summary

Fixes #8

The shell tool's timeout parameter was parsed but never enforced. Commands were executed via std::process::Command::output(), a synchronous blocking call with no time limit. If the LLM launched a long-running or persistent process (e.g. a daemon like Syncthing), output() would block indefinitely, freezing the entire agent event loop. Sage would continue receiving Signal messages but could never process or respond to any of them.

Root Cause

In crates/sage-core/src/shell_tool.rs, the timeout value was parsed and logged at lines 98-102, but the actual execution at lines 123-128 used std::process::Command::output(), which blocks until the child process exits and has no timeout mechanism whatsoever.

Additionally, even if a timeout had been implemented, the standard Command API has no way to kill child process trees. A command like nohup ./daemon & spawns children that would survive killing only the top-level bash process.
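The pre-fix pattern described above can be sketched roughly as follows. This is an illustration, not the original code: the function name `run_blocking`, the use of `sh -c`, and the log line (modeled on the one quoted in Test 2) are assumptions.

```rust
use std::io;
use std::process::{Command, Output};
use std::time::Duration;

/// Hypothetical stand-in for the pre-fix code path: the timeout is parsed
/// and logged, but nothing ever enforces it.
pub fn run_blocking(command: &str, timeout_secs: Option<u64>) -> io::Result<Output> {
    let timeout = Duration::from_secs(timeout_secs.unwrap_or(60));
    eprintln!(
        "Executing shell command: {command} (timeout: {}s)",
        timeout.as_secs()
    );

    // `output()` waits for the child to exit and reads both pipes to EOF.
    // A backgrounded process that inherits those pipes keeps this call
    // waiting for as long as it lives, regardless of `timeout`.
    Command::new("sh").arg("-c").arg(command).output()
}
```

Passing a timeout of 0 still lets the command run to completion, which is the symptom the issue describes: the value has no effect on execution.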

What Changed

crates/sage-core/src/shell_tool.rs:

Replaced std::process::Command (sync) with tokio::process::Command (async) -- the command is now spawned as an async child process instead of blocking the thread.
Wrapped execution in tokio::time::timeout() -- the parsed timeout value is now actually enforced. Default remains 60s. The agent manages its own timeouts by setting the timeout parameter appropriately for each command.
Spawned children in their own process group (.process_group(0)) -- on timeout, kill(-pgid, SIGKILL) is sent to the entire process group, killing all descendants (background processes, daemons, etc.), not just the top-level shell. Zombie processes are reaped after the kill.
Partial output returned on timeout -- stdout/stderr pipe handles are held separately from the child process. On timeout, after killing the process group, any buffered output is drained and returned as STDOUT (partial) / STDERR (partial) so the agent can see what the command produced before it was killed.
Updated tool description -- instructs the agent to set the timeout appropriately for each command. No hard cap on timeout duration (24h safety rail for nonsensical values only).
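The changes listed above can be approximated with the standard library alone. The sketch below is not the PR's code: the PR uses tokio::process::Command, tokio::time::timeout() and libc::kill(), whereas this version swaps in a polling loop, reader threads, and a direct declaration of the platform `kill` function so it has no dependencies. The names `ShellResult`, `drain_in_background` and `run_with_timeout`, and the use of `sh -c`, are assumptions. Unix only.

```rust
use std::io::{self, Read};
use std::os::unix::process::CommandExt; // provides `process_group`
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Declared directly to keep the sketch dependency-free; the PR uses `libc::kill`.
unsafe extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}
const SIGKILL: i32 = 9;

pub struct ShellResult {
    pub timed_out: bool,
    pub exit_code: Option<i32>,
    pub stdout: String,
    pub stderr: String,
}

/// Reads a pipe to EOF on a helper thread and returns whatever arrived.
fn drain_in_background<R: Read + Send + 'static>(mut pipe: R) -> thread::JoinHandle<Vec<u8>> {
    thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = pipe.read_to_end(&mut buf);
        buf
    })
}

pub fn run_with_timeout(command: &str, timeout: Duration) -> io::Result<ShellResult> {
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(command)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .process_group(0) // new group whose pgid equals the child's pid
        .spawn()?;
    let pgid = child.id() as i32;

    // Hold the pipe handles separately from the child so output survives a kill.
    let out = drain_in_background(child.stdout.take().expect("stdout was piped"));
    let err = drain_in_background(child.stderr.take().expect("stderr was piped"));

    let deadline = Instant::now() + timeout;
    let mut timed_out = false;
    let status = loop {
        // Finished only when both pipes hit EOF and the shell has exited.
        if out.is_finished() && err.is_finished() {
            if let Some(status) = child.try_wait()? {
                break status;
            }
        }
        if Instant::now() >= deadline {
            timed_out = true;
            // A negative pid signals every process in the group.
            unsafe { kill(-pgid, SIGKILL) };
            break child.wait()?; // reap the shell so it does not linger as a zombie
        }
        thread::sleep(Duration::from_millis(20));
    };

    // Once the group is dead the pipes close, so these joins return
    // (unless a descendant left the group, e.g. via setsid).
    let stdout = String::from_utf8_lossy(&out.join().unwrap_or_default()).into_owned();
    let stderr = String::from_utf8_lossy(&err.join().unwrap_or_default()).into_owned();
    Ok(ShellResult { timed_out, exit_code: status.code(), stdout, stderr })
}
```

Calling `run_with_timeout("echo started; sleep 30", Duration::from_millis(300))` should return with `timed_out` set and the text printed before the kill in `stdout`. Waiting for pipe EOF as well as process exit is a deliberate choice here: it means a backgrounded process that keeps the pipes open is treated as still running and is killed at the deadline.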

crates/sage-core/Cargo.toml:

Added libc = "0.2" dependency for libc::kill() to send signals to process groups.
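A group-wide kill reaches background jobs because a job started by a non-interactive shell stays in that shell's process group. The sketch below checks this under some assumptions: it declares `getpgid` and `kill` directly instead of using the libc crate, runs `sh -c`, and uses the hypothetical name `group_ids_of_background_job`. Unix only.

```rust
use std::io::{self, BufRead, BufReader};
use std::os::unix::process::CommandExt; // provides `process_group`
use std::process::{Command, Stdio};

// Declared directly to keep the sketch dependency-free; the PR uses the libc crate.
unsafe extern "C" {
    fn getpgid(pid: i32) -> i32;
    fn kill(pid: i32, sig: i32) -> i32;
}
const SIGKILL: i32 = 9;

/// Returns (shell pid, shell pgid, background job pgid).
pub fn group_ids_of_background_job() -> io::Result<(i32, i32, i32)> {
    // The shell starts a background job, prints its pid, then waits on it.
    let mut child = Command::new("sh")
        .arg("-c")
        .arg("sleep 30 & echo $!; wait")
        .stdout(Stdio::piped())
        .process_group(0) // pgid becomes the shell's own pid
        .spawn()?;
    let shell_pid = child.id() as i32;

    // Read the background job's pid from the first line of output.
    let mut line = String::new();
    BufReader::new(child.stdout.take().expect("stdout was piped")).read_line(&mut line)?;
    let job_pid: i32 = line
        .trim()
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

    let ids = unsafe { (shell_pid, getpgid(shell_pid), getpgid(job_pid)) };

    // Negative pid: deliver SIGKILL to the whole group, then reap the shell.
    unsafe { kill(-shell_pid, SIGKILL) };
    child.wait()?;
    Ok(ids)
}
```

All three returned values should be equal: the shell leads a group named after its own pid, and the background `sleep` inherits that group, so one `kill(-pgid, SIGKILL)` covers both.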

How It Was Tested

Environment

Host: Linux (Docker 29.2.0)
Sage image: rebuilt from fix/shell-tool-timeout-enforcement branch
All three containers healthy (sage, sage-postgres, sage-signal-cli)

Test 1: Command completes within timeout

Input: Asked Sage via Signal to run sleep 120 (LLM chose timeout of 130s)
Expected: Command completes normally after 120s
Result: PASS -- command ran for full 120 seconds, exited with code 0, Sage responded immediately after completion

Test 2: Command exceeds default timeout

Input: Asked Sage via Signal to run sleep 999 with default timeout
Expected: Command killed after 60s (default timeout), Sage returns timeout error and resumes processing
Result: PASS -- shell tool logged Executing shell command: sleep 999 (timeout: 60s), command was killed at 60s, Sage received Command timed out after 60s error and responded to the user within seconds

Test 3: Agent not blocked during/after timeout

Input: Sent follow-up messages while sleep 999 was running
Expected: Sage processes messages after timeout fires (not permanently stuck)
Result: PASS -- Sage responded immediately after the 60s timeout, no zombie processes left behind

Before vs After

| Scenario | Before (sync `Command::output()`) | After (async + process group kill) |
| --- | --- | --- |
| `sleep 999` | Agent blocked for 999s | Killed at 60s, agent resumes with partial output |
| `nohup ./daemon &` | Agent blocked forever | Killed at timeout, agent resumes with partial output |
| Zombie processes | Accumulated on each stuck command | Process group killed cleanly |
| Long build (10+ min) | Would have been capped at 300s | Agent sets appropriate timeout, no artificial cap |

Production Context

This bug was discovered when the LLM decided to launch Syncthing (a file sync daemon) via the shell tool. The nohup ./syncthing ... & command spawned a persistent process, and the parent bash shell never exited. Sage became completely unresponsive in Signal -- it received messages but could not process them. The only recovery was to manually exec into the container and kill the hung processes. The LLM would then immediately try to relaunch Syncthing, getting stuck again in an infinite loop. This happened 5+ times before this fix was implemented.

The shell tool parsed the timeout parameter but never used it -- commands ran via std::process::Command::output() which blocks indefinitely. This caused the agent to freeze whenever the LLM launched a long-running or persistent process (e.g. a daemon).

Switch to tokio::process::Command with tokio::time::timeout() for actual enforcement. Spawn children in their own process group (.process_group(0)) so kill(-pgid, SIGKILL) reaches all descendants on timeout. Update the tool description to warn the LLM against launching background daemons.

Fixes AnthonyRonning#8

AnthonyRonning · 2026-02-08T18:26:55Z

I think the agent should manage their own timeouts, defaulted to 60s, and the command output should still show what was done before it was killed

Address review feedback:

- On timeout, drain stdout/stderr pipes before returning so the agent sees what the command produced before it was killed
- Remove 300s hard cap on timeout (now 24h safety rail); agent sets whatever timeout is appropriate, default remains 60s
- Clean up tool description: no more lecturing about daemons
- Reap zombie process after SIGKILL to avoid leaking PIDs
- Extract drain_pipe/drain_stderr/format_output helpers

marksftw · 2026-02-08T21:10:20Z

Good call on both points.

Pushed a fix:

Agent manages timeouts — Removed the 300s hard cap. The agent is free to set whatever timeout it needs (default stays 60s). There's a 24-hour ceiling as a sanity check against nonsensical values, but happy to remove that entirely if you'd prefer no cap at all.
Partial output on timeout — Restructured the execution to hold the stdout/stderr pipe handles separately (.take() before waiting). On timeout, after killing the process group, we drain whatever was buffered in the pipes and return it as STDOUT (partial) / STDERR (partial) along with the timeout notice. The agent now sees exactly what happened before the kill.
Cleaned up tool description — Removed the lecturing "do NOT launch daemons" language. The description now just tells the agent to set the timeout appropriately and that partial output is returned if it expires.
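The timeout policy and the partial-output message described in this comment could look something like the sketch below. It is an illustration under assumptions: the names `resolve_timeout` and `format_timeout_output`, the constants, the decision to clamp (rather than reject) values above 24 hours, and the exact message layout are not taken from the PR. Only the 60s default, the 24-hour ceiling, and the STDOUT (partial) / STDERR (partial) labels come from the discussion.

```rust
use std::time::Duration;

const DEFAULT_TIMEOUT_SECS: u64 = 60;
const MAX_TIMEOUT_SECS: u64 = 24 * 60 * 60; // 24-hour sanity ceiling

/// Uses the agent's requested timeout, falling back to the default and
/// clamping nonsensical values to the ceiling (clamping is an assumption).
pub fn resolve_timeout(requested_secs: Option<u64>) -> Duration {
    let secs = requested_secs
        .unwrap_or(DEFAULT_TIMEOUT_SECS)
        .min(MAX_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Builds the message returned to the agent after a timeout, including
/// whatever output was drained from the pipes. Empty streams are omitted.
pub fn format_timeout_output(timeout: Duration, stdout: &str, stderr: &str) -> String {
    let mut msg = format!("Command timed out after {}s", timeout.as_secs());
    if !stdout.is_empty() {
        msg.push_str(&format!("\n\nSTDOUT (partial):\n{stdout}"));
    }
    if !stderr.is_empty() {
        msg.push_str(&format!("\n\nSTDERR (partial):\n{stderr}"));
    }
    msg
}
```

With these definitions, an omitted timeout resolves to 60s, a request for 130s is honored as-is, and an absurdly large request is cut down to 86,400s.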

AnthonyRonning merged commit e8732a2 into AnthonyRonning:master Feb 9, 2026
7 checks passed

marksftw deleted the fix/shell-tool-timeout-enforcement branch February 9, 2026 21:15

AnthonyRonning merged 2 commits into AnthonyRonning:master from marksftw:fix/shell-tool-timeout-enforcement
