feat: Agent container lifecycle management (idle pause, auto-restart) #147
Description
Problem
In a multi-agent Docker architecture where each bot runs in its own container, there is no mechanism to manage container lifecycle based on activity. Idle agent containers consume memory and CPU indefinitely, and there is no auto-restart behaviour if a container exits unexpectedly.
Motivation
For a deployment running many agents (e.g. one per project or team), always-on containers are wasteful. A lifecycle model where:
- Idle agents are paused or stopped after a configurable timeout
- Agents are started or resumed on the next incoming message
- Crashed agents are restarted automatically
...would make multi-agent deployments practical on modest hardware without manual intervention.
This also has a security benefit: a stopped container has no running process and a smaller attack surface.
Proposed Solution
Idle detection and pause/stop
The admin bridge monitors activity per agent. When no messages have been processed for a configurable idle timeout (agentIdleTimeout, e.g. 30 minutes):
- The admin bridge calls the Docker API to pause or stop the agent container
- The container state is updated in SQLite (status: paused | stopped)
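The idle check described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `IdleTracker` class and its method names are hypothetical, and the actual Docker call and SQLite update are left to the caller.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IdleTracker:
    """Hypothetical helper: remembers when each agent last processed a message."""

    idle_timeout: float  # seconds, mirrors the proposed docker.agentIdleTimeout
    last_activity: dict = field(default_factory=dict)

    def record_activity(self, agent_id: str, now: Optional[float] = None) -> None:
        # Called by the bridge each time a message is processed for an agent.
        self.last_activity[agent_id] = time.monotonic() if now is None else now

    def idle_agents(self, now: Optional[float] = None) -> list:
        # Agents whose last activity is at least idle_timeout seconds old.
        current = time.monotonic() if now is None else now
        return sorted(
            agent_id
            for agent_id, seen in self.last_activity.items()
            if current - seen >= self.idle_timeout
        )
```

A periodic task in the admin bridge would call `idle_agents()`, pause or stop each returned container through the Docker API, and then write the new status to SQLite.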
Wake on message
When a new message arrives for a bot whose container is paused or stopped:
- Admin bridge detects the container is not running (checks SQLite state or Docker API)
- Admin bridge starts/unpauses the container
- A "waking up..." indicator is sent to the channel while the container initialises
- Message is delivered once the container is ready
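The wake sequence above might look something like the sketch below. All five callables are hypothetical stand-ins (a SQLite or Docker status lookup, a Docker start/unpause call, a channel post, a readiness check, and the normal delivery path); the function only illustrates the ordering of the steps.

```python
from typing import Callable


def deliver_with_wake(
    agent_id: str,
    message: str,
    *,
    get_status: Callable[[str], str],
    start: Callable[[str], None],
    notify: Callable[[str], None],
    wait_ready: Callable[[str], None],
    deliver: Callable[[str, str], None],
) -> None:
    """Wake a paused or stopped agent if needed, then deliver the message."""
    if get_status(agent_id) in ("paused", "stopped"):
        start(agent_id)          # start or unpause the container
        notify("waking up...")   # indicator shown while the container initialises
        wait_ready(agent_id)     # assumed to block until the agent can accept messages
    deliver(agent_id, message)
```

A running agent skips the wake branch entirely, so the common path adds only one status lookup per message. Other states, such as a failed agent, are not handled in this sketch.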
Auto-restart on crash
If a container exits with a non-zero code, the admin bridge detects this via Docker event streaming and attempts a restart with backoff (e.g. 3 attempts, doubling delay). After max retries, the bot is marked as failed in SQLite and the admin agent notifies the configured admin channel.
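To make the backoff concrete, here is a small sketch of a doubling schedule and a retry loop. The function names are hypothetical, and it assumes the bridge waits the base delay before the first attempt; the issue does not specify whether the first restart is immediate.

```python
import time
from typing import Callable, Iterable


def restart_delays(max_restarts: int, base_seconds: float) -> list:
    """Doubling delays: base, 2*base, 4*base, ... (one entry per attempt)."""
    return [base_seconds * 2**attempt for attempt in range(max_restarts)]


def restart_with_backoff(
    try_restart: Callable[[], bool],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a restart succeeds, False when every attempt has failed."""
    for delay in delays:
        sleep(delay)  # assumption: wait before each attempt, including the first
        if try_restart():
            return True
    return False  # caller would mark the bot as failed and notify the admin channel
```

With the example values from this proposal (3 attempts, 10 second base), `restart_delays(3, 10)` yields delays of 10, 20 and 40 seconds.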
Config
{
  "docker": {
    "agentIdleTimeout": 1800,
    "agentMaxRestarts": 3,
    "agentRestartBackoffSeconds": 10
  }
}
Lifecycle states
created -> starting -> running -> idle -> paused/stopped -> starting (on wake)
-> crashed -> restarting (with backoff) -> failed
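One way to enforce these states is a transition table that the bridge consults before updating SQLite. The sketch below is illustrative only. Two edges are assumptions not stated in the diagram: that the crash branch leaves from `running`, and that a successful restart returns to `running`.

```python
# Allowed next states for each lifecycle state (hypothetical encoding).
TRANSITIONS = {
    "created": {"starting"},
    "starting": {"running"},
    "running": {"idle", "crashed"},       # assumption: crashes branch off "running"
    "idle": {"paused", "stopped"},
    "paused": {"starting"},               # on wake
    "stopped": {"starting"},              # on wake
    "crashed": {"restarting"},
    "restarting": {"running", "failed"},  # assumption: success returns to "running"
    "failed": set(),                      # terminal until an admin intervenes
}


def can_transition(current: str, target: str) -> bool:
    """True if moving from current to target is allowed by the table."""
    return target in TRANSITIONS.get(current, set())
```

Rejecting transitions that are not in the table gives the bridge a single place to catch races, for example a wake request arriving for an agent already marked failed.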
Deliverables
- Idle timeout tracking per agent in SQLite
- Pause/stop on idle timeout via Docker API
- Wake on incoming message (start/unpause + hold/indicator)
- Crash detection via Docker event stream
- Auto-restart with configurable backoff
- Admin notification on max retries exceeded
- Config schema: docker.agentIdleTimeout, docker.agentMaxRestarts, docker.agentRestartBackoffSeconds
- Documentation: lifecycle states and configuration
Open Questions
- Pause vs stop on idle: pause preserves memory state (faster resume) but still uses RAM; stop frees all resources but requires a cold start. Should this be configurable?
- Should the "waking up" indicator be a Mattermost ephemeral message, or a temporary reply?
- How should the admin bridge handle a message that arrives while a container is mid-restart?
Reported By
Agent (automated) - drafted collaboratively with user raykao