feat: Agent container lifecycle management (idle pause, auto-restart) #147

@raykao

Problem

In a multi-agent Docker architecture where each bot runs in its own container, there is no mechanism to manage container lifecycle based on activity. Idle agent containers consume memory and CPU indefinitely, and there is no auto-restart behaviour if a container exits unexpectedly.

Motivation

For a deployment running many agents (e.g. one per project or team), always-on containers are wasteful. A lifecycle model where:

  • Idle agents are paused or stopped after a configurable timeout
  • Agents are started or resumed on the next incoming message
  • Crashed agents are restarted automatically

...would make multi-agent deployments practical on modest hardware without manual intervention.

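This also has a security benefit: a stopped container has no running processes, which reduces the attack surface. A paused container keeps its processes frozen in memory, so this benefit applies to stop rather than pause.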

Proposed Solution

Idle detection and pause/stop

The admin bridge monitors activity per agent. When an agent has processed no messages for a configurable idle timeout (docker.agentIdleTimeout, e.g. 1800 seconds, i.e. 30 minutes):

  • The admin bridge calls the Docker API to pause or stop the agent container
  • The container state is updated in SQLite (status: paused | stopped)
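A minimal sketch of the idle check, assuming the bridge keeps a last-activity timestamp per agent. `find_idle_agents` and the in-memory dict are hypothetical stand-ins for the SQLite-backed tracking; the Docker pause/stop call itself is left out.

```python
import time
from typing import Dict, List, Optional

AGENT_IDLE_TIMEOUT = 1800  # seconds; mirrors docker.agentIdleTimeout


def find_idle_agents(
    last_activity: Dict[str, float],
    idle_timeout: float = AGENT_IDLE_TIMEOUT,
    now: Optional[float] = None,
) -> List[str]:
    """Return the IDs of agents whose last processed message is older than idle_timeout."""
    if now is None:
        now = time.time()
    return sorted(
        agent_id
        for agent_id, last_seen in last_activity.items()
        if now - last_seen >= idle_timeout
    )


# Example: agent "a" was last active 31 minutes ago, agent "b" 5 minutes ago.
now = 10_000.0
activity = {"a": now - 31 * 60, "b": now - 5 * 60}
idle = find_idle_agents(activity, now=now)
print(idle)  # → ['a']
```

Each agent returned here would then be paused or stopped and its status row updated.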

Wake on message

When a new message arrives for a bot whose container is paused or stopped:

  1. Admin bridge detects the container is not running (checks SQLite state or Docker API)
  2. Admin bridge starts/unpauses the container
  3. A "waking up..." indicator is sent to the channel while the container initialises
  4. Message is delivered once the container is ready
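The wake flow above could be sketched roughly as follows. All names (`deliver_with_wake`, the callbacks, the `status` dict standing in for the SQLite state) are hypothetical, and the sketch assumes `start_container` blocks until the container is ready, which is why the indicator is sent just before it.

```python
from typing import Callable, Dict, List


def deliver_with_wake(
    agent_id: str,
    message: str,
    status: Dict[str, str],
    start_container: Callable[[str], None],
    notify: Callable[[str], None],
    deliver: Callable[[str, str], None],
) -> None:
    """Wake the agent's container if it is paused or stopped, then deliver the message."""
    if status[agent_id] in ("paused", "stopped"):
        notify("waking up...")      # indicator shown while the container initialises
        start_container(agent_id)   # start or unpause; assumed to block until ready
        status[agent_id] = "running"
    deliver(agent_id, message)      # delivered only once the container is running


# Example with stub callbacks that record what happened.
log: List[str] = []
status = {"bot-1": "paused"}
deliver_with_wake(
    "bot-1",
    "hello",
    status,
    start_container=lambda a: log.append(f"start {a}"),
    notify=lambda text: log.append(f"notify {text}"),
    deliver=lambda a, m: log.append(f"deliver {a}: {m}"),
)
print(log)  # → ['notify waking up...', 'start bot-1', 'deliver bot-1: hello']
```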

Auto-restart on crash

If a container exits with a non-zero code, the admin bridge detects this via Docker event streaming and attempts a restart with backoff (e.g. agentMaxRestarts = 3 attempts, with the delay starting at agentRestartBackoffSeconds and doubling after each attempt). After max retries, the bot is marked as failed in SQLite and the admin agent notifies the configured admin channel.
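As an illustration of the doubling backoff, here is a small sketch. `backoff_delays` and `restart_with_backoff` are hypothetical names, and waiting before every attempt (including the first) is an assumption, since the issue does not specify it.

```python
import time
from typing import Callable, List


def backoff_delays(max_restarts: int = 3, base_seconds: float = 10) -> List[float]:
    """Delay before each restart attempt, doubling every time."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_restarts)]


def restart_with_backoff(
    try_restart: Callable[[], bool],
    max_restarts: int = 3,
    base_seconds: float = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "running" if a restart succeeds, otherwise "failed" after max_restarts attempts."""
    for delay in backoff_delays(max_restarts, base_seconds):
        sleep(delay)  # assumption: wait before each attempt, including the first
        if try_restart():
            return "running"
    return "failed"  # the caller would mark the bot as failed and notify the admin channel


print(backoff_delays())  # → [10, 20, 40]

# Example with a fake sleep so nothing actually waits: the second attempt succeeds.
waits: List[float] = []
outcomes = iter([False, True])
print(restart_with_backoff(lambda: next(outcomes), sleep=waits.append))  # → running
```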

Config

{
  "docker": {
    "agentIdleTimeout": 1800,
    "agentMaxRestarts": 3,
    "agentRestartBackoffSeconds": 10
  }
}
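One way the config above might be loaded and validated is sketched below. The dataclass and `load_lifecycle_config` are hypothetical; the defaults simply reuse the example values, and treating the timeout and backoff as seconds is inferred from the key name agentRestartBackoffSeconds and from 1800 matching the 30-minute example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DockerLifecycleConfig:
    agent_idle_timeout: int = 1800          # seconds (assumed unit)
    agent_max_restarts: int = 3
    agent_restart_backoff_seconds: int = 10


def load_lifecycle_config(raw: str) -> DockerLifecycleConfig:
    """Parse the "docker" section of the config, falling back to the example defaults."""
    docker = json.loads(raw).get("docker", {})
    defaults = DockerLifecycleConfig()
    cfg = DockerLifecycleConfig(
        agent_idle_timeout=docker.get("agentIdleTimeout", defaults.agent_idle_timeout),
        agent_max_restarts=docker.get("agentMaxRestarts", defaults.agent_max_restarts),
        agent_restart_backoff_seconds=docker.get(
            "agentRestartBackoffSeconds", defaults.agent_restart_backoff_seconds
        ),
    )
    for name, value in asdict(cfg).items():
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return cfg


raw = '{"docker": {"agentIdleTimeout": 1800, "agentMaxRestarts": 3, "agentRestartBackoffSeconds": 10}}'
cfg = load_lifecycle_config(raw)
print(cfg.agent_idle_timeout // 60)  # → 30
```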

Lifecycle states

created -> starting -> running -> idle -> paused/stopped -> starting (on wake)
                              -> crashed -> restarting (with backoff) -> failed
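The states above can be encoded as a transition table, sketched here with hypothetical names. The diagram does not show where a successful restart leads; the `restarting -> running` edge below is an assumption and is marked as such.

```python
from typing import Dict, Set

# Allowed transitions, read off the lifecycle diagram.
TRANSITIONS: Dict[str, Set[str]] = {
    "created": {"starting"},
    "starting": {"running"},
    "running": {"idle", "crashed"},
    "idle": {"paused", "stopped"},
    "paused": {"starting"},      # wake on message
    "stopped": {"starting"},     # wake on message
    "crashed": {"restarting"},
    "restarting": {"running", "failed"},  # "running" on success is an assumption
    "failed": set(),
}


def can_transition(current: str, target: str) -> bool:
    """True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new state, or raise ValueError for a transition the table does not allow."""
    if not can_transition(current, target):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target


print(can_transition("running", "crashed"))  # → True
print(can_transition("failed", "running"))   # → False
```

Keeping the table in one place would let the bridge reject inconsistent status updates before they are written to SQLite.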

Deliverables

  • Idle timeout tracking per agent in SQLite
  • Pause/stop on idle timeout via Docker API
  • Wake on incoming message (start/unpause + hold/indicator)
  • Crash detection via Docker event stream
  • Auto-restart with configurable backoff
  • Admin notification on max retries exceeded
  • Config schema: docker.agentIdleTimeout, docker.agentMaxRestarts, docker.agentRestartBackoffSeconds
  • Documentation: lifecycle states and configuration

Open Questions

  • Pause vs stop on idle: pause preserves memory state (faster resume) but still uses RAM; stop frees all resources but requires a cold start. Should this be configurable?
  • Should the "waking up" indicator be a Mattermost ephemeral message, or a temporary reply?
  • How should the admin bridge handle a message that arrives while a container is mid-restart?

Reported By

Agent (automated) - drafted collaboratively with user raykao
