# feat: Agent container lifecycle management (idle pause, auto-restart)

## Problem

In a multi-agent Docker architecture where each bot runs in its own container, there is no mechanism to manage container lifecycle based on activity. Idle agent containers consume memory and CPU indefinitely, and there is no auto-restart behaviour if a container exits unexpectedly.

## Motivation

For a deployment running many agents (e.g. one per project or team), always-on containers are wasteful. A lifecycle model with the following behaviour would make multi-agent deployments practical on modest hardware without manual intervention:

- Idle agents are paused or stopped after a configurable timeout
- Agents are started or resumed on the next incoming message
- Crashed agents are restarted automatically

This also has a security benefit: a stopped container has no running processes and therefore a smaller attack surface.

## Proposed Solution

### Idle detection and pause/stop

The admin bridge tracks activity per agent. When an agent has processed no messages for the configured idle timeout (`docker.agentIdleTimeout`, in seconds; e.g. `1800` for 30 minutes):

- The admin bridge calls the Docker API to pause or stop the agent container
- The container state is updated in SQLite (`status: paused | stopped`)
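
The idle check above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `agents` table layout, the function names, and the `suspend` callback (which would wrap the actual Docker pause or stop call) are all hypothetical.

```python
import sqlite3

def init_db(conn):
    # Hypothetical schema: one row per agent with its last activity timestamp.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agents ("
        "name TEXT PRIMARY KEY, status TEXT NOT NULL, last_activity REAL NOT NULL)"
    )

def record_activity(conn, name, now):
    # Called whenever the agent processes a message.
    conn.execute(
        "INSERT OR IGNORE INTO agents (name, status, last_activity) "
        "VALUES (?, 'running', ?)",
        (name, now),
    )
    conn.execute("UPDATE agents SET last_activity = ? WHERE name = ?", (now, name))
    conn.commit()

def idle_sweep(conn, now, idle_timeout, suspend, target_status="paused"):
    """Suspend every running agent that has been idle for idle_timeout seconds."""
    rows = conn.execute(
        "SELECT name FROM agents "
        "WHERE status = 'running' AND ? - last_activity >= ? ORDER BY name",
        (now, idle_timeout),
    ).fetchall()
    suspended = []
    for (name,) in rows:
        suspend(name)  # assumption: wraps the Docker pause/stop call
        conn.execute(
            "UPDATE agents SET status = ? WHERE name = ?", (target_status, name)
        )
        suspended.append(name)
    conn.commit()
    return suspended
```

Passing `target_status="stopped"` together with a stop-based `suspend` callback would cover the stop variant, which keeps the pause-vs-stop decision out of the sweep logic.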

### Wake on message

When a new message arrives for a bot whose container is paused or stopped:

1. Admin bridge detects the container is not running (checks SQLite state or Docker API)
2. Admin bridge starts/unpauses the container
3. A "waking up..." indicator is sent to the channel while the container initialises
4. Message is delivered once the container is ready
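
As a rough sketch of the four steps above, assuming every side effect is injected as a callback: `get_status`, `wake`, `is_ready`, `notify`, and `deliver` are hypothetical names, and the `ready_timeout` parameter is an assumption that the proposal does not specify.

```python
import time

def deliver_with_wake(agent, message, *, get_status, wake, is_ready, notify,
                      deliver, ready_timeout=30.0, poll_interval=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Wake a paused/stopped agent if needed, then deliver the message.

    Returns True if the message was delivered, False if the container
    did not become ready within ready_timeout seconds.
    """
    status = get_status(agent)              # step 1: SQLite state or Docker API
    if status in ("paused", "stopped"):
        wake(agent, status)                 # step 2: unpause or start
        notify(agent, "waking up...")       # step 3: indicator in the channel
        deadline = clock() + ready_timeout
        while not is_ready(agent):
            if clock() >= deadline:
                return False                # assumption: caller decides what to do
            sleep(poll_interval)
    deliver(agent, message)                 # step 4: deliver once ready
    return True
```

Injecting `sleep` and `clock` keeps the flow testable without real delays; a real bridge would bind `wake` to the Docker start/unpause calls.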

### Auto-restart on crash

If a container exits with a non-zero code, the admin bridge detects this via Docker event streaming and attempts a restart with backoff (e.g. 3 attempts, doubling delay). After max retries, the bot is marked as `failed` in SQLite and the admin agent notifies the configured admin channel.
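
A minimal sketch of the retry logic, assuming the delay starts at the configured base value and doubles on each attempt, and that the bridge waits before every attempt including the first. The `restart` callback is hypothetical and stands in for the Docker restart call plus a health check.

```python
import time

def backoff_delays(base_seconds, max_restarts):
    # Doubling schedule: base, 2*base, 4*base, ...
    return [base_seconds * 2 ** attempt for attempt in range(max_restarts)]

def restart_with_backoff(agent, restart, *, base_seconds=10, max_restarts=3,
                         sleep=time.sleep):
    """Try to restart an agent; return the resulting state name.

    restart(agent) is assumed to return True once the container is
    running again and False if the attempt failed.
    """
    for delay in backoff_delays(base_seconds, max_restarts):
        sleep(delay)
        if restart(agent):
            return "running"
    # Max retries exhausted: caller marks the bot failed and notifies admins.
    return "failed"
```

With a base of 10 seconds and 3 attempts, `backoff_delays(10, 3)` yields waits of 10, 20, and 40 seconds.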

### Config

```json
{
  "docker": {
    "agentIdleTimeout": 1800,
    "agentMaxRestarts": 3,
    "agentRestartBackoffSeconds": 10
  }
}
```
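
One way the bridge might read and validate these keys is sketched below. The `LifecycleConfig` class and `load_lifecycle_config` function are hypothetical, and using the example values above as defaults is an assumption rather than something the proposal specifies.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class LifecycleConfig:
    agent_idle_timeout: int = 1800            # seconds (assumed default)
    agent_max_restarts: int = 3               # attempts (assumed default)
    agent_restart_backoff_seconds: int = 10   # initial delay (assumed default)

# Maps JSON keys under "docker" to dataclass field names.
_KEYS = {
    "agentIdleTimeout": "agent_idle_timeout",
    "agentMaxRestarts": "agent_max_restarts",
    "agentRestartBackoffSeconds": "agent_restart_backoff_seconds",
}

def load_lifecycle_config(raw_json: str) -> LifecycleConfig:
    docker = json.loads(raw_json).get("docker", {})
    values = {field: docker[key] for key, field in _KEYS.items() if key in docker}
    cfg = LifecycleConfig(**values)
    for key, field in _KEYS.items():
        value = getattr(cfg, field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(
                f"docker.{key} must be a non-negative integer, got {value!r}"
            )
    return cfg
```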

### Lifecycle states

```
created -> starting -> running -> idle -> paused/stopped -> starting (on wake)
                              -> crashed -> restarting (with backoff) -> failed
```
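
The diagram can be encoded as a transition table so that invalid state changes are rejected before they are written to SQLite. This sketch contains only the edges drawn above; `TRANSITIONS` and `can_transition` are hypothetical names.

```python
# Allowed transitions, taken edge by edge from the lifecycle diagram.
TRANSITIONS = {
    "created": {"starting"},
    "starting": {"running"},
    "running": {"idle", "crashed"},
    "idle": {"paused", "stopped"},
    "paused": {"starting"},      # on wake
    "stopped": {"starting"},     # on wake
    "crashed": {"restarting"},
    "restarting": {"failed"},    # after max retries
    "failed": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())
```

Edges that the diagram does not draw, such as a successful restart returning to `running` or an idle agent returning to `running` when a message arrives before the timeout, would need to be added to the table if they are intended.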

## Deliverables

- [ ] Idle timeout tracking per agent in SQLite
- [ ] Pause/stop on idle timeout via Docker API
- [ ] Wake on incoming message (start/unpause the container, hold the message, show a "waking up..." indicator)
- [ ] Crash detection via Docker event stream
- [ ] Auto-restart with configurable backoff
- [ ] Admin notification on max retries exceeded
- [ ] Config schema: `docker.agentIdleTimeout`, `docker.agentMaxRestarts`, `docker.agentRestartBackoffSeconds`
- [ ] Documentation: lifecycle states and configuration

## Open Questions

- Pause vs stop on idle: pause preserves memory state (faster resume) but still uses RAM; stop frees all resources but requires a cold start. Should this be configurable?
- Should the "waking up" indicator be a Mattermost ephemeral message, or a temporary reply?
- How should the admin bridge handle a message that arrives while a container is mid-restart?

## Reported By

Agent (automated) - drafted collaboratively with user raykao

