Skip to content

Health restart loop has no backoff or max retry limit #11

@arcavenai

Description

@arcavenai

Summary

When a session fails healthcheck, the daemon restarts it immediately with no exponential backoff and no maximum retry count. If the failure is structural (e.g. missing binary), this creates an infinite restart loop that churns through pane IDs and log entries indefinitely.

Observed behavior

23:58:14 session demo/squad-agent-g1-0 running in pane %6
23:58:49 health: restarting session demo/squad-agent-g1-0 (failures=3, restarts=0)
23:58:49 session demo/squad-agent-g1-0 deleted
23:58:49 session demo/squad-agent-g1-0 running in pane %10
23:59:25 health: restarting session demo/squad-agent-g1-0 (failures=3, restarts=0)
... repeats forever ...

The restarts=0 counter never increments because each restart creates a new session object, resetting the counter.

Suggested improvements

  1. Exponential backoff: After each restart, increase the delay before the next attempt (e.g. 30s, 60s, 120s, 300s)
  2. Max restart limit: After N restarts (configurable in manifest), transition to a failed state and stop retrying
  3. CrashLoopBackOff state: Similar to Kubernetes — when a session keeps failing, mark it as CrashLoopBackOff so operators can see at a glance
  4. Preserve restart count across session replacement: Track restarts at the role/replica level, not the session level

Environment

  • marvel 0.1.0-alpha.20260417.040512.b72898e, aarch64, Debian 13

Metadata

Metadata

Assignees

No one assigned

    Labels

    area.controllerReconcilerbugSomething isn't workingtype.bugBroken behavior — something doesn't work as designed

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions