Skip to content

Reconciler does not respawn sessions after workspace delete + re-apply #29

@arcavenai

Description

@arcavenai

Summary

After marvel delete workspace <name> followed by marvel work <same-manifest>, the team is re-created but no sessions ever spawn. No events are emitted in the 60+ second window observed.

Reproduce (deterministic on this cluster)

  1. Start fresh daemon: marvel daemon --mrvl
  2. Apply crashloop manifest (any role that exits quickly works):
    workspace:
      name: crashloop
    teams:
      - name: flaky
        roles:
          - name: exit-fast
            replicas: 1
            runtime:
              image: shell
              command: sh
              args: ["-c", "exit 1"]
            restart_policy: always
            healthcheck:
              type: process-alive
              timeout: "2s"
              failure_threshold: 1
  3. Observe one session.created + one session.crashed event. Session sticks in crashed state (separate question whether restart_policy=always should retry in this case, but that's not the primary bug).
  4. marvel delete workspace crashloop — observe session.deleted event, workspace cleared.
  5. marvel work <same-manifest> — workspace/team is reported ready.
  6. Wait 60 seconds.

Expected

A fresh session.created event, new session in get sessions output.

Actual

  • marvel get teams shows the team (replicas: 1).
  • marvel describe team crashloop/flaky shows Generation 1, CreatedAt correctly updated, Role config correct.
  • marvel get sessions returns empty.
  • marvel events shows zero new events since the apply. Only the pre-delete events remain visible.

Suspected cause

RoleHealth/backoff state keyed by workspace/team/role may survive workspace delete+recreate, leaving BackoffUntil set far in the future (especially if the previous generation hit saturation in noteCrashAndBackoff). The reconciler's gate then refuses to spawn. But this is a guess — the symptom is "team exists, zero replicas actual, no events, no progress."

Alternatively the Session manager or reconcile queue is stalling after cascade delete (#15) — check whether the cascade cleared everything it needed to.

Environment

  • marvel 0.1.0-alpha.20260419.014522.b8f1f4b (commit b8f1f4b)
  • Fresh daemon, Linux aarch64 Pi, tmux 3.5a
  • mrvl:// (reproduces via Unix socket too)

Impact

Blocks testing of PR #21 / PR #24 crash-loop backoff telemetry — can't accumulate health.crashloop-backoff events if each workspace can only be applied once.

Workaround

Restart daemon between manifest applies.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area.controllerReconcilerarea.storeapi.Storepriority.p1High — blocks sprint worktriage.completeTriage done, ready for worktype.bugBroken behavior — something doesn't work as designed

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions