fix: reap-path crash-loop backoff + Crashed marker (#10, #11) #24

Merged
arcaven merged 2 commits into main from fix/crashloop-backoff-reap-path on Apr 19, 2026

@arcaven arcaven (Collaborator) commented Apr 18, 2026

Summary

Closes the pair of reap-path regressions Skippy verified in alpha 0.1.0-alpha.20260418.025039.50c78ae:

Fix

Reap-path crash-loop bookkeeping (#11):

Converge both crash paths on a shared noteCrashAndBackoff helper so the same RoleHealth state and backoff semantics apply whether a crash arrives via stale heartbeat or via pane disappearance:

  • Health path (restartSession via evaluateHealth) still marks the session Failed on saturation without deleting, keeping it visible via marvel get sessions.
  • Reap path (ReconcileOnce via ReapDead) now records crashes against the role's RoleHealth before the reconciler decides whether to spawn.
  • Saturation on either path freezes BackoffUntil to the far future, so reconcileRole's existing backoff gate refuses further spawns.

ReapDead now returns []ReapedSession with role coordinates so the controller can attribute crashes to the right role.
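
The shared bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `RoleHealth` fields, the helper's signature, the doubling delay, the `>=` saturation comparison, the `farFuture` sentinel, and the `maySpawn` and `simulate` helpers are all assumptions made for the example. Only the names `noteCrashAndBackoff`, `RoleHealth`, `BackoffUntil`, `RestartCount`, and `MaxRestarts` come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// RoleHealth is a hypothetical stand-in for the per-role crash bookkeeping.
type RoleHealth struct {
	RestartCount int
	BackoffUntil time.Time
}

// farFuture is an assumed sentinel for "never spawn again".
var farFuture = time.Date(9999, time.January, 1, 0, 0, 0, 0, time.UTC)

// noteCrashAndBackoff records one crash against the role and sets the next
// backoff deadline. It returns true once the role is saturated. Both the
// health path and the reap path would call this same helper.
func noteCrashAndBackoff(h *RoleHealth, now time.Time, base time.Duration, maxRestarts int) bool {
	h.RestartCount++
	if h.RestartCount >= maxRestarts { // assumed comparison
		h.BackoffUntil = farFuture // freeze: the backoff gate now refuses every spawn
		return true
	}
	// Assumed policy: the delay doubles with each recorded crash.
	h.BackoffUntil = now.Add(base << (h.RestartCount - 1))
	return false
}

// maySpawn mirrors a backoff gate: spawning is allowed only once the
// deadline has passed.
func maySpawn(h *RoleHealth, now time.Time) bool {
	return !now.Before(h.BackoffUntil)
}

// simulate feeds n crashes into a fresh RoleHealth and reports the restart
// count, whether the role saturated, and whether a spawn would be allowed
// one hour after the crashes.
func simulate(n, maxRestarts int) (int, bool, bool) {
	h := &RoleHealth{}
	now := time.Date(2026, time.April, 18, 0, 0, 0, 0, time.UTC)
	saturated := false
	for i := 0; i < n; i++ {
		saturated = noteCrashAndBackoff(h, now, time.Second, maxRestarts)
	}
	return h.RestartCount, saturated, maySpawn(h, now.Add(time.Hour))
}

func main() {
	for _, n := range []int{1, 3} {
		count, saturated, allowed := simulate(n, 3)
		fmt.Printf("crashes=%d restarts=%d saturated=%v spawnAfter1h=%v\n", n, count, saturated, allowed)
	}
}
```

Because saturation is expressed through `BackoffUntil` alone, a single gate covers both the ordinary delay and the permanent stop, which matches the "no separate MaxRestarts gate" design choice.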

SessionCrashed transient marker (#10):

New SessionCrashed state. ReapDead marks the session Crashed with PaneID="" instead of deleting it, so operators see the transient via marvel get sessions for the full backoff window. The reconciler clears Crashed markers for a role as the last step before spawning a replacement.

  • SessionState.CountsAsAlive() predicate: Pending/Running/CrashLoopBackOff count toward a role's replica total; Crashed/Failed/Succeeded do not. reconcileRole uses this for actual, so Crashed markers don't block the respawn logic.
  • Manager.ReapDead caps the store at one Crashed marker per role (saturated roles can't accumulate ghosts).
  • Manager.ClearCrashedForRole is the explicit cleanup point the team controller calls when spawning.
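
As an illustration of the predicate in the first bullet, here is a minimal sketch. The state names and the `CountsAsAlive` method come from the description; the constant identifiers other than `SessionCrashed`, the string representation of the type, and the `countAlive` helper are assumptions for the example.

```go
package main

import "fmt"

// SessionState is modeled as a string here; the real type may differ.
type SessionState string

const (
	SessionPending          SessionState = "Pending"
	SessionRunning          SessionState = "Running"
	SessionCrashLoopBackOff SessionState = "CrashLoopBackOff"
	SessionCrashed          SessionState = "Crashed"
	SessionFailed           SessionState = "Failed"
	SessionSucceeded        SessionState = "Succeeded"
)

// CountsAsAlive reports whether a session in this state counts toward a
// role's replica total.
func (s SessionState) CountsAsAlive() bool {
	switch s {
	case SessionPending, SessionRunning, SessionCrashLoopBackOff:
		return true
	default:
		return false
	}
}

// countAlive is a hypothetical helper showing how a reconciler could
// compute `actual` so that Crashed markers never block a respawn.
func countAlive(states []SessionState) int {
	n := 0
	for _, s := range states {
		if s.CountsAsAlive() {
			n++
		}
	}
	return n
}

func main() {
	states := []SessionState{SessionRunning, SessionCrashed, SessionPending, SessionFailed}
	desired := 3
	actual := countAlive(states)
	fmt.Printf("actual=%d desired=%d toSpawn=%d\n", actual, desired, desired-actual)
}
```

With one Crashed and one Failed entry in the store, only the Running and Pending sessions are counted, so the role is still one replica short of `desired`.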

Test plan

  • TestReapPathBumpsRestartCount — single pane-kill bumps RestartCount, sets backoff, leaves a Crashed marker visible
  • TestReapPathRespawnsAfterBackoff — after backoff elapses, respawn is Running and the Crashed marker is cleared
  • TestReapPathSaturatesMaxRestarts — MaxRestarts honored for reap-only crashes; no live replacement after saturation, at most one Crashed marker retained
  • TestReapDead (session pkg) — updated to assert Crashed marker semantics (not deleted)
  • TestReapDeadCapsCrashedMarkers (new) — pre-existing Crashed marker cleared when a new crash arrives
  • All existing health-path tests pass unchanged (TestHealthRestartAlways, ...BackoffHoldsReplacement, ...BackoffSiblingMarked, ...MaxReached) — shared helper preserves semantics
  • just ci equivalent: gofumpt + vet + golangci-lint (0 issues) + race detector

Refs: #10, #11, aae-orc-xhk, aae-orc-8ci

Skippy verified in #11 that `marvel inject ... "exit"`
triggered instant respawns with zero spacing and no restart-count
tracking — the pane-reaper code path bypassed restartSession entirely.
ReapDead deleted the session, the reconciler saw actual<desired, and
spawned a replacement immediately.

Converge both crash paths on a shared noteCrashAndBackoff helper:
- Health path (restartSession via evaluateHealth) still marks session
  Failed on saturation without deleting — keeps it visible via
  `marvel get sessions`.
- Reap path (ReconcileOnce via ReapDead) now records crashes against
  the role's RoleHealth before the reconciler decides whether to spawn.
- Saturation on either path freezes BackoffUntil to the far future so
  reconcileRole's existing backoff gate refuses further spawns. No
  separate MaxRestarts gate needed — preserves the health path's
  Failed-marker-in-store invariant.

ReapDead now returns []ReapedSession with role coordinates so the
controller can attribute the crash to the right role.

Refs: #11, aae-orc-xhk

Skippy verified #10 as partial in alpha 0.1.0-alpha.20260418: reap is
fast and logged, but `marvel get sessions` never shows the crashed/
exited transient because the daemon deleted the session from the store
immediately and respawned faster than an operator could refresh the CLI.

Introduce SessionCrashed: a transition state set by ReapDead when a
pane vanishes. The session stays in the store with PaneID cleared so
operators see it via `marvel get sessions` for the full backoff window.
When the reconciler spawns a replacement, it clears Crashed markers
for that role as the last step before Create — the fresh session is
the new truth.
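
The marker lifecycle described above could look roughly like this. It is a sketch under stated assumptions, not the code in this commit: the `Session` and `ReapedSession` field names, the slice-backed store, the plain string states, and the `alive` callback are hypothetical. Only `ReapDead`, `ReapedSession`, `ClearCrashedForRole`, `PaneID`, and the Crashed state are named in the source.

```go
package main

import "fmt"

// Session and ReapedSession use assumed field names for illustration.
type Session struct {
	Name, Role, PaneID, State string
}

type ReapedSession struct {
	Name, Role string
}

// Manager keeps sessions in a slice here; the real store may differ.
type Manager struct {
	sessions []*Session
}

// ClearCrashedForRole drops every Crashed marker belonging to role.
func (m *Manager) ClearCrashedForRole(role string) {
	kept := m.sessions[:0]
	for _, s := range m.sessions {
		if s.Role == role && s.State == "Crashed" {
			continue
		}
		kept = append(kept, s)
	}
	m.sessions = kept
}

// ReapDead marks sessions whose pane has vanished as Crashed instead of
// deleting them, keeps at most one Crashed marker per role, and returns
// the role coordinates so a caller can attribute each crash.
func (m *Manager) ReapDead(alive func(paneID string) bool) []ReapedSession {
	var reaped []ReapedSession
	// Iterate over a copy because ClearCrashedForRole rewrites m.sessions.
	snapshot := append([]*Session(nil), m.sessions...)
	for _, s := range snapshot {
		if s.State != "Running" || alive(s.PaneID) {
			continue
		}
		m.ClearCrashedForRole(s.Role) // cap: drop any older marker first
		s.State = "Crashed"
		s.PaneID = ""
		reaped = append(reaped, ReapedSession{Name: s.Name, Role: s.Role})
	}
	return reaped
}

func main() {
	m := &Manager{sessions: []*Session{{Name: "w-1", Role: "worker", PaneID: "%1", State: "Running"}}}
	dead := func(string) bool { return false } // every pane has vanished

	fmt.Println(m.ReapDead(dead), len(m.sessions))

	// A replacement is spawned and crashes too: still one marker for the role.
	m.sessions = append(m.sessions, &Session{Name: "w-2", Role: "worker", PaneID: "%2", State: "Running"})
	fmt.Println(m.ReapDead(dead), len(m.sessions))

	// A spawning reconciler would clear the marker just before Create.
	m.ClearCrashedForRole("worker")
	fmt.Println(len(m.sessions))
}
```
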

- api.SessionState gains SessionCrashed, plus SessionState.CountsAsAlive
  helper. Only Pending, Running, and CrashLoopBackOff count toward a
  role's replica total; Crashed, Failed, and Succeeded do not.
- session.Manager.ReapDead marks Crashed instead of deleting. Caps the
  store at one Crashed marker per role (clearStaleCrashed) so the
  saturated-role case can't accumulate ghosts.
- session.Manager.ClearCrashedForRole is the explicit cleanup point
  the team controller calls when spawning a replacement.
- team.Controller.reconcileRole uses CountsAsAlive for `actual`, so
  Crashed markers don't block the respawn logic.

Refs: #10, aae-orc-8ci
@arcaven arcaven changed the title from "fix(health): route reap-path crashes through crash-loop backoff (#11)" to "fix: reap-path crash-loop backoff + Crashed marker (#10, #11)" Apr 18, 2026
@arcaven arcaven merged commit 3f385df into main Apr 19, 2026
7 checks passed
@arcaven arcaven added the type.bug (Broken behavior: something doesn't work as designed), agent.worker (PR created by a Claude Code worker agent), area.controller (Reconciler), and area.session (Session lifecycle) labels Apr 19, 2026