fix: reap-path crash-loop backoff + Crashed marker (#10, #11) #24
Merged
Skippy verified in #11 that `marvel inject ... "exit"` triggered instant respawns with zero spacing and no restart-count tracking: the pane-reaper code path bypassed restartSession entirely. ReapDead deleted the session, the reconciler saw actual<desired, and spawned a replacement immediately.

Converge both crash paths on a shared noteCrashAndBackoff helper:

- Health path (restartSession via evaluateHealth) still marks the session Failed on saturation without deleting, which keeps it visible via `marvel get sessions`.
- Reap path (ReconcileOnce via ReapDead) now records crashes against the role's RoleHealth before the reconciler decides whether to spawn.
- Saturation on either path freezes BackoffUntil to the far future so reconcileRole's existing backoff gate refuses further spawns. No separate MaxRestarts gate is needed, which preserves the health path's Failed-marker-in-store invariant.

ReapDead now returns []ReapedSession with role coordinates so the controller can attribute the crash to the right role.

Refs: #11, aae-orc-xhk
Skippy verified #10 partial in alpha 0.1.0-alpha.20260418: reap is fast and logged, but `marvel get sessions` never shows the crashed/exited transient because the daemon deleted the session from the store immediately and respawned faster than an operator could refresh the CLI.

Introduce SessionCrashed: a transition state set by ReapDead when a pane vanishes. The session stays in the store with PaneID cleared so operators see it via `marvel get sessions` for the full backoff window. When the reconciler spawns a replacement, it clears Crashed markers for that role as the last step before Create, because the fresh session is the new truth.

- api.SessionState gains SessionCrashed, plus a SessionState.CountsAsAlive helper. Only Pending, Running, and CrashLoopBackOff count toward a role's replica total; Crashed, Failed, and Succeeded do not.
- session.Manager.ReapDead marks Crashed instead of deleting. It caps the store at one Crashed marker per role (clearStaleCrashed) so the saturated-role case can't accumulate ghosts.
- session.Manager.ClearCrashedForRole is the explicit cleanup point the team controller calls when spawning a replacement.
- team.Controller.reconcileRole uses CountsAsAlive for `actual`, so Crashed markers don't block the respawn logic.

Refs: #10, aae-orc-8ci
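To make the replica-counting rule concrete, here is a small sketch of the predicate and how a reconciler might use it. Only `SessionState`, `SessionCrashed`, `CountsAsAlive`, and `PaneID` are named in the commit message; the other constant names, the string representation, the `Session` struct, and `countAlive` are hypothetical and may differ from the real `api` package.

```go
package main

import "fmt"

// SessionState is modeled as a string here; the real representation is assumed.
type SessionState string

// Only SessionCrashed is named in the PR; the other constant names are guesses
// that follow the same pattern.
const (
	SessionPending          SessionState = "Pending"
	SessionRunning          SessionState = "Running"
	SessionCrashLoopBackOff SessionState = "CrashLoopBackOff"
	SessionCrashed          SessionState = "Crashed"
	SessionFailed           SessionState = "Failed"
	SessionSucceeded        SessionState = "Succeeded"
)

// CountsAsAlive reports whether a session in this state counts toward a
// role's replica total. Crashed, Failed, and Succeeded do not.
func (s SessionState) CountsAsAlive() bool {
	switch s {
	case SessionPending, SessionRunning, SessionCrashLoopBackOff:
		return true
	default:
		return false
	}
}

// Session is a hypothetical, trimmed-down store record.
type Session struct {
	Role   string
	State  SessionState
	PaneID string // empty once the pane has vanished
}

// countAlive is a hypothetical helper computing `actual` for one role.
func countAlive(sessions []Session, role string) int {
	n := 0
	for _, s := range sessions {
		if s.Role == role && s.State.CountsAsAlive() {
			n++
		}
	}
	return n
}

func main() {
	sessions := []Session{
		{Role: "worker", State: SessionRunning, PaneID: "%1"},
		{Role: "worker", State: SessionCrashed}, // visible marker, not a replica
	}
	desired := 2
	actual := countAlive(sessions, "worker")
	fmt.Println("respawn needed:", actual < desired)
}
```

Keeping the Crashed marker out of `actual` is what lets the marker stay visible in the store without being mistaken for a live replica.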
Summary
Closes the pair of reap-path regressions Skippy verified in alpha 0.1.0-alpha.20260418.025039.50c78ae:
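- `marvel inject ... "exit"` triggered instant respawns with zero spacing and no restart-count tracking, because the pane-reaper code path bypassed `restartSession` entirely (#11).
- `marvel get sessions` showed nothing during the crashed/exited transient because the session was deleted from the store before an operator could see it (#10).

Fix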
Reap-path crash-loop bookkeeping (#11):

Converge both crash paths on a shared `noteCrashAndBackoff` helper so the same RoleHealth state and backoff semantics apply whether a crash arrives via stale heartbeat or via pane disappearance:

- Health path (`restartSession` via `evaluateHealth`) still marks the session `Failed` on saturation without deleting, keeping it visible via `marvel get sessions`.
- Reap path (`ReconcileOnce` via `ReapDead`) now records crashes against the role's `RoleHealth` before the reconciler decides whether to spawn.
- Saturation on either path freezes `BackoffUntil` to the far future, so `reconcileRole`'s existing backoff gate refuses further spawns.
- `ReapDead` now returns `[]ReapedSession` with role coordinates so the controller can attribute crashes to the right role.

SessionCrashed transient marker (#10):
- New `SessionCrashed` state. `ReapDead` marks the session `Crashed` with `PaneID=""` instead of deleting, so operators see the transient via `marvel get sessions` for the full backoff window. The reconciler clears Crashed markers for a role as the last step before spawning a replacement.
- `SessionState.CountsAsAlive()` predicate: Pending/Running/CrashLoopBackOff count toward a role's replica total; Crashed/Failed/Succeeded do not. `reconcileRole` uses this for `actual`, so Crashed markers don't block the respawn logic.
- `Manager.ReapDead` caps the store at one Crashed marker per role (saturated roles can't accumulate ghosts).
- `Manager.ClearCrashedForRole` is the explicit cleanup point the team controller calls when spawning.

Test plan
- `TestReapPathBumpsRestartCount`: a single pane-kill bumps RestartCount, sets backoff, and leaves a Crashed marker visible
- `TestReapPathRespawnsAfterBackoff`: after backoff elapses, the respawn is Running and the Crashed marker is cleared
- `TestReapPathSaturatesMaxRestarts`: MaxRestarts honored for reap-only crashes; no live replacement after saturation, at most one Crashed marker retained
- `TestReapDead` (session pkg): updated to assert Crashed marker semantics (not deleted)
- `TestReapDeadCapsCrashedMarkers` (new): pre-existing Crashed marker cleared when a new crash arrives
- Health-path tests (`TestHealthRestartAlways`, `...BackoffHoldsReplacement`, `...BackoffSiblingMarked`, `...MaxReached`): shared helper preserves semantics
- `just ci` equivalent: gofumpt + vet + golangci-lint (0 issues) + race detector

Refs: #10, #11, aae-orc-xhk, aae-orc-8ci