
fix(cluster): F5 isolation recovery — re-bootstrap when peer count hits 0 (#107)

Merged
TickTockBent merged 2 commits into main from fix/85-rebootstrap-on-isolation on Apr 27, 2026

Conversation

@TickTockBent
Owner

Summary

Closes #85. A node that loses every peer (e.g., eviction after three consecutive failed pings across all peers, or a brief network outage) had no in-band path back to the cluster — topology sync exchanges peer lists with known peers, so an empty peer set means there's no one to ask. The burn-in repro was stopping node-b/c briefly; node-a evicted both, then never rejoined until manual restart.

Relationship to #87

The F7 fix in #87 already covered the non-zero-peer slice of this issue: pre-#87, self leaked into the peer map, so `len(p.peers)` overstated the count by one and performTopologySync's threshold check was always satisfied even when the node was missing real peers — sync was permanently skipped. Post-#87 that path works.

What's left is the zero-real-peer case: even with honest counts, topology sync has no peer to broadcast to once the local set is empty. This PR adds the recovery path for that case.

Design (Go + TS, mirrored)

  • ClusterNode runs an isolation-recovery loop on a 30s ticker (matches the gossip health-check cadence — recovery starts within one tick of full eviction).
  • Each tick: if the peer count is 0 and a seedProvider (Go) or rebootstrapFn (TS) is wired, re-bootstrap (see the sketch after this list).
  • Polling rather than event-driven from removePeer: multi-eviction in a single pingPeers tick can fire several removals back-to-back. Triggering from each would either spawn duplicate recoveries or require single-flight machinery. Polling is simpler and correct.
  • Public network: re-bootstrap pulls from the omega refresher's current signed list. An isolated node picks up whatever roots are live now, not whatever it started with.
  • Private / REPRAM_PEERS: re-bootstrap reuses the static seed list the operator provided (captured by closure at startup).
  • The recovery method is extracted as CheckIsolationAndRecover (Go) / checkIsolationAndRecover (TS) so tests can drive recovery deterministically without timers.
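
For concreteness, here is a minimal TypeScript sketch of that loop. The class shape and peer map are assumptions for illustration; only `checkIsolationAndRecover`, `setRebootstrapFn`, the timer in `start()`, and `ISOLATION_RECOVERY_INTERVAL_MS` are names from this PR, and the real code in repram-mcp/src/node/cluster.ts may differ.

```ts
// Illustrative sketch only; not the actual implementation.
type RebootstrapFn = () => Promise<void>;

// Matches the gossip health-check cadence, so recovery starts within one
// tick of full eviction.
const ISOLATION_RECOVERY_INTERVAL_MS = 30_000;

class ClusterNode {
  private peers = new Map<string, string>(); // address -> peer state (assumed shape)
  private rebootstrapFn?: RebootstrapFn;
  private isolationTimer?: ReturnType<typeof setInterval>;

  setRebootstrapFn(fn: RebootstrapFn): void {
    this.rebootstrapFn = fn;
  }

  start(): void {
    // Poll on a fixed interval instead of reacting to removePeer, so a
    // multi-eviction burst in one ping tick triggers at most one recovery.
    this.isolationTimer = setInterval(
      () => void this.checkIsolationAndRecover(),
      ISOLATION_RECOVERY_INTERVAL_MS,
    );
  }

  stop(): void {
    if (this.isolationTimer) clearInterval(this.isolationTimer);
  }

  // Extracted so tests can drive recovery directly, without timers.
  async checkIsolationAndRecover(): Promise<void> {
    if (this.peers.size > 0 || !this.rebootstrapFn) return; // not isolated, or nothing wired
    try {
      await this.rebootstrapFn(); // re-bootstrap from the current seed source
    } catch {
      // Errors are swallowed; the next 30s tick is the de-facto retry.
    }
  }
}
```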

What changed

  • internal/cluster/node.go: seedProvider field, SetSeedProvider, runIsolationRecovery goroutine, CheckIsolationAndRecover method, IsolationRecoveryInterval constant
  • cmd/repram/main.go: wires the seedProvider — public closes over refresher.Current().Nodes, private snapshots bootstrapNodes
  • repram-mcp/src/node/cluster.ts: rebootstrapFn field, setRebootstrapFn, isolation timer in start(), checkIsolationAndRecover, ISOLATION_RECOVERY_INTERVAL_MS constant
  • repram-mcp/src/index.ts: wires the rebootstrapFn — public closes over refresher.currentList.nodes, private snapshots seedPeers (see the wiring sketch below)
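
A rough sketch of the TS-side wiring, covering both modes. Only `setRebootstrapFn`, `refresher.currentList.nodes`, and `seedPeers` are names from this PR; `bootstrap()` and the surrounding shapes are assumptions.

```ts
// Hypothetical wiring sketch; not the actual index.ts code.
interface OmegaRefresher {
  currentList: { nodes: string[] };
}
interface RecoverableNode {
  setRebootstrapFn(fn: () => Promise<void>): void;
  bootstrap(seeds: string[]): Promise<void>; // assumed re-bootstrap entry point
}

function wireRebootstrap(
  node: RecoverableNode,
  refresher?: OmegaRefresher,
  seedPeers: string[] = [],
): void {
  if (refresher) {
    // Public network: read the refresher's current signed list on every
    // attempt, so an isolated node rejoins whatever roots are live *now*.
    node.setRebootstrapFn(() => node.bootstrap(refresher.currentList.nodes));
  } else {
    // Private / REPRAM_PEERS: static seed list captured by closure at
    // startup; it is not refreshed at runtime.
    const seeds = [...seedPeers];
    node.setRebootstrapFn(() => node.bootstrap(seeds));
  }
}
```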

Tests

  • internal/cluster/recovery_test.go (+4): happy-path recovery, no-op when peers exist, no-op without seedProvider, no-op on empty seeds.
  • repram-mcp/src/node/cluster.test.ts (+5): same coverage plus an error-swallowing test for a re-bootstrap that throws (the happy-path shape is sketched below).
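
As an illustration, the TS happy-path case might look roughly like this; the real test file may structure it differently, and the import path and constructor are assumptions.

```ts
import { describe, expect, it, vi } from "vitest";
import { ClusterNode } from "./cluster"; // path assumed; mirrors the sketch above

describe("checkIsolationAndRecover", () => {
  it("re-bootstraps when the peer set is empty", async () => {
    const rebootstrap = vi.fn().mockResolvedValue(undefined);
    const node = new ClusterNode();        // assumed constructible without args
    node.setRebootstrapFn(rebootstrap);

    await node.checkIsolationAndRecover(); // zero peers -> recovery fires

    expect(rebootstrap).toHaveBeenCalledTimes(1);
  });
});
```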

Suite results: Go all packages green; TS 369/369 (was 364, +5 new).

Out of scope (follow-ups)

  • WebSocket reattachment: transient nodes that isolate need their WS attachment to recover too. That's handled separately by TreeManager's goodbye/redirect logic. This PR is HTTP gossip recovery only.
  • Threshold tuning: current trigger is hard-coded zero. The issue body floats "configurable threshold" — deferrable until someone needs it.
  • Backoff under sustained outage: the 30s polling interval doubles as the de-facto retry backoff. If the cluster is fully down, re-bootstrap fails every 30s indefinitely. No exponential backoff added — keeps the design simpler. Easy to add later if seeds become rate-limited under sustained-outage scenarios.

Test plan

  • CI green
  • Live wire-compat (./test/live-wire-compat/run.sh) passes — adds confidence that the timer-driven path doesn't interfere with normal operation
  • Manual repro of the burn-in scenario: spin up 3-node cluster, stop 2 nodes, verify the third recovers when the cluster comes back

Closes

Closes #85

// ticktockbent


@TickTockBent
Owner Author

Update — review fixes pushed (`83e52d6`)

  • Suggestion 1 (single-flight, TS): added a `recovering` flag with try/finally release in `checkIsolationAndRecover` (sketched after this list). Without it, a slow re-bootstrap could outlast the 30s timer tick and produce duplicate `addPeer` calls. Go is naturally single-flight (the goroutine blocks on `Bootstrap()` before the next ticker fires), so no Go change.
  • Suggestion 3 (TS timer test): two new tests using vitest fake timers — overlapping-recovery guard, plus the timer-fires-and-stops dispatch path.
  • Suggestion 5 (TS private-seed comment): mirrored the Go comment noting the seed list captured at startup isn't refreshed at runtime; operators rotating seeds need to restart the node.
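
Roughly, the guarded method now has this shape (sketch only; everything beyond the `recovering` flag, the try/finally release, and `checkIsolationAndRecover` is assumed):

```ts
// Sketch of the single-flight guard, not the actual cluster.ts code.
class ClusterNode {
  private peers = new Map<string, string>();
  private rebootstrapFn?: () => Promise<void>;
  private recovering = false;

  async checkIsolationAndRecover(): Promise<void> {
    if (this.recovering) return; // a previous recovery is still in flight
    if (this.peers.size > 0 || !this.rebootstrapFn) return;
    this.recovering = true;
    try {
      await this.rebootstrapFn();
    } catch {
      // swallowed; the next 30s tick retries
    } finally {
      this.recovering = false; // always release, even if re-bootstrap throws
    }
  }
}
```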

Suggestion 2 (TreeManager misframing): the PR body claimed "WebSocket reattachment is handled by TreeManager's goodbye/redirect logic" — only the graceful path actually has that. Ungraceful disconnects (TCP drop, crash, NAT rebind) leave the transient node stranded. Filed as #108 with full repro, impact analysis, and proposed fix. Treat the original "WS reattachment is handled" line in this PR's description as referring only to the graceful path; #108 captures the ungraceful gap as a separate, pre-existing issue.

Suggestion 4 (`currentList` event-loop comment): skipped per author judgment — very minor.

Suite results: TS 371/371 (was 369, +2 new). Go unchanged.

TickTockBent merged commit 3ca2fa9 into main on Apr 27, 2026
4 checks passed
TickTockBent deleted the fix/85-rebootstrap-on-isolation branch on April 27, 2026 at 15:26

Development

Successfully merging this pull request may close these issues.

F5: topology sync doesn't refresh peers after eviction → cluster can lose all peers permanently

1 participant