fix(cluster): F5 isolation recovery — re-bootstrap when peer count hits 0 (#107)
TickTockBent merged 2 commits into main
Conversation
Closes #85. A node that loses every peer (e.g., 3-consecutive-ping-failure eviction across all peers, or a brief network outage) had no in-band path back to the cluster: topology sync exchanges peer lists with *known* peers, so an empty peer set means there's no one to ask. The burn-in repro was stopping node-b/c briefly; node-a evicted both and never rejoined until manual restart.

Note: the F7 fix in #87 already partially helped here. Pre-#87, self leaked into the peer map, so `len(p.peers)` overstated by 1 and `performTopologySync`'s threshold check was always satisfied — sync was permanently skipped on a node that had real peers but appeared "full" to itself. Post-#87, that case works correctly. What's left is the zero-real-peer case, which is what this PR addresses.

Design (Go + TS, mirrored):
- ClusterNode owns an isolation-recovery loop on a 30s ticker (matches the gossip health-check cadence). On each tick, if peer count is 0 and a seedProvider/rebootstrapFn is wired, re-bootstrap.
- Polling rather than event-driven (from removePeer): multi-eviction in a single pingPeers tick can fire several removals back-to-back; triggering from each would either spawn duplicate recoveries or require single-flight machinery. Polling is simpler and correct.
- Public network: re-bootstrap pulls from the omega refresher's current signed list, so an isolated node picks up whatever roots are live *now*, not whatever it started with.
- Private / REPRAM_PEERS: re-bootstrap reuses the static seed list the operator provided.
- Tests use an extracted CheckIsolationAndRecover (Go) / checkIsolationAndRecover (TS) so recovery can be driven deterministically without timers.

Tests:
- internal/cluster/recovery_test.go (+4): happy path, no-op-with-peers, no-seed-provider, empty-seeds
- repram-mcp/src/node/cluster.test.ts (+5): same coverage plus an error-swallowing test for failed re-bootstrap

Suite results: Go: all packages green. TS: 369/369 (was 364, +5 new).

Out-of-scope follow-ups:
- WebSocket reattachment for transient nodes that isolate is handled elsewhere (TreeManager goodbye/redirect logic). This PR is HTTP gossip recovery only.
- Threshold tuning: the current trigger is hard-coded to zero. The issue body floats a "configurable threshold" as an option — deferrable.

Closes #85

// ticktockbent
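For reference, a minimal TypeScript sketch of the check described above. The names (`ISOLATION_RECOVERY_INTERVAL_MS`, `setRebootstrapFn`, `checkIsolationAndRecover`) follow the PR, but the class shape and the `RebootstrapFn` signature are assumptions, not the actual `cluster.ts`:

```typescript
// Sketch only — not the real cluster.ts. Names follow the PR; the class shape
// and the RebootstrapFn signature are assumptions.
export const ISOLATION_RECOVERY_INTERVAL_MS = 30_000;

type RebootstrapFn = () => Promise<void>;

export class ClusterNodeSketch {
  private peers = new Map<string, { addr: string }>();
  private rebootstrapFn?: RebootstrapFn;

  setRebootstrapFn(fn: RebootstrapFn): void {
    this.rebootstrapFn = fn;
  }

  // No-op unless the peer set is empty AND a re-bootstrap source is wired.
  // Errors are swallowed: recovery is best-effort and retried on the next tick.
  async checkIsolationAndRecover(): Promise<void> {
    if (this.peers.size > 0) return; // still have peers — nothing to do
    if (!this.rebootstrapFn) return; // nothing wired — nothing we can do
    try {
      await this.rebootstrapFn();    // re-run bootstrap against current seeds
    } catch {
      // swallowed; a failed attempt just waits for the next 30s tick
    }
  }
}
```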
Addresses suggestions 1, 3, and 5 from the PR #107 cold review. Suggestion 2 (TreeManager ungraceful-disconnect gap) is a pre-existing issue that the PR body misframed as covered; filed as #108 with full repro and a proposed fix path so the misframing becomes truthful by reference.

Suggestion 1 — single-flight on checkIsolationAndRecover (TS): setInterval fires every 30s; the async re-bootstrap can take longer on slow networks. Without a guard, a second tick can launch a second concurrent recovery and produce duplicate addPeer calls. Added a `recovering` flag + try/finally release. Go is naturally single-flight because the goroutine blocks on Bootstrap() before the next ticker fires, so no Go-side change.

Suggestion 3 — TS tests for the timer-driven path: added two tests using vitest fake timers:
- guards against overlapping recovery cycles (single-flight)
- invokes recovery on the setInterval tick and stops on .stop()
These exercise the dispatch path that was previously untested at the unit level.

Suggestion 5 — TS comment parity for the private-seed snapshot: mirrors the Go-side comment noting that the seed list captured at startup is not refreshed at runtime; operators rotating seeds need to restart the node.

Skipped: Suggestion 4 (currentList event-loop guarantee comment) — very minor.

Suite results: TS: 371/371 (was 369, +2 new).

// ticktockbent
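A sketch of the single-flight guard from suggestion 1 — the `recovering` flag and try/finally release are described in the PR; the surrounding class shape is assumed:

```typescript
// Single-flight sketch (suggestion 1) — the surrounding class shape is assumed.
class ClusterNodeSketch {
  private recovering = false;
  private peers = new Map<string, unknown>();
  private rebootstrapFn?: () => Promise<void>;

  async checkIsolationAndRecover(): Promise<void> {
    if (this.recovering) return; // a recovery is already in flight
    if (this.peers.size > 0 || !this.rebootstrapFn) return;
    this.recovering = true;
    try {
      await this.rebootstrapFn();
    } catch {
      // swallowed: retried on the next interval tick
    } finally {
      this.recovering = false;   // always release the guard, even on failure
    }
  }
}
```

The Go side needs no equivalent flag because the recovery goroutine blocks on Bootstrap() before the next ticker fires.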
Update — review fixes pushed (`83e52d6`)
Suggestion 2 (TreeManager misframing): the PR body claimed "WebSocket reattachment is handled by TreeManager's goodbye/redirect logic" — only the graceful path actually has that. Ungraceful disconnects (TCP drop, crash, NAT rebind) leave the transient stranded. Filed as #108 with full repro, impact analysis, and a proposed fix. Treat the original "WS reattachment is handled" line in this PR's description as referring only to the graceful path; #108 captures the ungraceful gap as a separate, pre-existing issue.

Suggestion 4 (`currentList` event-loop comment): skipped per author judgment — very minor.

Suite results: TS 371/371 (was 369, +2 new). Go unchanged.
Summary
Closes #85. A node that loses every peer had no in-band path back to the cluster — topology sync exchanges peer lists with known peers, so an empty peer set means there's no one to ask. The burn-in repro was stopping node-b/c briefly; node-a evicted both, then never rejoined until manual restart.
Relationship to #87
The F7 fix in #87 already covered the non-zero-peer slice of this issue: pre-#87, self leaked into the peer map, so `len(p.peers)` overstated by 1 and `performTopologySync`'s threshold check was always satisfied even when the node was missing real peers — sync was permanently skipped. Post-#87 that path works.

What's left is the zero-real-peer case: even with honest counts, topology sync has no peer to broadcast to once the local set is empty. This PR adds the recovery path for that case.
Design (Go + TS, mirrored)
- `ClusterNode` runs an isolation-recovery loop on a 30s ticker (matches the gossip health-check cadence — recovery starts within one tick of full eviction). On each tick, if the peer count is 0 and a seed source is wired, it re-bootstraps.
- Polling rather than event-driven from `removePeer`: multi-eviction in a single `pingPeers` tick can fire several removals back-to-back. Triggering from each would either spawn duplicate recoveries or require single-flight machinery. Polling is simpler and correct (see the dispatch sketch below).
- The check is extracted as `CheckIsolationAndRecover` (Go) / `checkIsolationAndRecover` (TS) so tests can drive recovery deterministically without timers.
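A sketch of the polling dispatch on the TS side, under the assumptions above — `ISOLATION_RECOVERY_INTERVAL_MS`, `start()`, and `stop()` come from the PR, while the `isolationTimer` field name is hypothetical:

```typescript
// Dispatch sketch — the isolationTimer field name is an assumption.
const ISOLATION_RECOVERY_INTERVAL_MS = 30_000;

class ClusterNodeSketch {
  private isolationTimer?: ReturnType<typeof setInterval>;

  start(): void {
    // Poll on a fixed interval rather than react to removePeer: one place to
    // check, and no duplicate triggers when several peers are evicted at once.
    this.isolationTimer = setInterval(() => {
      void this.checkIsolationAndRecover();
    }, ISOLATION_RECOVERY_INTERVAL_MS);
  }

  stop(): void {
    if (this.isolationTimer !== undefined) {
      clearInterval(this.isolationTimer);
      this.isolationTimer = undefined;
    }
  }

  async checkIsolationAndRecover(): Promise<void> {
    // see the isolation-check sketch in the PR description above
  }
}
```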
What changed
- `internal/cluster/node.go` — `seedProvider` field, `SetSeedProvider`, `runIsolationRecovery` goroutine, `CheckIsolationAndRecover` method, `IsolationRecoveryInterval` constant
- `cmd/repram/main.go` — public network: seed provider reads `refresher.Current().Nodes`; private: snapshots `bootstrapNodes`
- `repram-mcp/src/node/cluster.ts` — `rebootstrapFn` field, `setRebootstrapFn`, isolation timer in `start()`, `checkIsolationAndRecover`, `ISOLATION_RECOVERY_INTERVAL_MS` constant
- `repram-mcp/src/index.ts` — public network: re-bootstrap reads `refresher.currentList.nodes`; private: snapshots `seedPeers` (wiring sketched below)
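A sketch of the index.ts wiring described in the last item — the `Refresher`/`Cluster` interfaces, the `wireRebootstrap` helper, and the `bootstrap` call are hypothetical stand-ins for the real types, shown only to illustrate the public-vs-private split:

```typescript
// Wiring sketch — the Refresher/Cluster interfaces and the bootstrap call are
// assumptions standing in for the real index.ts types.
interface Refresher {
  currentList: { nodes: string[] }; // assumed shape of the signed root list
}

interface Cluster {
  setRebootstrapFn(fn: () => Promise<void>): void;
  bootstrap(seeds: string[]): Promise<void>;
}

function wireRebootstrap(cluster: Cluster, refresher: Refresher | undefined, seedPeers: string[]): void {
  if (refresher) {
    // Public network: pull whatever roots are live *now*, not the startup list.
    cluster.setRebootstrapFn(() => cluster.bootstrap(refresher.currentList.nodes));
  } else {
    // Private / REPRAM_PEERS: reuse the operator-provided seed list.
    // The snapshot is not refreshed at runtime — rotating seeds needs a restart.
    const snapshot = [...seedPeers];
    cluster.setRebootstrapFn(() => cluster.bootstrap(snapshot));
  }
}
```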
Tests
- `internal/cluster/recovery_test.go` (+4): happy-path recovery, no-op when peers exist, no-op without seedProvider, no-op on empty seeds.
- `repram-mcp/src/node/cluster.test.ts` (+5): same coverage plus an error-swallowing test for a re-bootstrap that throws (test shape sketched below).

Suite results: Go all packages green; TS 369/369 (was 364, +5 new).
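A sketch of the test shape the extracted check enables — the vitest calls are real, but `Node` below is a minimal stand-in, not the actual ClusterNode:

```typescript
// Test-shape sketch only: "Node" is a stand-in, not the real ClusterNode.
import { describe, expect, it, vi } from "vitest";

class Node {
  peers = new Map<string, string>();
  private rebootstrapFn?: () => Promise<void>;
  setRebootstrapFn(fn: () => Promise<void>) { this.rebootstrapFn = fn; }
  async checkIsolationAndRecover() {
    if (this.peers.size > 0 || !this.rebootstrapFn) return;
    try { await this.rebootstrapFn(); } catch { /* swallowed */ }
  }
}

describe("checkIsolationAndRecover", () => {
  it("re-bootstraps when the peer set is empty", async () => {
    const rebootstrap = vi.fn(async () => {});
    const node = new Node();
    node.setRebootstrapFn(rebootstrap);
    await node.checkIsolationAndRecover(); // driven directly — no timers needed
    expect(rebootstrap).toHaveBeenCalledTimes(1);
  });

  it("is a no-op while peers exist", async () => {
    const rebootstrap = vi.fn(async () => {});
    const node = new Node();
    node.setRebootstrapFn(rebootstrap);
    node.peers.set("peer-1", "http://peer-1:8080");
    await node.checkIsolationAndRecover();
    expect(rebootstrap).not.toHaveBeenCalled();
  });
});
```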
Out of scope (follow-ups)
- WebSocket reattachment for transient nodes that isolate is handled elsewhere (TreeManager goodbye/redirect logic — graceful path only; the ungraceful gap is tracked in #108). This PR is HTTP gossip recovery only.
- Threshold tuning: the current trigger is hard-coded to zero. The issue body floats a configurable threshold as an option — deferrable.
Test plan
- Live-wire compat run (`./test/live-wire-compat/run.sh`) passes — adds confidence that the timer-driven path doesn't interfere with normal operation

Closes
Closes #85
// ticktockbent