
fix(cluster): F5 isolation recovery — re-bootstrap when peer count hits 0 (#107)

Merged
TickTockBent merged 2 commits into main from fix/85-rebootstrap-on-isolation on Apr 27, 2026

Conversation

@TickTockBent
Owner

Summary

Closes #85. A node that loses every peer (e.g., eviction after three consecutive failed pings across all peers, or a brief network outage) had no in-band path back to the cluster — topology sync exchanges peer lists with known peers, so an empty peer set means there's no one to ask. The burn-in repro was stopping node-b/c briefly; node-a evicted both, then never rejoined until manual restart.

Relationship to #87

The F7 fix in #87 already covered the non-zero-peer slice of this issue: pre-#87, self leaked into the peer map, so `len(p.peers)` overstated the count by one and performTopologySync's threshold check was always satisfied even when the node was missing real peers — sync was permanently skipped. Post-#87 that path works.

What's left is the zero-real-peer case: even with honest counts, topology sync has no peer to broadcast to once the local set is empty. This PR adds the recovery path for that case.

Design (Go + TS, mirrored)

  • ClusterNode runs an isolation-recovery loop on a 30s ticker (matches the gossip health-check cadence — recovery starts within one tick of full eviction).
  • Each tick: if the peer count is 0 and a seedProvider (Go) or rebootstrapFn (TS) is wired, re-bootstrap (see the sketch after this list).
  • Polling rather than event-driven from removePeer: multi-eviction in a single pingPeers tick can fire several removals back-to-back. Triggering from each would either spawn duplicate recoveries or require single-flight machinery. Polling is simpler and correct.
  • Public network: re-bootstrap pulls from the omega refresher's current signed list. An isolated node picks up whatever roots are live now, not whatever it started with.
  • Private / REPRAM_PEERS: re-bootstrap reuses the static seed list the operator provided (captured by closure at startup).
  • The recovery method is extracted as CheckIsolationAndRecover (Go) / checkIsolationAndRecover (TS) so tests can drive recovery deterministically without timers.
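
For concreteness, here is a minimal TypeScript sketch of that loop. The class shape and peer map are assumptions for illustration; only `checkIsolationAndRecover`, `setRebootstrapFn`, the timer in `start()`, and `ISOLATION_RECOVERY_INTERVAL_MS` are names from this PR, and the real code in repram-mcp/src/node/cluster.ts may differ.

```ts
// Illustrative sketch only; not the actual implementation.
type RebootstrapFn = () => Promise<void>;

// Matches the gossip health-check cadence, so recovery starts within one
// tick of full eviction.
const ISOLATION_RECOVERY_INTERVAL_MS = 30_000;

class ClusterNode {
  private peers = new Map<string, string>(); // address -> peer state (assumed shape)
  private rebootstrapFn?: RebootstrapFn;
  private isolationTimer?: ReturnType<typeof setInterval>;

  setRebootstrapFn(fn: RebootstrapFn): void {
    this.rebootstrapFn = fn;
  }

  start(): void {
    // Poll on a fixed interval instead of reacting to removePeer, so a
    // multi-eviction burst in one ping tick triggers at most one recovery.
    this.isolationTimer = setInterval(
      () => void this.checkIsolationAndRecover(),
      ISOLATION_RECOVERY_INTERVAL_MS,
    );
  }

  stop(): void {
    if (this.isolationTimer) clearInterval(this.isolationTimer);
  }

  // Extracted so tests can drive recovery directly, without timers.
  async checkIsolationAndRecover(): Promise<void> {
    if (this.peers.size > 0 || !this.rebootstrapFn) return; // not isolated, or nothing wired
    try {
      await this.rebootstrapFn(); // re-bootstrap from the current seed source
    } catch {
      // Errors are swallowed; the next 30s tick is the de-facto retry.
    }
  }
}
```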

What changed

  • internal/cluster/node.go: seedProvider field, SetSeedProvider, runIsolationRecovery goroutine, CheckIsolationAndRecover method, IsolationRecoveryInterval constant
  • cmd/repram/main.go: wires the seedProvider — public closes over refresher.Current().Nodes, private snapshots bootstrapNodes
  • repram-mcp/src/node/cluster.ts: rebootstrapFn field, setRebootstrapFn, isolation timer in start(), checkIsolationAndRecover, ISOLATION_RECOVERY_INTERVAL_MS constant
  • repram-mcp/src/index.ts: wires the rebootstrapFn — public closes over refresher.currentList.nodes, private snapshots seedPeers (see the wiring sketch below)
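
A rough sketch of the TS-side wiring, covering both modes. Only `setRebootstrapFn`, `refresher.currentList.nodes`, and `seedPeers` are names from this PR; `bootstrap()` and the surrounding shapes are assumptions.

```ts
// Hypothetical wiring sketch; not the actual index.ts code.
interface OmegaRefresher {
  currentList: { nodes: string[] };
}
interface RecoverableNode {
  setRebootstrapFn(fn: () => Promise<void>): void;
  bootstrap(seeds: string[]): Promise<void>; // assumed re-bootstrap entry point
}

function wireRebootstrap(
  node: RecoverableNode,
  refresher?: OmegaRefresher,
  seedPeers: string[] = [],
): void {
  if (refresher) {
    // Public network: read the refresher's current signed list on every
    // attempt, so an isolated node rejoins whatever roots are live *now*.
    node.setRebootstrapFn(() => node.bootstrap(refresher.currentList.nodes));
  } else {
    // Private / REPRAM_PEERS: static seed list captured by closure at
    // startup; it is not refreshed at runtime.
    const seeds = [...seedPeers];
    node.setRebootstrapFn(() => node.bootstrap(seeds));
  }
}
```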

Tests

  • internal/cluster/recovery_test.go (+4): happy-path recovery, no-op when peers exist, no-op without seedProvider, no-op on empty seeds.
  • repram-mcp/src/node/cluster.test.ts (+5): same coverage plus an error-swallowing test for a re-bootstrap that throws (the happy-path shape is sketched below).
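
As an illustration, the TS happy-path case might look roughly like this; the real test file may structure it differently, and the import path and constructor are assumptions.

```ts
import { describe, expect, it, vi } from "vitest";
import { ClusterNode } from "./cluster"; // path assumed; mirrors the sketch above

describe("checkIsolationAndRecover", () => {
  it("re-bootstraps when the peer set is empty", async () => {
    const rebootstrap = vi.fn().mockResolvedValue(undefined);
    const node = new ClusterNode();        // assumed constructible without args
    node.setRebootstrapFn(rebootstrap);

    await node.checkIsolationAndRecover(); // zero peers -> recovery fires

    expect(rebootstrap).toHaveBeenCalledTimes(1);
  });
});
```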

Suite results: Go all packages green; TS 369/369 (was 364, +5 new).

Out of scope (follow-ups)

  • WebSocket reattachment: transient nodes that isolate need their WS attachment to recover too. That's handled separately by TreeManager's goodbye/redirect logic. This PR is HTTP gossip recovery only.
  • Threshold tuning: current trigger is hard-coded zero. The issue body floats "configurable threshold" — deferrable until someone needs it.
  • Backoff under sustained outage: the 30s polling interval doubles as the de-facto retry backoff. If the cluster is fully down, re-bootstrap fails every 30s indefinitely. No exponential backoff added — keeps the design simpler. Easy to add later if seeds become rate-limited under sustained-outage scenarios.

Test plan

  • CI green
  • Live wire-compat (./test/live-wire-compat/run.sh) passes — adds confidence that the timer-driven path doesn't interfere with normal operation
  • Manual repro of the burn-in scenario: spin up 3-node cluster, stop 2 nodes, verify the third recovers when the cluster comes back

Closes

Closes #85

// ticktockbent


@TickTockBent
Owner Author

Update — review fixes pushed (`83e52d6`)

  • Suggestion 1 (single-flight, TS): added a `recovering` flag with try/finally release in `checkIsolationAndRecover` (sketched after this list). Without it, a slow re-bootstrap could outlast the 30s timer tick and produce duplicate `addPeer` calls. Go is naturally single-flight (the goroutine blocks on `Bootstrap()` before the next ticker fires), so no Go change.
  • Suggestion 3 (TS timer test): two new tests using vitest fake timers — overlapping-recovery guard, plus the timer-fires-and-stops dispatch path.
  • Suggestion 5 (TS private-seed comment): mirrored the Go comment noting the seed list captured at startup isn't refreshed at runtime; operators rotating seeds need to restart the node.
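
Roughly, the guarded method now has this shape (sketch only; everything beyond the `recovering` flag, the try/finally release, and `checkIsolationAndRecover` is assumed):

```ts
// Sketch of the single-flight guard, not the actual cluster.ts code.
class ClusterNode {
  private peers = new Map<string, string>();
  private rebootstrapFn?: () => Promise<void>;
  private recovering = false;

  async checkIsolationAndRecover(): Promise<void> {
    if (this.recovering) return; // a previous recovery is still in flight
    if (this.peers.size > 0 || !this.rebootstrapFn) return;
    this.recovering = true;
    try {
      await this.rebootstrapFn();
    } catch {
      // swallowed; the next 30s tick retries
    } finally {
      this.recovering = false; // always release, even if re-bootstrap throws
    }
  }
}
```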

Suggestion 2 (TreeManager misframing): the PR body claimed "WebSocket reattachment is handled by TreeManager's goodbye/redirect logic" — only the graceful path actually has that. Ungraceful disconnects (TCP drop, crash, NAT rebind) leave the transient node stranded. Filed as #108 with full repro, impact analysis, and proposed fix. Treat the original "WS reattachment is handled" line in this PR's description as referring only to the graceful path; #108 captures the ungraceful gap as a separate, pre-existing issue.

Suggestion 4 (`currentList` event-loop comment): skipped per author judgment — very minor.

Suite results: TS 371/371 (was 369, +2 new). Go unchanged.

TickTockBent merged commit 3ca2fa9 into main on Apr 27, 2026
4 checks passed
TickTockBent deleted the fix/85-rebootstrap-on-isolation branch on April 27, 2026 at 15:26

Development

Successfully merging this pull request may close these issues.

F5: topology sync doesn't refresh peers after eviction → cluster can lose all peers permanently

1 participant