fix(tree): F108 reattach transient nodes after ungraceful disconnect #110
TickTockBent merged 2 commits into main
Conversation
Closes #108. Pre-fix, a transient TS node lost its substrate parent permanently if the WebSocket closed ungracefully (TCP drop, crash, NAT rebind). Only the graceful goodbye path triggered reattachment; the ungraceful close handler nulled `parentConnection`, logged a warning referencing nonexistent "heartbeat-based reattachment," and left the node operating degraded until restart. On residential networks, where ungraceful disconnects are common, this made the transient role unreliable in practice.

**Design** (#108 walked through in PR description):

**Unified close handler.** Both graceful (goodbye-then-close) and ungraceful paths now flow through `triggerReattach`. Graceful supplies the goodbye-payload alternatives; ungraceful supplies `undefined`, and the loop falls back to cached topology, then the seed list.

**Three-layer alternatives lookup**, attempted once per cycle:
1. Goodbye-supplied alternatives (only on the first iteration, if present).
2. Cached `welcome.topology` snapshot from attach time. Per-attempt timeout 5s (cached entries are speculative — a substrate that has left the cluster fails fast on TCP refused). Total layer deadline 30s, so a fully stale cache cannot waste minutes before falling through to the seed list.
3. Seed list from `setSeedProvider`. Per-attempt timeout 10s. Wired in `index.ts` to the omega refresher's `currentList` (public) or a `REPRAM_PEERS` snapshot (private), mirroring the seedProvider pattern from #85.

**Exponential backoff** between full cycles: 5s → 10s → 30s → 60s, capped. Retries forever (until success or `stop()`), the same philosophy as #85's isolation recovery loop. The transient stays alive and keeps trying; metrics surface "stuck retrying for hours" rather than requiring operator detect-and-restart. Single-flight via the existing `reattaching` flag.
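The cycle described above can be sketched as a small TypeScript loop. This is an illustrative reconstruction, not the actual repram-mcp implementation: the function names (`reattachLoop`, `tryAttach`, `getCachedTopology`, `seedProvider`) and signatures are hypothetical; only the layer ordering, timeouts, and backoff values come from the PR description.

```typescript
// Hypothetical sketch of the three-layer reattach cycle with capped
// exponential backoff. Names and signatures are illustrative.

type Addr = string;

const BACKOFF_MS = [5_000, 10_000, 30_000, 60_000]; // capped at 60s

async function reattachLoop(
  goodbyeAlternatives: Addr[] | undefined,
  getCachedTopology: () => Addr[],
  seedProvider: () => Addr[],
  tryAttach: (addr: Addr, timeoutMs: number) => Promise<boolean>,
  sleep: (ms: number) => Promise<void>,
): Promise<void> {
  for (let cycle = 0; ; cycle++) {
    // Layer 1: goodbye-supplied alternatives, first cycle only.
    if (cycle === 0 && goodbyeAlternatives?.length) {
      for (const addr of goodbyeAlternatives) {
        if (await tryAttach(addr, 10_000)) return;
      }
    }
    // Layer 2: cached topology snapshot. 5s per attempt, 30s total
    // deadline so a fully stale cache cannot starve the seed layer.
    const layerDeadline = Date.now() + 30_000;
    for (const addr of getCachedTopology()) {
      if (Date.now() >= layerDeadline) break;
      if (await tryAttach(addr, 5_000)) return;
    }
    // Layer 3: seed list. 10s per attempt.
    for (const addr of seedProvider()) {
      if (await tryAttach(addr, 10_000)) return;
    }
    // No layer succeeded: back off, then run another full cycle.
    // Retries forever; in the real code stop() breaks this loop.
    await sleep(BACKOFF_MS[Math.min(cycle, BACKOFF_MS.length - 1)]);
  }
}
```

Note how layer 1 is consulted only on the first iteration: after one failed pass, goodbye-supplied addresses are no fresher than the cache, so later cycles go straight to layers 2 and 3.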
**stop()-aware:**
- The `stopping` flag short-circuits `triggerReattach`.
- A `stopSignal` Promise races against the backoff sleep, so shutdown wakes the loop within milliseconds rather than at end-of-backoff.
- Identity-aware close/goodbye handlers don't clobber a new parent when a stale connection's events fire after a successful reattach.

**Tests** (`repram-mcp/src/node/tree.test.ts`, +5):
- Captures `welcome.topology` as cached alternatives (excludes self).
- `parseSeedAddress` handles `host:port` and rejects malformed input.
- Close handler on a stale conn doesn't clobber the active parent (identity).
- Ungraceful close on the active parent triggers the reattach loop (verified via an observable seedProvider call).
- `stop()` sets the `stopping` flag and resolves pending sleeps promptly.

Suite results: TS 376/376 (was 371, +5 new). Go: unchanged (no Go side touched).

**Out of scope** (acknowledged in PR body):
- Refreshing `lastKnownAlternatives` during attach. Overlaps with #67/#68 (Phase 3 attachment migration); deferred.
- Parallel cached-layer attempts via `Promise.race`. Sequential with fail-fast timeout + layer-total cap is sufficient.
- Metrics for reattach counts / time-to-recover.

Closes #108. // ticktockbent
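The stop()-aware backoff can be sketched with a `Promise.race` between the timer and a stop signal. This is a minimal illustration of the pattern, not the repram-mcp code: the class and member names (`Stoppable`, `interruptibleSleep`, `resolveStop`) are invented for the example; only the race-against-`stopSignal` idea comes from the PR.

```typescript
// Illustrative sketch: racing a backoff timer against a stopSignal
// promise so stop() wakes a sleeping loop immediately. Names are
// hypothetical, not the actual repram-mcp API.

class Stoppable {
  private stopping = false;
  private resolveStop!: () => void;
  private readonly stopSignal = new Promise<void>((res) => {
    this.resolveStop = res;
  });

  stop(): void {
    this.stopping = true;
    this.resolveStop(); // wakes any in-flight backoff sleep within ms
  }

  get isStopping(): boolean {
    return this.stopping;
  }

  // Resolves after `ms`, or immediately once stop() has been called.
  async interruptibleSleep(ms: number): Promise<void> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const wait = new Promise<void>((res) => {
      timer = setTimeout(res, ms);
    });
    await Promise.race([wait, this.stopSignal]);
    if (timer !== undefined) clearTimeout(timer); // don't keep the process alive
  }
}
```

Because the resolved `stopSignal` short-circuits every subsequent race, a loop that checks `isStopping` after each sleep exits promptly even mid-way through a 60s backoff.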
Addresses suggestions 1 and 2 from the PR #110 cold review.

**Suggestion 2 — silent-termination race in `tryAlternatives`:**

Sequence: `attach(connB)` succeeds (welcome received, `parentConnection` set to connB, close handler installed). Before `tryAlternatives` can call `onReattachCallback` and return true, connB drops. Its close handler fires → nulls `parentConnection` → calls `triggerReattach`, which sees `reattaching=true` (we still hold the flag) and returns. The outer loop then sees `welcome` non-null and returns true, marking the cycle a success and exiting. The node is now parentless with no further reattach scheduled.

Fix: after `attach()` returns welcome, check `parentConnection === conn && !conn.isClosed` before declaring success. If the conn died, log and continue to the next alternative. This keeps the outer loop iterating so a fresh cycle eventually runs and reattaches.

**Suggestion 1 — comment clarity on the close-handler identity check:**

The reviewer noted the `!== conn` guard handles two cases, only one of which the original comment described. Updated to explicitly cover both: (1) a stale handler from a previously replaced conn, and (2) a post-goodbye close where `parentConnection` was already nulled by the goodbye handler. The single-flight flag would have blocked the duplicate `triggerReattach` in case (2) anyway, but the early return avoids the spurious entry.

Skipped:
- Suggestion 3 (`CACHED_LAYER_DEADLINE_MS` comment): cosmetic.
- Suggestion 4 (`parseSeedAddress` vs `index.ts` split-on-first-colon inconsistency): pre-existing; scope creep.

// ticktockbent
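The race guard can be shown as a small post-attach liveness check. This is a hedged sketch of the fix described above, not the actual code: the `Conn`/`NodeState` shapes and the helper name `confirmAttach` are invented for illustration; the `parentConnection === conn && !conn.isClosed` condition is the one named in the comment.

```typescript
// Sketch of the silent-termination race guard: after attach() resolves,
// verify the connection is still the live, active parent before the
// cycle reports success. Types and helper name are illustrative.

interface Conn {
  isClosed: boolean;
}

interface NodeState {
  parentConnection: Conn | null;
}

// Returns true only if `conn` survived as the active parent. On false,
// the caller logs and continues to the next alternative, keeping the
// outer loop iterating so a fresh cycle eventually reattaches.
function confirmAttach(state: NodeState, conn: Conn): boolean {
  if (state.parentConnection !== conn || conn.isClosed) {
    // conn dropped between welcome and here: its close handler already
    // nulled parentConnection, and its triggerReattach was swallowed by
    // the single-flight flag — so success must NOT be reported.
    return false;
  }
  return true;
}
```

The key point is that both halves of the condition matter: identity catches a parent replaced or nulled by a racing close handler, while `isClosed` catches a conn that died before its close handler ran.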
Update — review fixes pushed (`0bbac84`)
Suite results: TS 376/376 (no test count change; the race-guard fix is a behavioral guard, not directly tested without injection machinery).