
fix(tree): F108 reattach transient nodes after ungraceful disconnect#110

Merged
TickTockBent merged 2 commits into main from fix/108-ts-ungraceful-reattach
Apr 27, 2026

Conversation

@TickTockBent
Owner

Summary

Closes #108. Pre-fix, a transient TS node lost its substrate parent permanently if the WebSocket closed ungracefully (TCP drop, crash, NAT rebind). Only the graceful goodbye path triggered reattachment; the ungraceful close handler nulled parentConnection, logged a warning referencing nonexistent "heartbeat-based reattachment," and left the node operating degraded until restart. On residential networks where ungraceful disconnects are routine (parent restarts that don't go through clean shutdown, NAT rebinds), this made the transient role unreliable.

Design (walked through pre-implementation)

Unified close handler. Both graceful (goodbye-then-close) and ungraceful paths flow through triggerReattach. Graceful supplies the goodbye-payload alternatives; ungraceful supplies undefined and the loop falls back to cached topology then the seed list.

Three-layer alternatives lookup, attempted once per cycle:

| # | Source | Per-attempt timeout | Total layer cap | Notes |
|---|--------|---------------------|-----------------|-------|
| 1 | Goodbye-payload alternatives | 10s | n/a | First iteration only, if present |
| 2 | Cached `welcome.topology` from attach time | 5s | 30s | Stale by definition; fast-fail on TCP refused, cap on blackhole |
| 3 | Seed list (`setSeedProvider`) | 10s | n/a | Public: omega refresher's `currentList`; private: `REPRAM_PEERS` snapshot |

The cached-layer per-attempt timeout is shorter than the seed layer because cached entries are speculative — a substrate that left the cluster fails fast on TCP refused. The 30s total cap bounds the worst case (blackhole holes in a fully-stale cache) so the loop reaches fresh seeds without minutes of waste.
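A hedged sketch of one lookup cycle under those caps. The `tryAttach` callback, candidate shape, and constants are assumptions taken from the table above, not the repram source:

```typescript
// One cycle of the three-layer walk. Caps follow the table above:
// goodbye alts at 10s/attempt, cached topology at 5s/attempt with a
// 30s layer deadline, seeds at 10s/attempt.
type Candidate = { addr: string };

async function runCycle(
  goodbyeAlts: Candidate[] | undefined, // layer 1, first iteration only
  cachedTopology: Candidate[],          // layer 2, speculative entries
  seeds: Candidate[],                   // layer 3, fresh seed list
  tryAttach: (c: Candidate, timeoutMs: number) => Promise<boolean>,
): Promise<Candidate | null> {
  if (goodbyeAlts) {
    for (const c of goodbyeAlts) if (await tryAttach(c, 10_000)) return c;
  }
  const layerDeadline = Date.now() + 30_000; // bounds a fully-stale cache
  for (const c of cachedTopology) {
    if (Date.now() >= layerDeadline) break; // fall through to fresh seeds
    if (await tryAttach(c, 5_000)) return c;
  }
  for (const c of seeds) if (await tryAttach(c, 10_000)) return c;
  return null; // caller backs off and retries the whole cycle
}
```

The sequential walk keeps the logic simple; the per-layer deadline, not parallelism, is what bounds the worst case.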

Exponential backoff between full cycles: 5s → 10s → 30s → 60s, capped. Retries forever (until success or stop()) — same philosophy as #85's isolation recovery. The transient stays alive and keeps trying; operators detect "stuck retrying" via metrics rather than detect-and-restart.
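The backoff schedule itself is easy to pin down; the values below come from the description above, while the constant name is an assumption:

```typescript
// Backoff schedule sketch: 5s -> 10s -> 30s -> 60s, then capped at 60s
// forever (retries never stop until success or stop()).
const BACKOFF_MS = [5_000, 10_000, 30_000, 60_000];

function backoffDelay(cycle: number): number {
  // cycle is 0-based; clamp at the last entry so the loop retries forever
  return BACKOFF_MS[Math.min(cycle, BACKOFF_MS.length - 1)];
}
```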

Concurrency. Single-flight via the existing reattaching flag. stop()-aware:

  • stopping flag short-circuits triggerReattach
  • stopSignal Promise races against backoff sleep so shutdown wakes the loop within milliseconds rather than at end-of-backoff
  • Identity-aware close/goodbye handlers (if (this.parentConnection !== conn) return) don't clobber a new parent when a stale connection's events fire after a successful reattach
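The stopSignal race in the second bullet can be sketched like this; `makeStopSignal` and `interruptibleSleep` are illustrative names, not the actual tree.ts helpers:

```typescript
// Racing a stop signal against the backoff sleep, so shutdown wakes the
// loop immediately instead of at end-of-backoff.
function makeStopSignal(): { signal: Promise<"stopped">; stop: () => void } {
  let fire!: () => void;
  const signal = new Promise<"stopped">((res) => { fire = () => res("stopped"); });
  return { signal, stop: () => fire() };
}

async function interruptibleSleep(
  ms: number,
  stopSignal: Promise<"stopped">,
): Promise<"slept" | "stopped"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const sleep = new Promise<"slept">((res) => {
    timer = setTimeout(() => res("slept"), ms);
  });
  const winner = await Promise.race([sleep, stopSignal]);
  if (timer !== undefined) clearTimeout(timer); // don't keep the process alive
  return winner;
}
```

The same race is what the "+5" test suite exercises with a 60s sleep.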

Tests (+5)

  • captures welcome.topology as cached alternatives, excluding self
  • parseSeedAddress handles host:port and rejects malformed input
  • close handler on a stale connection does not clobber the active parent (identity check)
  • ungraceful close on the active parent triggers the reattach loop (verified via observable seedProvider call)
  • stop() sets stopping flag and resolves pending sleeps promptly (races stopSignal against a 60s sleep)

Suite results: TS 376/376 (was 371, +5). Go unchanged.

Out of scope (acknowledged)

  • Refreshing lastKnownAlternatives during attach. The cached topology is captured at attach time and never refreshed. Overlaps with #67 (attachment selection: scoring algorithm and periodic reassessment) and #68 (attachment migration: overlap, switch, hysteresis), the Phase 3 attachment-migration work, which is the right place to address dynamic topology awareness. Deferred deliberately.
  • Parallel cached-layer attempts via Promise.race. The sequential walk with fail-fast timeouts and the 30s layer cap is sufficient; parallel attempts would require abort-signal support in connectToSubstrate. Noted as a future optimization if cache staleness becomes a real operational issue.
  • Metrics. Reattach counts and time-to-recover would help operators detect stuck transients. Separate concern; could land alongside #16 (public network dashboard).

Test plan

  • CI green
  • Manual: spin up a substrate + transient locally; SIGKILL the substrate; verify the transient logs reattach attempts and recovers when the substrate restarts
  • Verify the existing graceful-goodbye path still works (covered by existing reattachment on goodbye test)

Closes

Closes #108

// ticktockbent

Closes #108. Pre-fix, a transient TS node lost its substrate parent
permanently if the WebSocket closed ungracefully (TCP drop, crash, NAT
rebind). Only the graceful goodbye path triggered reattachment; the
ungraceful close handler nulled parentConnection, logged a warning
referencing nonexistent "heartbeat-based reattachment," and left the
node operating degraded until restart. On residential networks where
ungraceful disconnects are common, this made the transient role
unreliable in practice.

Design (#108 walked through in PR description):

Unified close handler. Both graceful (goodbye-then-close) and ungraceful
paths now flow through triggerReattach. Graceful supplies the
goodbye-payload alternatives; ungraceful supplies undefined and the
loop falls back to cached topology then the seed list.

Three-layer alternatives lookup, attempted once per cycle:
  1. Goodbye-supplied alternatives (only on first iteration if present).
  2. Cached welcome.topology snapshot from attach time. Per-attempt
     timeout 5s (cached entries are speculative — a substrate that has
     left the cluster fails fast on TCP refused). Total layer deadline
     30s so a fully-stale cache cannot waste minutes before falling
     through to the seed list.
  3. Seed list from setSeedProvider. Per-attempt 10s. Wired in index.ts
     to omega refresher's currentList (public) or REPRAM_PEERS snapshot
     (private), mirroring the seedProvider pattern from #85.
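A minimal sketch of the `host:port` parsing described in layer 3. The validation rules here are assumptions; the actual `parseSeedAddress` (and the `index.ts` split noted in review as inconsistent) may differ:

```typescript
// Hypothetical parseSeedAddress: accepts host:port, rejects malformed
// input. Splits on the LAST colon; bracketed IPv6 is not handled here,
// matching the known inconsistency flagged in review.
function parseSeedAddress(raw: string): { host: string; port: number } | null {
  const i = raw.lastIndexOf(":");
  if (i <= 0 || i === raw.length - 1) return null; // no host or no port
  const host = raw.slice(0, i);
  const port = Number(raw.slice(i + 1));
  if (!Number.isInteger(port) || port < 1 || port > 65535) return null;
  return { host, port };
}
```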

Exponential backoff between full cycles: 5s → 10s → 30s → 60s, capped.
Retries forever (until success or stop()) — same philosophy as #85's
isolation recovery loop. The transient stays alive and keeps trying;
metrics surface "stuck retrying for hours" rather than requiring
operator detect-and-restart.

Single-flight via existing reattaching flag. stop()-aware:
  - stopping flag short-circuits triggerReattach
  - stopSignal Promise races against backoff sleep so shutdown wakes
    the loop within milliseconds rather than at end-of-backoff
  - Identity-aware close/goodbye handlers don't clobber a new parent
    when a stale connection's events fire after a successful reattach

Tests (repram-mcp/src/node/tree.test.ts, +5):
  - captures welcome.topology as cached alternatives (excludes self)
  - parseSeedAddress handles host:port and rejects malformed input
  - close handler on stale conn doesn't clobber active parent (identity)
  - ungraceful close on active parent triggers the reattach loop
    (verified via observable seedProvider call)
  - stop() sets stopping flag and resolves pending sleeps promptly

Suite results:
  TS: 376/376 (was 371, +5 new)
  Go: unchanged (no Go side touched)

Out of scope (acknowledged in PR body):
  - Refreshing lastKnownAlternatives during attach. Overlaps with
    #67/#68 (Phase 3 attachment migration); deferred.
  - Parallel cached-layer attempts via Promise.race. Sequential with
    fail-fast timeout + layer-total cap is sufficient.
  - Metrics for reattach counts / time-to-recover.

Closes #108

// ticktockbent

…ck comment

Addresses suggestions 1 and 2 from PR #110 cold review.

Suggestion 2 — silent-termination race in tryAlternatives:
  Sequence: attach(connB) succeeds (welcome received, parentConnection
  set to connB, close handler installed). Before tryAlternatives can
  call onReattachCallback and return true, connB drops. Its close
  handler fires → nulls parentConnection → calls triggerReattach,
  which sees `reattaching=true` (we still hold the flag) and returns.
  The outer loop then sees `welcome` non-null and returns true, marking
  the cycle a success and exiting. Node is now parentless with no
  further reattach scheduled.

  Fix: after attach() returns welcome, check `parentConnection ===
  conn && !conn.isClosed` before declaring success. If the conn died,
  log and continue to the next alternative. This keeps the outer loop
  iterating so a fresh cycle eventually runs and reattaches.
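The liveness check described in the fix can be captured as a small predicate; the conn shape and names are assumptions, not the actual tryAlternatives code:

```typescript
// After attach() returns welcome, confirm the conn is still the live
// parent before declaring the cycle a success. If it dropped in the
// window, the caller continues to the next alternative instead of
// exiting parentless.
interface Conn { isClosed: boolean }

function confirmAttachSuccess(
  parentConnection: Conn | null,
  conn: Conn,
): boolean {
  return parentConnection === conn && !conn.isClosed;
}
```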

Suggestion 1 — comment clarity on the close-handler identity check:
  The reviewer noted the `!== conn` guard handles two cases, only one
  of which the original comment described. Updated to explicitly cover
  both: (1) stale handler from a previously-replaced conn, and (2)
  post-goodbye close where parentConnection was already nulled by the
  goodbye handler. The single-flight flag would have blocked the
  duplicate triggerReattach in case (2) anyway, but the early return
  avoids the spurious entry.

Skipped:
  - Suggestion 3 (CACHED_LAYER_DEADLINE_MS comment): cosmetic.
  - Suggestion 4 (parseSeedAddress vs index.ts split-on-first-colon
    inconsistency): pre-existing scope creep.

// ticktockbent
@TickTockBent
Owner Author

Update — review fixes pushed (`0bbac84`)

  • Suggestion 2 (real race fix): `tryAlternatives` now checks `parentConnection === conn && !conn.isClosed` after `attach()` returns welcome but before declaring success. Pre-fix, a connB that dropped between welcome receipt and the outer loop's success path would silently terminate the cycle (close handler tries `triggerReattach` but is blocked by the single-flight flag we still hold). Post-fix, that case is detected and the loop continues to the next alternative. Hard to test deterministically without injecting a closed conn mid-attach; the fix is small and the logic is straightforward to inspect.

  • Suggestion 1 (comment): updated the close-handler comment to explicitly cover both cases the `!== conn` guard catches: stale handler from a previously-replaced conn, AND post-goodbye close where parentConnection was already nulled by the goodbye handler.

  • Skipped:

    • Suggestion 3 (`CACHED_LAYER_DEADLINE_MS + CACHED_ALT_CONNECT_TIMEOUT_MS` worst-case comment) — cosmetic.
    • Suggestion 4 (`parseSeedAddress` vs `index.ts` IPv6-handling inconsistency) — pre-existing, scope creep here.
    • Nits — comment/test assertions, layer-2 deadline test, port validation tweak. Worth tracking if anyone wants them but not in this PR.

Suite results: TS 376/376 (no test count change; the race-guard fix is a behavioral guard, not directly tested without injection machinery).

@TickTockBent TickTockBent merged commit e4c050a into main Apr 27, 2026
4 checks passed
@TickTockBent TickTockBent deleted the fix/108-ts-ungraceful-reattach branch April 27, 2026 18:30


Development

Successfully merging this pull request may close these issues.

TS transient nodes don't reattach after ungraceful parent disconnect (TreeManager gap)
