
fix(tree): F108 reattach transient nodes after ungraceful disconnect#110

Merged
TickTockBent merged 2 commits into main from fix/108-ts-ungraceful-reattach
Apr 27, 2026

Conversation

@TickTockBent
Owner

Summary

Closes #108. Pre-fix, a transient TS node lost its substrate parent permanently if the WebSocket closed ungracefully (TCP drop, crash, NAT rebind). Only the graceful goodbye path triggered reattachment; the ungraceful close handler nulled parentConnection, logged a warning referencing nonexistent "heartbeat-based reattachment," and left the node operating degraded until restart. On residential networks where ungraceful disconnects are routine (parent restarts that don't go through clean shutdown, NAT rebinds), this made the transient role unreliable.

Design (walked through pre-implementation)

Unified close handler. Both graceful (goodbye-then-close) and ungraceful paths flow through triggerReattach. Graceful supplies the goodbye-payload alternatives; ungraceful supplies undefined and the loop falls back to cached topology then the seed list.

Three-layer alternatives lookup, attempted once per cycle:

| # | Source | Per-attempt timeout | Total layer cap | Notes |
|---|--------|---------------------|-----------------|-------|
| 1 | Goodbye-payload alternatives | 10s | n/a | First iteration only, if present |
| 2 | Cached `welcome.topology` from attach time | 5s | 30s | Stale by definition; fast-fail on TCP refused, cap on blackhole |
| 3 | Seed list (`setSeedProvider`) | 10s | n/a | Public: omega refresher's `currentList`; private: `REPRAM_PEERS` snapshot |

The cached-layer per-attempt timeout is shorter than the seed layer because cached entries are speculative — a substrate that left the cluster fails fast on TCP refused. The 30s total cap bounds the worst case (blackhole holes in a fully-stale cache) so the loop reaches fresh seeds without minutes of waste.
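A hedged sketch of one lookup cycle under those caps. The `tryAttach` callback, candidate shape, and constants are assumptions taken from the table above, not the repram source:

```typescript
// One cycle of the three-layer walk. Caps follow the table above:
// goodbye alts at 10s/attempt, cached topology at 5s/attempt with a
// 30s layer deadline, seeds at 10s/attempt.
type Candidate = { addr: string };

async function runCycle(
  goodbyeAlts: Candidate[] | undefined, // layer 1, first iteration only
  cachedTopology: Candidate[],          // layer 2, speculative entries
  seeds: Candidate[],                   // layer 3, fresh seed list
  tryAttach: (c: Candidate, timeoutMs: number) => Promise<boolean>,
): Promise<Candidate | null> {
  if (goodbyeAlts) {
    for (const c of goodbyeAlts) if (await tryAttach(c, 10_000)) return c;
  }
  const layerDeadline = Date.now() + 30_000; // bounds a fully-stale cache
  for (const c of cachedTopology) {
    if (Date.now() >= layerDeadline) break; // fall through to fresh seeds
    if (await tryAttach(c, 5_000)) return c;
  }
  for (const c of seeds) if (await tryAttach(c, 10_000)) return c;
  return null; // caller backs off and retries the whole cycle
}
```

The sequential walk keeps the logic simple; the per-layer deadline, not parallelism, is what bounds the worst case.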

Exponential backoff between full cycles: 5s → 10s → 30s → 60s, capped. Retries forever (until success or stop()) — same philosophy as #85's isolation recovery. The transient stays alive and keeps trying; operators detect "stuck retrying" via metrics rather than detect-and-restart.
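The backoff schedule itself is easy to pin down; the values below come from the description above, while the constant name is an assumption:

```typescript
// Backoff schedule sketch: 5s -> 10s -> 30s -> 60s, then capped at 60s
// forever (retries never stop until success or stop()).
const BACKOFF_MS = [5_000, 10_000, 30_000, 60_000];

function backoffDelay(cycle: number): number {
  // cycle is 0-based; clamp at the last entry so the loop retries forever
  return BACKOFF_MS[Math.min(cycle, BACKOFF_MS.length - 1)];
}
```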

Concurrency. Single-flight via the existing reattaching flag. stop()-aware:

  • stopping flag short-circuits triggerReattach
  • stopSignal Promise races against backoff sleep so shutdown wakes the loop within milliseconds rather than at end-of-backoff
  • Identity-aware close/goodbye handlers (if (this.parentConnection !== conn) return) don't clobber a new parent when a stale connection's events fire after a successful reattach
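The stopSignal race in the second bullet can be sketched like this; `makeStopSignal` and `interruptibleSleep` are illustrative names, not the actual tree.ts helpers:

```typescript
// Racing a stop signal against the backoff sleep, so shutdown wakes the
// loop immediately instead of at end-of-backoff.
function makeStopSignal(): { signal: Promise<"stopped">; stop: () => void } {
  let fire!: () => void;
  const signal = new Promise<"stopped">((res) => { fire = () => res("stopped"); });
  return { signal, stop: () => fire() };
}

async function interruptibleSleep(
  ms: number,
  stopSignal: Promise<"stopped">,
): Promise<"slept" | "stopped"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const sleep = new Promise<"slept">((res) => {
    timer = setTimeout(() => res("slept"), ms);
  });
  const winner = await Promise.race([sleep, stopSignal]);
  if (timer !== undefined) clearTimeout(timer); // don't keep the process alive
  return winner;
}
```

The same race is what the "+5" test suite exercises with a 60s sleep.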

Tests (+5)

  • captures welcome.topology as cached alternatives, excluding self
  • parseSeedAddress handles host:port and rejects malformed input
  • close handler on a stale connection does not clobber the active parent (identity check)
  • ungraceful close on the active parent triggers the reattach loop (verified via observable seedProvider call)
  • stop() sets stopping flag and resolves pending sleeps promptly (races stopSignal against a 60s sleep)

Suite results: TS 376/376 (was 371, +5). Go unchanged.

Out of scope (acknowledged)

  • Refreshing lastKnownAlternatives during attach. The cached topology is captured at attach time and never refreshed. Overlaps with #67 (attachment selection: scoring algorithm and periodic reassessment) and #68 (attachment migration: overlap, switch, hysteresis), the Phase 3 attachment-migration work, which is the right place to address dynamic topology awareness. Deferred deliberately.
  • Parallel cached-layer attempts via Promise.race. The sequential walk with fail-fast timeouts and the 30s layer cap is sufficient; parallel attempts would require abort-signal support in connectToSubstrate. Noted as a future optimization if cache staleness becomes a real operational issue.
  • Metrics. Reattach counts and time-to-recover would help operators detect stuck transients. Separate concern; could land alongside #16 (public network dashboard).

Test plan

  • CI green
  • Manual: spin up a substrate + transient locally; SIGKILL the substrate; verify the transient logs reattach attempts and recovers when the substrate restarts
  • Verify the existing graceful-goodbye path still works (covered by existing reattachment on goodbye test)

Closes

Closes #108

// ticktockbent

Closes #108. Pre-fix, a transient TS node lost its substrate parent
permanently if the WebSocket closed ungracefully (TCP drop, crash, NAT
rebind). Only the graceful goodbye path triggered reattachment; the
ungraceful close handler nulled parentConnection, logged a warning
referencing nonexistent "heartbeat-based reattachment," and left the
node operating degraded until restart. On residential networks where
ungraceful disconnects are common, this made the transient role
unreliable in practice.

Design (#108 walked through in PR description):

Unified close handler. Both graceful (goodbye-then-close) and ungraceful
paths now flow through triggerReattach. Graceful supplies the
goodbye-payload alternatives; ungraceful supplies undefined and the
loop falls back to cached topology then the seed list.

Three-layer alternatives lookup, attempted once per cycle:
  1. Goodbye-supplied alternatives (only on first iteration if present).
  2. Cached welcome.topology snapshot from attach time. Per-attempt
     timeout 5s (cached entries are speculative — a substrate that has
     left the cluster fails fast on TCP refused). Total layer deadline
     30s so a fully-stale cache cannot waste minutes before falling
     through to the seed list.
  3. Seed list from setSeedProvider. Per-attempt 10s. Wired in index.ts
     to omega refresher's currentList (public) or REPRAM_PEERS snapshot
     (private), mirroring the seedProvider pattern from #85.
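A minimal sketch of the `host:port` parsing described in layer 3. The validation rules here are assumptions; the actual `parseSeedAddress` (and the `index.ts` split noted in review as inconsistent) may differ:

```typescript
// Hypothetical parseSeedAddress: accepts host:port, rejects malformed
// input. Splits on the LAST colon; bracketed IPv6 is not handled here,
// matching the known inconsistency flagged in review.
function parseSeedAddress(raw: string): { host: string; port: number } | null {
  const i = raw.lastIndexOf(":");
  if (i <= 0 || i === raw.length - 1) return null; // no host or no port
  const host = raw.slice(0, i);
  const port = Number(raw.slice(i + 1));
  if (!Number.isInteger(port) || port < 1 || port > 65535) return null;
  return { host, port };
}
```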

Exponential backoff between full cycles: 5s → 10s → 30s → 60s, capped.
Retries forever (until success or stop()) — same philosophy as #85's
isolation recovery loop. The transient stays alive and keeps trying;
metrics surface "stuck retrying for hours" rather than requiring
operator detect-and-restart.

Single-flight via existing reattaching flag. stop()-aware:
  - stopping flag short-circuits triggerReattach
  - stopSignal Promise races against backoff sleep so shutdown wakes
    the loop within milliseconds rather than at end-of-backoff
  - Identity-aware close/goodbye handlers don't clobber a new parent
    when a stale connection's events fire after a successful reattach

Tests (repram-mcp/src/node/tree.test.ts, +5):
  - captures welcome.topology as cached alternatives (excludes self)
  - parseSeedAddress handles host:port and rejects malformed input
  - close handler on stale conn doesn't clobber active parent (identity)
  - ungraceful close on active parent triggers the reattach loop
    (verified via observable seedProvider call)
  - stop() sets stopping flag and resolves pending sleeps promptly

Suite results:
  TS: 376/376 (was 371, +5 new)
  Go: unchanged (no Go side touched)

Out of scope (acknowledged in PR body):
  - Refreshing lastKnownAlternatives during attach. Overlaps with
    #67/#68 (Phase 3 attachment migration); deferred.
  - Parallel cached-layer attempts via Promise.race. Sequential with
    fail-fast timeout + layer-total cap is sufficient.
  - Metrics for reattach counts / time-to-recover.

Closes #108

// ticktockbent

…ck comment

Addresses suggestions 1 and 2 from PR #110 cold review.

Suggestion 2 — silent-termination race in tryAlternatives:
  Sequence: attach(connB) succeeds (welcome received, parentConnection
  set to connB, close handler installed). Before tryAlternatives can
  call onReattachCallback and return true, connB drops. Its close
  handler fires → nulls parentConnection → calls triggerReattach,
  which sees `reattaching=true` (we still hold the flag) and returns.
  The outer loop then sees `welcome` non-null and returns true, marking
  the cycle a success and exiting. Node is now parentless with no
  further reattach scheduled.

  Fix: after attach() returns welcome, check `parentConnection ===
  conn && !conn.isClosed` before declaring success. If the conn died,
  log and continue to the next alternative. This keeps the outer loop
  iterating so a fresh cycle eventually runs and reattaches.
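The liveness check described in the fix can be captured as a small predicate; the conn shape and names are assumptions, not the actual tryAlternatives code:

```typescript
// After attach() returns welcome, confirm the conn is still the live
// parent before declaring the cycle a success. If it dropped in the
// window, the caller continues to the next alternative instead of
// exiting parentless.
interface Conn { isClosed: boolean }

function confirmAttachSuccess(
  parentConnection: Conn | null,
  conn: Conn,
): boolean {
  return parentConnection === conn && !conn.isClosed;
}
```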

Suggestion 1 — comment clarity on the close-handler identity check:
  The reviewer noted the `!== conn` guard handles two cases, only one
  of which the original comment described. Updated to explicitly cover
  both: (1) stale handler from a previously-replaced conn, and (2)
  post-goodbye close where parentConnection was already nulled by the
  goodbye handler. The single-flight flag would have blocked the
  duplicate triggerReattach in case (2) anyway, but the early return
  avoids the spurious entry.

Skipped:
  - Suggestion 3 (CACHED_LAYER_DEADLINE_MS comment): cosmetic.
  - Suggestion 4 (parseSeedAddress vs index.ts split-on-first-colon
    inconsistency): pre-existing scope creep.

// ticktockbent
@TickTockBent
Owner Author

Update — review fixes pushed (`0bbac84`)

  • Suggestion 2 (real race fix): `tryAlternatives` now checks `parentConnection === conn && !conn.isClosed` after `attach()` returns welcome but before declaring success. Pre-fix, a connB that dropped between welcome receipt and the outer loop's success path would silently terminate the cycle (close handler tries `triggerReattach` but is blocked by the single-flight flag we still hold). Post-fix, that case is detected and the loop continues to the next alternative. Hard to test deterministically without injecting a closed conn mid-attach; the fix is small and the logic is straightforward to inspect.

  • Suggestion 1 (comment): updated the close-handler comment to explicitly cover both cases the `!== conn` guard catches: stale handler from a previously-replaced conn, AND post-goodbye close where parentConnection was already nulled by the goodbye handler.

  • Skipped:

    • Suggestion 3 (`CACHED_LAYER_DEADLINE_MS + CACHED_ALT_CONNECT_TIMEOUT_MS` worst-case comment) — cosmetic.
    • Suggestion 4 (`parseSeedAddress` vs `index.ts` IPv6-handling inconsistency) — pre-existing, scope creep here.
    • Nits — comment/test assertions, layer-2 deadline test, port validation tweak. Worth tracking if anyone wants them but not in this PR.

Suite results: TS 376/376 (no test count change; the race-guard fix is a behavioral guard, not directly tested without injection machinery).

@TickTockBent TickTockBent merged commit e4c050a into main Apr 27, 2026
4 checks passed
@TickTockBent TickTockBent deleted the fix/108-ts-ungraceful-reattach branch April 27, 2026 18:30


Development

Successfully merging this pull request may close these issues.

TS transient nodes don't reattach after ungraceful parent disconnect (TreeManager gap)
