fix: peer backoff reconnect to prevent restart TLC stuck#1166

Closed
jjyr wants to merge 10 commits into nervosnetwork:develop from jjyr:pr/ring-restart-clean

@jjyr
Collaborator

@jjyr jjyr commented Mar 3, 2026

Blocked by #1111

Scope

This PR is based on repro-restart-upstream-develop and only includes commits after that base branch.

Root Cause of TLC Stuck

The primary issue is a reconnect gap after peer disconnect/restart:

  • Channel actors are stopped on disconnect.
  • Reestablishment can only start after peer reconnect + Init exchange.
  • Previously, reconnect was not deterministic in key paths (especially disconnect and startup dial failure windows).
  • During this missing-actor window, TLC replay/remove flows can be delayed or blocked, and waiting_ack can remain gated for a long time, which manifests as stuck TLCs.

In short: the first failure point is reconnect not becoming ready in time, not the core TLC state machine itself.

Fix Implemented

  1. Added deterministic peer reconnect with exponential backoff in NetworkActor.
  2. Backoff is seeded when:
     • peer disconnect is observed, or
     • DialerError happens.
  3. Backoff is only enabled when the peer still has direct active channels.
  4. Guardrails:
     • skip reconnect for user-requested disconnect,
     • skip reconnect when there is no direct active channel.
  5. Backoff state is cleared after successful peer reconnect.
  6. Kept minimal debug events for reconnect lifecycle observability.
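
The rules above can be sketched as a small state holder. This is a minimal illustration, not the code in network.rs: the `ReconnectBackoff` type, its method names, the `String` stand-in for `PeerId`, and the 1s base / 60s cap are all assumptions made for the example. Only the two field names are taken from the review thread.

```rust
use std::collections::{HashMap, HashSet};
use std::time::Duration;

// Hypothetical values; the PR description does not state the real ones.
const BASE_DELAY: Duration = Duration::from_secs(1);
const MAX_DELAY: Duration = Duration::from_secs(60);

/// Stand-in for the real peer identifier type.
pub type PeerId = String;

#[derive(Default)]
pub struct ReconnectBackoff {
    // Field names mirror the two fields quoted in the review comment.
    pub peer_reconnect_backoff_attempts: HashMap<PeerId, u32>,
    pub requested_disconnect_peers: HashSet<PeerId>,
}

impl ReconnectBackoff {
    /// Record that the user asked to disconnect from this peer.
    pub fn mark_requested_disconnect(&mut self, peer: &PeerId) {
        self.requested_disconnect_peers.insert(peer.clone());
    }

    /// Seed or advance the backoff after a disconnect or dial error.
    /// Returns the delay before the next dial, or None when a guardrail
    /// says no reconnect should be scheduled.
    pub fn on_disconnect_or_dial_error(
        &mut self,
        peer: &PeerId,
        has_direct_active_channel: bool,
    ) -> Option<Duration> {
        if self.requested_disconnect_peers.contains(peer) || !has_direct_active_channel {
            self.peer_reconnect_backoff_attempts.remove(peer);
            return None;
        }
        let attempts = self
            .peer_reconnect_backoff_attempts
            .entry(peer.clone())
            .or_insert(0);
        let delay = backoff_delay(*attempts);
        *attempts = attempts.saturating_add(1);
        Some(delay)
    }

    /// Clear backoff state once the peer is connected again.
    /// Dropping the requested-disconnect marker here is an assumption.
    pub fn on_connected(&mut self, peer: &PeerId) {
        self.peer_reconnect_backoff_attempts.remove(peer);
        self.requested_disconnect_peers.remove(peer);
    }
}

/// Exponential delay: BASE_DELAY * 2^attempt, capped at MAX_DELAY.
pub fn backoff_delay(attempt: u32) -> Duration {
    // Clamp the shift so the multiplier cannot overflow a u32.
    let factor = 1u32 << attempt.min(16);
    BASE_DELAY.saturating_mul(factor).min(MAX_DELAY)
}
```

With these assumed constants, repeated failures for one peer yield delays of 1s, 2s, 4s and so on up to the 60s cap, and a successful reconnect resets the sequence to 1s.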

**Update 3.11**

A deferred RemoveTlcFulfill could lose its preimage from persistent storage during reestablishment. That broke force-close settlement and caused the E2E balance mismatch. The backoff reconnect mechanism exposed this situation by accident. We fixed it by storing the preimage before deferring the fulfill removal, and added a regression test for it.
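
The ordering that matters in this fix (persist first, defer second) might look roughly like the sketch below. Every name here (`PreimageStore`, `ChannelSketch`, `remove_tlc_fulfill`, the `reestablishing` flag) is hypothetical and chosen for illustration; the real channel actor and store API in fiber-lib are not shown in this PR description.

```rust
use std::collections::{HashMap, VecDeque};

pub type PaymentHash = [u8; 32];
pub type Preimage = [u8; 32];

/// Hypothetical stand-in for the persistent store.
#[derive(Default)]
pub struct PreimageStore {
    preimages: HashMap<PaymentHash, Preimage>,
}

impl PreimageStore {
    pub fn insert_preimage(&mut self, hash: PaymentHash, preimage: Preimage) {
        self.preimages.insert(hash, preimage);
    }

    pub fn get_preimage(&self, hash: &PaymentHash) -> Option<Preimage> {
        self.preimages.get(hash).copied()
    }
}

/// Hypothetical, heavily reduced channel state.
#[derive(Default)]
pub struct ChannelSketch {
    /// True while the channel is still reestablishing with its peer.
    pub reestablishing: bool,
    /// Fulfill removals postponed until reestablishment completes: (tlc_id, payment_hash).
    pub deferred_fulfills: VecDeque<(u64, PaymentHash)>,
}

impl ChannelSketch {
    /// Returns true if the fulfill can be applied now, false if it was deferred.
    pub fn remove_tlc_fulfill(
        &mut self,
        store: &mut PreimageStore,
        tlc_id: u64,
        hash: PaymentHash,
        preimage: Preimage,
    ) -> bool {
        // Persist the preimage before any early return, so a deferred
        // removal can never leave storage without it.
        store.insert_preimage(hash, preimage);

        if self.reestablishing {
            self.deferred_fulfills.push_back((tlc_id, hash));
            return false;
        }
        true
    }
}
```

Because the store write happens before the deferral check, a force close that occurs while the removal is still queued can still look up the preimage.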

Test Improvements

  • Added targeted reconnect behavior tests:
     • test_peer_disconnect_with_active_channel_enters_backoff_reconnect
     • test_startup_dial_error_with_active_channel_enters_backoff_reconnect
     • test_peer_disconnect_without_active_channel_skips_backoff_reconnect
  • Updated the ring restart repro test to replace fixed sleep with state-based wait:
     • proceed immediately once conditions are met,
     • keep a 120s timeout ceiling for diagnostics.
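
A state-based wait of this kind can be sketched with a polling helper. The `wait_until` function below is a hypothetical, synchronous stand-in written against the standard library only; the actual test most likely runs on an async runtime and its helper is not shown here.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Poll `condition` until it returns true or `timeout` elapses.
/// Ok carries the time spent waiting; Err carries the elapsed time at
/// the point the ceiling was hit, which is useful in diagnostics.
pub fn wait_until<F: FnMut() -> bool>(
    mut condition: F,
    timeout: Duration,
    poll_interval: Duration,
) -> Result<Duration, Duration> {
    let start = Instant::now();
    loop {
        if condition() {
            // Proceed immediately once the condition holds.
            return Ok(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return Err(start.elapsed());
        }
        thread::sleep(poll_interval);
    }
}
```

A test would call it with a 120s ceiling, for example `wait_until(|| all_channels_ready(), Duration::from_secs(120), Duration::from_millis(200))`, where `all_channels_ready` is a placeholder for whatever state check the test needs.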

Validation

Verified with:

  • cargo check -p fnn --tests --quiet
  • the 3 reconnect tests above
  • test_ring_self_payments_then_restart_two_nodes (--run-ignored ignored-only)

GitHub Diff Link (base vs PR branch)
jjyr/fiber@repro-restart-upstream-develop...pr/ring-restart-clean

@jjyr jjyr force-pushed the pr/ring-restart-clean branch from a4a25e2 to db3c067 Compare March 4, 2026 07:08
@quake quake requested a review from Copilot March 4, 2026 07:14
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to prevent “TLC stuck” during peer disconnect/restart by making reconnect deterministic (exponential backoff) and by improving reestablish replay determinism via persisted CommitDiff.

Changes:

  • Add peer reconnect backoff scheduling/guardrails in NetworkActor (seeded on disconnect and dial errors).
  • Persist and replay pending commitment state (CommitDiff) during channel reestablish, including deterministic replay ordering and deferred peer TLC update handling.
  • Expand test coverage with targeted reconnect and commit-diff tests, plus additional restart/reestablish scenarios.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| crates/fiber-lib/src/fiber/network.rs | Implements reconnect-backoff state machine and triggers on disconnect/dial errors. |
| crates/fiber-lib/src/fiber/channel.rs | Persists CommitDiff, replays it on reestablish, and adds deferred peer TLC update handling. |
| crates/fiber-lib/src/store/store_impl/mod.rs | Adds KV persistence for pending CommitDiff. |
| crates/fiber-lib/src/store/schema.rs | Introduces DB prefix for PendingCommitDiff. |
| crates/fiber-lib/src/fiber/tests/channel_commit_diff.rs | New unit tests for CommitDiff validation/ordering helpers. |
| crates/fiber-lib/src/fiber/tests/channel.rs | Adds reconnect-backoff tests and multiple restart/reestablish scenarios (incl. ignored ring repro). |
| crates/fiber-lib/src/fiber/tests/network.rs | Adds test ensuring reconnect-backoff is skipped without direct active channels. |
| crates/fiber-lib/src/store/tests/store.rs | Updates fixtures for newly added ChannelActorState fields. |
| crates/fiber-lib/src/store/sample/sample_channel.rs | Updates sample ChannelActorState builders for new fields. |
| crates/fiber-lib/src/fiber/tests/settle_tlc_set_command_tests.rs | Updates mock store + state builders for new CommitDiff/state fields. |
| crates/fiber-lib/src/fiber/tests/mod.rs | Registers new channel_commit_diff test module. |

@jjyr jjyr force-pushed the pr/ring-restart-clean branch 2 times, most recently from d6ae16c to 2b9579c Compare March 6, 2026 09:31
@jjyr jjyr force-pushed the pr/ring-restart-clean branch from 1dd0756 to 10ab5ae Compare March 11, 2026 01:49
@quake quake added this to the v0.8 milestone Mar 16, 2026
Comment on lines +3017 to +3018
peer_reconnect_backoff_attempts: HashMap<PeerId, u32>,
requested_disconnect_peers: HashSet<PeerId>,
Member

These two fields can be changed to use Pubkey; that may simplify the pubkey <=> PeerId conversion.

Collaborator Author

Fixed in #1217

@jjyr
Collaborator Author

jjyr commented Mar 18, 2026

Replaced by #1217

@jjyr jjyr closed this Mar 18, 2026