fix: close stale relay connection on ErrConnAlreadyExists to recover … #5866
Draft: fpenezic wants to merge 5 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery
Commits
46a5a7d fix: close stale relay connection on ErrConnAlreadyExists to recover … (fpenezic)
0f1fc4f fix: route CloseConnByPeerKey to correct relay client for foreign ser… (fpenezic)
159183b docs: explain Guard invariant for ErrConnAlreadyExists branch (fpenezic)
691331a fix: gate ErrConnAlreadyExists close-and-retry behind stale flag (fpenezic)
c0b3e4e fix: rotate ICE session ID on WG timeout in ICE mode (fpenezic)
Conversations
coderabbitai (bot) commented:
`ErrConnAlreadyExists` does not prove a relay connection is stale. This error only indicates that an entry exists in the relay client's connection map; it carries no information about connection health or staleness. With the unconditional `CloseConnByPeerKey` call, a healthy relay connection from a prior handshake can be forcibly torn down when a new offer arrives and encounters the same peer ID already in the map. Add a disconnected/stale state check before closing, or ensure `OnNewOffer` cannot run while a live relay connection for this peer exists.
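For readers skimming the diff, the branch under discussion looks roughly like this (a sketch: only `OpenConn`, `CloseConnByPeerKey`, and `ErrConnAlreadyExists` are names from this thread; the `relayOpener` stand-in and everything around it are illustrative):

```go
package relaysketch

import (
	"context"
	"errors"
	"log"
	"net"
)

// ErrConnAlreadyExists mirrors the sentinel error discussed above; the
// real one lives in netbird's relay client package.
var ErrConnAlreadyExists = errors.New("connection already exists")

// relayOpener is an illustrative stand-in for the relay manager surface.
type relayOpener interface {
	OpenConn(ctx context.Context, peerKey string) (net.Conn, error)
	CloseConnByPeerKey(peerKey string) error
}

// openWithRetry shows the unconditional close-and-retry branch this
// review flags: on ErrConnAlreadyExists, tear the existing entry down
// and reopen, even if the existing conn is still healthy.
func openWithRetry(ctx context.Context, m relayOpener, peerKey string) (net.Conn, error) {
	conn, err := m.OpenConn(ctx, peerKey)
	if !errors.Is(err, ErrConnAlreadyExists) {
		return conn, err
	}
	if cerr := m.CloseConnByPeerKey(peerKey); cerr != nil {
		log.Printf("close existing relay conn: %v", cerr)
	}
	return m.OpenConn(ctx, peerKey) // retry against a clean map entry
}
```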
fpenezic replied:
Thanks for flagging. The behavior is intentional - reasoning:

`WorkerRelay.OnNewOffer` is only invoked when the remote peer sends a fresh offer through the Guard component. Guard only sends offers after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). If we're receiving a new offer, the remote peer already believes the existing connection is broken. Given that:

- The remote peer has effectively invalidated the connection by initiating a re-handshake. Reusing our side's stale conn entry (the original bug) leaves WG packets going into a dead pipe. Tearing it down and opening a fresh one matches the remote peer's state.
- Hitting `ErrConnAlreadyExists` after an offer is the pathological case we're fixing. Under normal operation, `OpenConn` doesn't collide - each offer/answer cycle happens on a fresh session. The only time we hit `ErrConnAlreadyExists` is when the map wasn't cleaned up (e.g. a remote NAT IP change without a relay-level close event reaching us).
- Even in the theoretical race (both peers send offers simultaneously and one has just established a healthy conn that we then close), the cost is one extra round trip: the next offer will re-establish. Compare that to the current failure mode: indefinite tunnel death until manual restart.

A liveness check on the relay conn isn't straightforward - the connection is a proxied UDP pipe with no heartbeat of its own. The authoritative liveness signal is the WG handshake, which `WGWatcher` already owns. Adding a separate liveness probe on the relay conn would duplicate that logic.

If the concern is the simultaneous-offer race, a cleaner guard would be per-peer serialization of `OnNewOffer` - but that's orthogonal, and the current race-recovery cost (one extra offer cycle) seems acceptable vs. the bug being fixed.

Can add a code comment explaining this if that would help future readers.
coderabbitai (bot) replied:
@fpenezic, thank you for the thorough explanation — the reasoning is solid and I withdraw the concern.

The key invariant I missed is that `OnNewOffer` is gated by Guard's disconnect detection, so by the time a new offer arrives with a colliding peer ID, the existing relay conn is stale by contract, not just by inference. The simultaneous-offer edge case has a bounded, self-healing cost (one extra offer cycle), which is clearly preferable to indefinite tunnel death.

A brief inline comment explaining the Guard invariant would definitely help future readers avoid the same confusion — something like the sketch below.
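A possible wording (the snippet originally attached here wasn't preserved, so this is a reconstruction from the reasoning above):

```go
// Guard invariant: OnNewOffer only runs for offers that Guard emits
// after it has detected a disconnect (WG handshake timeout, ICE
// disconnect, ...). By the time an offer collides with an existing
// entry in the relay conn map (ErrConnAlreadyExists), that conn is
// stale by contract - the remote peer has already abandoned it by
// re-offering - so closing it and reopening is safe. Worst case, a
// simultaneous-offer race costs one extra offer cycle.
```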
fpenezic replied:
Reopening this - my "Guard invariant" argument was wrong. `OnNewOffer` fires on normal offer/answer exchanges too, not only after disconnect detection. Extended testing showed an infinite reconnect loop (3600+ iterations/hour) in multi-peer scenarios where your original concern manifested exactly as you described - healthy connections being torn down.

Converting to draft while I add proper staleness detection (cooldown or liveness check). Thanks for catching this early - the "bounded, self-healing cost" claim I made was also wrong; the cost compounds into a persistent loop.
coderabbitai (bot) replied:
@fpenezic, thanks for following up — glad you caught this before it landed. Removing the incorrect learning now.

For staleness detection, a few concrete approaches to consider:

Option A — Explicit stale flag (cleanest)
Add an `atomic.Bool relayConnStale` field to `WorkerRelay`. WGWatcher/Guard sets it `true` when it detects a disconnect, before queuing the new offer. `OnNewOffer` checks the flag and only calls `CloseConnByPeerKey` when it is set, then clears it. This makes the "stale by contract" invariant explicit in code rather than assumed.

Option B — Health probe on the existing conn
Before closing, attempt a zero-byte write or read with a short deadline on the existing `relayedConn`. If it fails immediately, it's stale. Adds a small RTT delay on the unhappy path but avoids tearing down healthy connections.

Option C — Session generation counter
Increment a generation counter each time Guard detects a disconnect and pass it through the offer. `OnNewOffer` closes the existing conn only when the generation in the offer is newer than the one used to open the current conn. Requires a protocol change but is the most robust.

Option A is probably the least invasive — it keeps all the new plumbing inside `WorkerRelay`/`Conn` without touching the relay client or the wire protocol.
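A minimal sketch of Option A, using the names from this thread (`WorkerRelay`, `relayConnStale`, `OnNewOffer`, `CloseConnByPeerKey`); the surrounding wiring is illustrative, not the actual netbird code:

```go
package relaysketch

import "sync/atomic"

// WorkerRelay is reduced to the one field this sketch needs.
type WorkerRelay struct {
	relayConnStale atomic.Bool // set by WGWatcher/Guard on disconnect
}

// markRelayConnStale is the hook WGWatcher/Guard would call when it
// detects a disconnect, before queuing the new offer.
func (w *WorkerRelay) markRelayConnStale() {
	w.relayConnStale.Store(true)
}

// shouldCloseExisting gates the close-and-retry path in OnNewOffer:
// CompareAndSwap both tests and clears the flag in one step, so a
// healthy conn (flag unset) is never torn down.
func (w *WorkerRelay) shouldCloseExisting() bool {
	return w.relayConnStale.CompareAndSwap(true, false)
}
```

Usage inside `OnNewOffer` would be: call `CloseConnByPeerKey` only when `shouldCloseExisting()` returns true, then proceed with `OpenConn` as before.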
fpenezic replied:
Thanks for the concrete options - went through all three.

Option A (explicit stale flag) - cleanest surface, but has a gap: if the remote peer experiences a NAT IP change and re-sends an offer before our side detects the disconnect locally (e.g. our WG handshake watcher hasn't timed out yet), the flag will be false on our end and we'd skip the close - leaving us with the original bug. The flag correctly covers the case where we initiated the recovery (`onWGDisconnected` with `priority == Relay`), but not the case where the remote side's state changed and ours is lagging behind.

Option B (health probe) - I don't think this works for our setup. The relay conn is a proxied UDP pipe; a zero-byte read blocks until a packet arrives, and a write to the local socket will always succeed regardless of whether the relay server still has a live pipe to the remote peer. There's no application-layer liveness signal on the relay conn itself - the authoritative signal is the WG handshake, which is what the original bug is about. Happy to be corrected if I'm missing a cheap probe.

Option C (generation counter) - agreed this is the most robust, but it's a wire-protocol change that would need coordination across signal/management and older clients. Too broad for this PR.

Proposal - hybrid A + per-peer serialization:

- Add a `relayConnStale atomic.Bool` on `WorkerRelay`, set in two places: `Conn.onWGDisconnected` when `currentConnPriority == Relay` (we detected the disconnect), and the explicit `CloseConn()` path.
- `OnNewOffer` only takes the close-and-retry path when the flag is set, then clears it.
- A per-peer mutex around `OnNewOffer` so rapid back-to-back offers from the remote can't overlap with an in-flight close/reopen - the second offer waits for the first to finish, sees the fresh conn, and `OpenConn` returns an existing-but-healthy conn which we can reuse.
- A short (5s) cooldown before the flag can be set `true` again after a successful reopen.

This trades off completeness (we no longer auto-recover from a remote-initiated state change we haven't noticed yet) for correctness (we stop tearing down healthy connections). The original bug was narrower than I framed it - it's specifically about our side holding a stale entry after the remote went through a transport-layer change we should have observed but didn't. The flag captures exactly that window.

Does this match what you had in mind for Option A, or were you thinking of a different signal source for setting the flag?
coderabbitai (bot) replied:
@fpenezic, the analysis is sharp and the proposal is sound. A few thoughts:

On the gap in Option A: you're right that it's real, but it's acceptable. In the remote-initiated NAT change where our WGWatcher hasn't fired yet, the worst case is now "delayed recovery (WGWatcher timeout, typically ~3 min)" instead of "indefinite tunnel death." WGWatcher will eventually fire, set the flag, and the next offer cycle recovers. That's a meaningful improvement over the status quo.

Additional stale-flag signal: `onRelayClientDisconnected` → `onRelayDisconnected` is another natural place to set the flag. When the relay server does send a close event (the happy path), that definitively marks the conn as stale — no WGWatcher timeout needed. This closes part of the gap without adding complexity:
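(The snippet originally attached here wasn't preserved; a sketch of the idea, extending the `WorkerRelay` sketch above — `onRelayClientDisconnected` and the `go w.conn.onRelayDisconnected()` dispatch are names from this thread, the rest is illustrative:)

```go
// Relay server sent a close event for this peer: the conn is stale by
// definition, no WGWatcher timeout required.
func (w *WorkerRelay) onRelayClientDisconnected() {
	w.relayConnStale.Store(true)
	go w.conn.onRelayDisconnected() // existing dispatch, per the thread
}
```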
This way the flag is set by three events: our WGWatcher fires, the relay server sends a close, or `CloseConn()` is called directly.

On the cooldown: if the flag is always cleared atomically inside `OnNewOffer` immediately after a successful reopen, there's no stuck-flag scenario — the only way it stays `true` is if `OpenConn` fails, in which case the conn didn't get re-established anyway. I'd skip the 5s cooldown and rely on compareAndSwap-style clear-on-success instead.

On the per-peer mutex: worth checking whether `OnNewOffer` is already serialized by its caller. If the signal handler dispatches offers for a given peer on a single goroutine, the mutex is redundant. If offers from different signal messages can race (e.g., separate goroutines), the mutex is necessary. Do you know the call chain? If it's already serialized, you can drop step 3 entirely and simplify.

Overall the hybrid A approach is the right call — it's correct (no healthy-conn teardown) and recovers from the common case (we detect the disconnect first) without the protocol complexity of Option C.
fpenezic replied:
Verified the call chain - `AsyncOfferListener` already serializes per-peer (one goroutine at a time; newer offers during processing just update `latest` and run after the current one completes). So step 3 is redundant as you suspected. Dropping the mutex.

Final plan (sketched below):

- `relayConnStale atomic.Bool` on `WorkerRelay`, set by three events:
  - `Conn.onWGDisconnected` (`priority == Relay` branch) - local WG watcher fired
  - `onRelayClientDisconnected` - relay server sent a close event
  - `WorkerRelay.CloseConn()` - any explicit teardown
- `OnNewOffer`: close-and-retry only when the flag is set, clear on successful reopen (compareAndSwap)

The tradeoff: a remote-initiated NAT change where our WG watcher hasn't fired yet means delayed recovery (~WG handshake timeout) instead of tunnel death. Net improvement over current master. Pushing shortly.
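A compact sketch of that plan (`relayConnStale`, `OnNewOffer`, `OpenConn`, and `CloseConnByPeerKey` are names from the thread; the struct fields and the `relayDialer` stand-in are illustrative):

```go
package relaysketch

import (
	"context"
	"log"
	"net"
	"sync/atomic"
)

type relayDialer interface {
	OpenConn(ctx context.Context, peerKey string) (net.Conn, error)
	CloseConnByPeerKey(peerKey string) error
}

type WorkerRelay struct {
	ctx            context.Context
	peerKey        string
	relay          relayDialer
	relayConnStale atomic.Bool // Store(true) by the three signals above
}

func (w *WorkerRelay) OnNewOffer() {
	if w.relayConnStale.Load() {
		// A disconnect was positively observed (WG watcher fired, relay
		// sent a close event, or explicit CloseConn): drop the stale
		// map entry so OpenConn gets a clean slot.
		if err := w.relay.CloseConnByPeerKey(w.peerKey); err != nil {
			log.Printf("close stale relay conn: %v", err)
		}
	}
	conn, err := w.relay.OpenConn(w.ctx, w.peerKey)
	if err != nil {
		// Leave the flag untouched: if it was set, the next offer
		// retries the close-and-reopen path.
		log.Printf("open relay conn: %v", err)
		return
	}
	// Clear only after a successful reopen. A plain Store is race-free
	// here because AsyncOfferListener serializes OnNewOffer per peer
	// and the other writers only ever Store(true).
	w.relayConnStale.Store(false)
	_ = conn // hand off to the WG proxy / conn-ready callback
}
```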
coderabbitai (bot) replied:
@fpenezic, the plan is solid. A few notes for the implementation:

On the CAS: since `AsyncOfferListener` already serializes `OnNewOffer` per peer, there's no concurrent accessor on `relayConnStale` from that direction — a plain `Load()`/`Store(false)` is race-free and equivalent to CAS here. The only concurrent writers are `onRelayClientDisconnected` (goroutine) and `CloseConn()` (called from various places), but those are write-only (always `Store(true)`), so a torn read isn't possible. No need for the overhead of a CAS loop.

One edge case to double-check: `onRelayClientDisconnected` spawns `go w.conn.onRelayDisconnected()` — so the flag set and the offer dispatch are on separate goroutines. Since `AsyncOfferListener` already handles the "offer arrives before flag is set" race by queuing and re-running, this shouldn't be an issue, but worth a quick trace to confirm the ordering is guaranteed on that path.

`CloseConn()` as a signal source: note that `CloseConn()` is also called inside `onRelayConnectionIsReady` if `AddCloseListener` fails. That path will correctly mark the conn stale before returning, which is the right behavior.

Overall the plan is correct and the tradeoff (delayed recovery on a remote-side NAT change we haven't observed yet, vs. an infinite reconnect loop) is clearly the right call. Looking forward to the push.