fix: close stale relay connection on ErrConnAlreadyExists to recover … #5866
fpenezic wants to merge 5 commits into netbirdio:main from
Conversation
📝 Walkthrough
Adds WorkerICE.ResetSessionID(), calls it when a relay-backed WireGuard connection disconnects, and introduces CloseConnByPeerKey(...) in the relay client/manager plus eviction-and-retry logic when OpenConn returns ErrConnAlreadyExists.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Conn
    participant WorkerRelay
    participant RelayClient
    participant WorkerICE
    Conn->>WorkerRelay: onWGDisconnected (currentConnPriority == Relay)
    WorkerRelay->>RelayClient: request CloseConnByPeerKey(peerKey)
    RelayClient-->>WorkerRelay: removed/closed ack
    WorkerRelay->>WorkerICE: ResetSessionID()
    WorkerICE-->>WorkerRelay: sessionID refreshed, remoteSessionID cleared
    WorkerRelay->>Conn: continue disconnect handling
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.
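A rough sketch of the manager-side change this prompt describes (struct and field names such as relayClientsMutex are inferred from the prompt, not taken from the actual code):

```go
// CloseConnByPeerKey closes the tracked conn entry for peerKey, preferring the
// relay client that serves the given server address when one is tracked.
func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) {
	m.relayClientsMutex.Lock()
	entry, ok := m.relayClients[serverAddress]
	m.relayClientsMutex.Unlock()

	if ok && entry.relayClient != nil {
		entry.relayClient.CloseConnByPeerKey(peerKey)
		return
	}

	// Fall back to the default relay client when the server address is not
	// tracked separately, preserving the existing nil check.
	if m.relayClient != nil {
		m.relayClient.CloseConnByPeerKey(peerKey)
	}
}
```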
📒 Files selected for processing (5)
- client/internal/peer/conn.go
- client/internal/peer/worker_ice.go
- client/internal/peer/worker_relay.go
- shared/relay/client/client.go
- shared/relay/client/manager.go
fpenezic force-pushed from a9783e0 to 7d55ac2
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/peer/worker_relay.go`:
- Around line 66-75: The code currently treats relayClient.ErrConnAlreadyExists
as proof the existing relay is stale and unconditionally calls
w.relayManager.CloseConnByPeerKey; instead, check the existing connection's
health/state before closing or prevent concurrent OnNewOffer for the same peer.
Modify the branch handling relayClient.ErrConnAlreadyExists to first query the
existing connection via w.relayManager (e.g., a method like GetConnByPeerKey or
HasActiveConn) and only call w.relayManager.CloseConnByPeerKey when that
connection reports closed/stale (or fail a liveness check); otherwise reuse the
active connection or bail out/serialize the new offer (e.g., with a per-peer
lock) rather than tearing down a healthy relay, keeping the subsequent OpenConn
call only when a close was actually performed.
📒 Files selected for processing (5)
- client/internal/peer/conn.go
- client/internal/peer/worker_ice.go
- client/internal/peer/worker_relay.go
- shared/relay/client/client.go
- shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (3)
- client/internal/peer/conn.go
- shared/relay/client/manager.go
- shared/relay/client/client.go
```diff
 	if errors.Is(err, relayClient.ErrConnAlreadyExists) {
-		w.log.Debugf("handled offer by reusing existing relay connection")
+		w.log.Infof("relay conn already exists, closing stale conn and retrying")
+		w.relayManager.CloseConnByPeerKey(srv, w.config.Key)
+		relayedConn, err = w.relayManager.OpenConn(w.peerCtx, srv, w.config.Key)
+		if err != nil {
+			w.log.Errorf("failed to reopen connection via Relay after closing stale: %s", err)
+			return
+		}
 	} else {
 		w.log.Errorf("failed to open connection via Relay: %s", err)
```
🧩 Analysis chain
(Collapsed: a series of fd/rg repository searches for ErrConnAlreadyExists, OpenConn, CloseConnByPeerKey, OnNewOffer, onRelayConnectionIsReady, connContainer, and related call sites in client/internal/peer and shared/relay/client; raw script outputs omitted.)
ErrConnAlreadyExists does not prove a relay connection is stale.
This error only indicates an entry exists in the relay client's connection map; it carries no information about connection health or staleness. With the unconditional CloseConnByPeerKey call, a healthy relay connection from a prior handshake can be forcibly torn down when a new offer arrives and encounters the same peerID already in the map. Add a disconnected/stale state check before closing, or ensure OnNewOffer cannot run while a live relay connection for this peer exists.
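For reference, the check behind this error is roughly a map-key collision with no liveness information attached; the snippet below is illustrative only (the connection-map field names are assumptions, not the actual client internals):

```go
// Inside the relay client's OpenConn path (illustrative sketch):
c.mu.Lock()
defer c.mu.Unlock()
if _, ok := c.conns[peerID]; ok {
	// The entry may belong to a live pipe or to a dead one left behind by a
	// missed close event; the error does not distinguish the two.
	return nil, ErrConnAlreadyExists
}
```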
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@client/internal/peer/worker_relay.go` around lines 66 - 75, The code
currently treats relayClient.ErrConnAlreadyExists as proof the existing relay is
stale and unconditionally calls w.relayManager.CloseConnByPeerKey; instead,
check the existing connection's health/state before closing or prevent
concurrent OnNewOffer for the same peer. Modify the branch handling
relayClient.ErrConnAlreadyExists to first query the existing connection via
w.relayManager (e.g., a method like GetConnByPeerKey or HasActiveConn) and only
call w.relayManager.CloseConnByPeerKey when that connection reports closed/stale
(or fail a liveness check); otherwise reuse the active connection or bail
out/serialize the new offer (e.g., with a per-peer lock) rather than tearing
down a healthy relay, keeping the subsequent OpenConn call only when a close was
actually performed.
fpenezic replied:
Thanks for flagging. The behavior is intentional - reasoning:
WorkerRelay.OnNewOffer is only invoked when the remote peer sends a fresh offer through the Guard component. Guard only sends offers after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). If we're receiving a new offer, the remote peer already believes the existing connection is broken.
Given that:

- Remote peer has effectively invalidated the connection by initiating re-handshake. Reusing our side's stale conn entry (the original bug) leaves WG packets going into a dead pipe. Tearing it down and opening a fresh one matches the remote peer's state.
- `ErrConnAlreadyExists` after an offer is the pathological case we're fixing. Under normal operation, `OpenConn` doesn't collide - each offer/answer cycle happens on a fresh session. The only time we hit `ErrConnAlreadyExists` is when the map wasn't cleaned up (e.g. remote NAT IP change without a relay-level close event reaching us).
- Even in the theoretical race (both peers send offers simultaneously and one just established a healthy conn that we then close), the cost is one extra round-trip: the next offer will re-establish. Compare to the current failure mode: indefinite tunnel death until manual restart.
- A liveness check on the relay conn isn't straightforward - the connection is a proxied UDP pipe with no heartbeat of its own. The authoritative liveness signal is the WG handshake, which `WGWatcher` already owns. Adding a separate liveness probe on the relay conn would duplicate that logic.
If the concern is the simultaneous-offer race, a cleaner guard would be per-peer serialization of OnNewOffer - but that's orthogonal and the current race-recovery cost (one extra offer cycle) seems acceptable vs. the bug being fixed.
Can add a code comment explaining this if that would help future readers.
CodeRabbit replied:
@fpenezic, thank you for the thorough explanation — the reasoning is solid and I withdraw the concern.
The key invariant I missed is that OnNewOffer is gated by Guard's disconnect detection, so by the time a new offer arrives with a colliding peerID, the existing relay conn is stale by contract, not just by inference. The simultaneous-offer edge case has a bounded, self-healing cost (one extra offer cycle) which is clearly preferable to indefinite tunnel death.
A brief inline comment explaining the Guard invariant would definitely help future readers avoid the same confusion — something like:
```go
// ErrConnAlreadyExists here means the conn map was not cleaned up after the
// previous session (e.g. remote NAT IP change with no relay-level close event).
// Guard only triggers OnNewOffer after detecting a disconnect, so the existing
// entry is stale by contract. Close it and open a fresh conn to match the
// remote peer's state.
if errors.Is(err, relayClient.ErrConnAlreadyExists) {
```
✏️ Learnings added
Learnt from: fpenezic
Repo: netbirdio/netbird PR: 5866
File: client/internal/peer/worker_relay.go:66-75
Timestamp: 2026-04-21T06:14:04.853Z
Learning: In netbirdio/netbird (`client/internal/peer/worker_relay.go`), `WorkerRelay.OnNewOffer` is only invoked by the Guard component after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). Therefore, when `OpenConn` returns `ErrConnAlreadyExists` inside `OnNewOffer`, the existing relay conn entry is stale by contract — Guard serializes offers post-disconnect, so the collision cannot be from a racing healthy connection. Unconditionally calling `CloseConnByPeerKey` and retrying `OpenConn` is the correct and intentional behavior. Do not flag this as a potential teardown of a healthy relay connection.
fpenezic replied:
Reopening this - my "Guard invariant" argument was wrong. OnNewOffer fires on normal offer/answer exchanges too, not only after disconnect detection. Extended testing showed an infinite reconnect loop (3600+ iterations/hour) in multi-peer scenarios where your original concern manifested exactly as you described - healthy connections being torn down.
Converting to draft while I add proper staleness detection (cooldown or liveness check). Thanks for catching this early - the "bounded, self-healing cost" claim I made was also wrong, the cost compounds into a persistent loop.
CodeRabbit replied:
@fpenezic, thanks for following up — glad you caught this before it landed. Removing the incorrect learning now.
For staleness detection, a few concrete approaches to consider:
Option A — Explicit stale flag (cleanest)
Add an atomic.Bool relayConnStale field to WorkerRelay. WGWatcher/Guard sets it true when it detects a disconnect before queuing the new offer. OnNewOffer checks the flag and only calls CloseConnByPeerKey when it is set, then clears it. This makes the "stale by contract" invariant explicit in code rather than assumed.
Option B — Health probe on the existing conn
Before closing, attempt a zero-byte write or read with a short deadline on the existing relayedConn. If it fails immediately, it's stale. Adds a small RTT delay on the unhappy path but avoids tearing down healthy connections.
Option C — Session generation counter
Increment a generation counter each time Guard detects a disconnect. Pass it through the offer. OnNewOffer closes the existing conn only when the generation in the offer is newer than what was used to open the current conn. Requires a protocol change but is the most robust.
Option A is probably the least invasive — it keeps all the new plumbing inside WorkerRelay/Conn without touching the relay client or the wire protocol.
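To make Option A concrete, a rough sketch of the flag and the disconnect-side setter (field and helper names here are illustrative placeholders, assuming sync/atomic):

```go
type WorkerRelay struct {
	// ...existing fields...

	// relayConnStale records that the tracked relay conn for this peer is
	// known to be dead; checked (and cleared) by OnNewOffer before it evicts
	// the existing entry.
	relayConnStale atomic.Bool
}

// markRelayConnStale is called from the disconnect-detection path (e.g. when
// the WG handshake watcher fires) before a new offer is queued for this peer.
func (w *WorkerRelay) markRelayConnStale() {
	w.relayConnStale.Store(true)
}
```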
✏️ Learnings added
Learnt from: fpenezic
Repo: netbirdio/netbird PR: 5866
File: client/internal/peer/worker_relay.go:71-80
Timestamp: 2026-04-22T22:05:59.868Z
Learning: In netbirdio/netbird (`client/internal/peer/worker_relay.go`), `WorkerRelay.OnNewOffer` fires on ALL offer/answer exchanges (normal re-keying included), NOT only after Guard detects a disconnect. Unconditionally calling `CloseConnByPeerKey` when `OpenConn` returns `ErrConnAlreadyExists` inside `OnNewOffer` causes healthy relay connections to be torn down, producing an infinite reconnect loop (observed at 3600+ iterations/hour in multi-peer scenarios). A proper staleness check (e.g., an explicit stale flag set by WGWatcher before queuing the offer, or a health probe on the existing conn) is required before closing the existing connection.
⛔ Learnings removed
Learnt from: fpenezic
Repo: netbirdio/netbird PR: 5866
File: client/internal/peer/worker_relay.go:66-75
Timestamp: 2026-04-21T06:14:08.311Z
Learning: In netbirdio/netbird (`client/internal/peer/worker_relay.go`), `WorkerRelay.OnNewOffer` is only invoked by the Guard component after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). Therefore, when `OpenConn` returns `ErrConnAlreadyExists` inside `OnNewOffer`, the existing relay conn entry is stale by contract — Guard serializes offers post-disconnect, so the collision cannot be from a racing healthy connection. Unconditionally calling `CloseConnByPeerKey` and retrying `OpenConn` is the correct and intentional behavior. Do not flag this as a potential teardown of a healthy relay connection.
fpenezic replied:
Thanks for the concrete options - went through all three.
Option A (explicit stale flag) - cleanest surface, but has a gap: if the remote peer experiences a NAT IP change and re-sends an offer before our side detects the disconnect locally (e.g. our WG handshake watcher hasn't timed out yet), the flag will be false on our end and we'd skip the close - leaving us with the original bug. The flag correctly covers the case where we initiated the recovery (onWGDisconnected with priority == Relay), but not the case where the remote side's state changed and ours is lagging behind.
Option B (health probe) - I don't think this works for our setup. The relay conn is a proxied UDP pipe; a zero-byte read blocks until a packet arrives, and a write to the local socket will always succeed regardless of whether the relay server still has a live pipe to the remote peer. There's no application-layer liveness signal on the relay conn itself - the authoritative signal is the WG handshake, which is what the original bug is about. Happy to be corrected if I'm missing a cheap probe.
Option C (generation counter) - agreed this is the most robust, but it's a wire-protocol change that would need coordination across signal/management and older clients. Too broad for this PR.
Proposal - hybrid A + per-peer serialization:
- Explicit `relayConnStale atomic.Bool` on `WorkerRelay`, set in two places:
  - `Conn.onWGDisconnected` when `currentConnPriority == Relay` (we detected the disconnect)
  - Defensively after any explicit `CloseConn()` path
- `OnNewOffer` only takes the close-and-retry path when the flag is set, then clears it.
- Per-peer mutex on `OnNewOffer` so rapid back-to-back offers from the remote can't overlap with an in-flight close/reopen - the second offer waits for the first to finish, sees the fresh conn, and `OpenConn` returns an existing-but-healthy conn which we can reuse.
- Short cooldown (e.g. 5s) on the flag to prevent pathological loops if the flag somehow gets stuck in `true` after a successful reopen.
This trades off completeness (we no longer auto-recover from a remote-initiated state change we haven't noticed yet) for correctness (we stop tearing down healthy connections). The original bug was narrower than I framed it - it's specifically about our side holding a stale entry after remote went through a transport-layer change we should have observed but didn't. The flag captures exactly that window.
Does this match what you had in mind for Option A, or were you thinking of a different signal source for setting the flag?
CodeRabbit replied:
@fpenezic, the analysis is sharp and the proposal is sound. A few thoughts:
On the gap in Option A: You're right that it's real, but it's acceptable. In the remote-initiated NAT change where our WGWatcher hasn't fired yet, the worst case is now "delayed recovery (WGWatcher timeout, typically ~3 min)" instead of "indefinite tunnel death." WGWatcher will eventually fire, set the flag, and the next offer cycle recovers. That's a meaningful improvement over the status quo.
Additional stale flag signal: onRelayClientDisconnected → onRelayDisconnected is another natural place to set the flag. When the relay server does send a close event (the happy path), that definitively marks the conn as stale — no WGWatcher timeout needed. This closes part of the gap without adding complexity:
```go
// in onRelayClientDisconnected or wherever onRelayDisconnected is handled
w.relayConnStale.Store(true)
go w.conn.onRelayDisconnected()
```

This way the flag is set by three events: our WGWatcher fires, the relay server sends a close, or CloseConn() is called directly.
On the cooldown: If the flag is always cleared atomically inside OnNewOffer immediately after a successful reopen, there's no stuck-flag scenario — the only way it stays true is if OpenConn fails, in which case the conn didn't get re-established anyway. I'd skip the 5s cooldown and rely on compareAndSwap-style clear-on-success instead.
On the per-peer mutex: Worth checking whether OnNewOffer is already serialized by its caller. If the signal handler dispatches offers for a given peer on a single goroutine, the mutex is redundant. If offers from different signal messages can race (e.g., separate goroutines), the mutex is necessary. Do you know the call chain? If it's already serialized, you can drop step 3 entirely and simplify.
Overall the hybrid A approach is the right call — it's correct (no healthy-conn teardown) and recovers from the common case (we detect the disconnect first) without the protocol complexity of Option C.
fpenezic replied:
Verified the call chain - AsyncOfferListener already serializes per-peer (one goroutine at a time; newer offers during processing just update latest and run after current completes). So step 3 is redundant as you suspected. Dropping the mutex.
Final plan:
- `relayConnStale atomic.Bool` on `WorkerRelay`
- Set in three places:
  - `Conn.onWGDisconnected` (priority == Relay branch) - local WG watcher fired
  - `onRelayClientDisconnected` - relay server sent close event
  - `WorkerRelay.CloseConn()` - any explicit teardown
- `OnNewOffer`: close-and-retry only when flag is set, clear on successful reopen (compareAndSwap)
- No cooldown, no mutex.
The tradeoff: remote-initiated NAT change where our WG watcher hasn't fired yet means delayed recovery (~WG handshake timeout) instead of tunnel death. Net improvement over current master. Pushing shortly.
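For reference, the OnNewOffer gating in that plan could look roughly like this (a sketch of the plan above, not the final code; log messages and exact flow are placeholders):

```go
if errors.Is(err, relayClient.ErrConnAlreadyExists) {
	if !w.relayConnStale.Load() {
		// No staleness signal recorded (WG watcher, relay close event, or
		// explicit CloseConn): the existing conn is presumed healthy, so bail
		// out instead of tearing it down.
		w.log.Debugf("relay conn already exists and is not marked stale, keeping it")
		return
	}
	w.log.Infof("stale relay conn detected, closing and retrying")
	w.relayManager.CloseConnByPeerKey(srv, w.config.Key)
	relayedConn, err = w.relayManager.OpenConn(w.peerCtx, srv, w.config.Key)
	if err != nil {
		w.log.Errorf("failed to reopen connection via Relay after closing stale conn: %s", err)
		return
	}
	// Clear the flag only after the reopen succeeded.
	w.relayConnStale.CompareAndSwap(true, false)
}
```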
CodeRabbit replied:
@fpenezic, the plan is solid. A few notes for the implementation:
On the CAS: Since AsyncOfferListener already serializes OnNewOffer per-peer, there's no concurrent accessor on relayConnStale from that direction — a plain Load() / Store(false) is race-free and equivalent to CAS here. The only concurrent writers are onRelayClientDisconnected (goroutine) and CloseConn() (called from various places), but those are write-only (always Store(true)), so a torn read isn't possible. No need for the overhead of a CAS loop.
One edge case to double-check: onRelayClientDisconnected spawns go w.conn.onRelayDisconnected() — so the flag set and the offer dispatch are on separate goroutines. Since AsyncOfferListener already handles the "offer arrives before flag is set" race by queuing and re-running, this shouldn't be an issue, but worth a quick trace to confirm the ordering is guaranteed on that path.
CloseConn() as a signal source: Note that CloseConn() is also called inside onRelayConnectionIsReady if AddCloseListener fails. That path will correctly mark the conn stale before returning, which is the right behavior.
Overall the plan is correct and the tradeoff (delayed recovery on remote-side NAT change we haven't observed yet, vs. infinite reconnect loop) is clearly the right call. Looking forward to the push.
Converting to draft - I discovered a regression during extended testing.

Bug: the unconditional close-and-retry on ErrConnAlreadyExists tears down healthy relay connections, producing an infinite close → reopen → new offer → ErrConnAlreadyExists loop (3600+ iterations/hour in a multi-peer setup).

Root cause: my earlier reasoning that ErrConnAlreadyExists implies the existing entry is stale was wrong - OnNewOffer fires on every offer/answer exchange (normal re-keying included), not only after a detected disconnect, so the existing connection is usually healthy.

Plan: add a safeguard so the close-and-retry path only runs when the existing connection is actually stale, e.g. an explicit stale flag set by the disconnect handlers before the offer is processed, or a health probe on the existing conn (rough sketch below).

Will push a follow-up commit once I have verified the fix in the same reproduction environment (PPPoE reconnect + IP rotation).
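For illustration, a minimal Go sketch of the kind of gate described in the plan; the relayConnStale flag, the MarkStale helper, and the simplified OpenConn/CloseConnByPeerKey signatures are assumptions for this sketch, not the actual netbird code:

```go
// Sketch only: a stale-flag gate for the close-and-retry path.
package peer

import (
	"errors"
	"sync/atomic"
)

// ErrConnAlreadyExists stands in for the relay client's sentinel error.
var ErrConnAlreadyExists = errors.New("connection already exists")

// relayManager is a narrowed stand-in for the real relay manager.
type relayManager interface {
	OpenConn(peerKey string) error
	CloseConnByPeerKey(peerKey string) error
}

type WorkerRelay struct {
	relayManager relayManager
	// relayConnStale is set by disconnect signals (WG timeout while on relay,
	// relay server close, explicit CloseConn) and cleared after a reopen.
	relayConnStale atomic.Bool
}

// MarkStale records that the existing relay map entry is no longer backed by
// a live peer session.
func (w *WorkerRelay) MarkStale() { w.relayConnStale.Store(true) }

// OnNewOffer evicts and retries only when the entry was flagged stale; a
// healthy existing connection (e.g. a normal re-keying offer) is reused.
func (w *WorkerRelay) OnNewOffer(peerKey string) error {
	err := w.relayManager.OpenConn(peerKey)
	if !errors.Is(err, ErrConnAlreadyExists) {
		return err // nil on success, or an unrelated error
	}
	if !w.relayConnStale.Load() {
		return nil // duplicate of a live connection: keep it
	}
	if err := w.relayManager.CloseConnByPeerKey(peerKey); err != nil {
		return err
	}
	w.relayConnStale.Store(false)
	return w.relayManager.OpenConn(peerKey)
}
```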
…tunnel after NAT IP change
The previous fix unconditionally closed and reopened the existing relay conn whenever OnNewOffer hit ErrConnAlreadyExists. That caused an infinite tear-down/rebuild loop when the remote peer sent rapid successive offers (e.g. during reconnection): close → reopen → remote sees teardown → sends new offer → ErrConnAlreadyExists → close → loop.

Introduce a relayConnStale atomic flag that is set only when an event indicates the existing entry is no longer backed by a live peer session: local WG handshake timeout, relay server close, or explicit CloseConn. OnNewOffer only tears down and reopens when the flag is set; otherwise it reuses the existing healthy connection.

Signal sources for the flag:
- conn.onWGDisconnected (Relay path) → MarkStale before CloseConn
- WorkerRelay.CloseConn → marks stale on close
- WorkerRelay.onRelayClientDisconnected → marks stale on server close
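A minimal sketch of the three signal sources; WorkerRelay here repeats the stale flag from the earlier sketch so the snippet is self-contained, and Conn, currentConnPriority, and connPriorityRelay are simplified assumptions. Only the mark-before-close ordering is the point:

```go
// Sketch only: call sites that flag the relay entry as stale.
package peer

import "sync/atomic"

const connPriorityRelay = 1

type WorkerRelay struct {
	relayConnStale atomic.Bool
}

func (w *WorkerRelay) MarkStale() { w.relayConnStale.Store(true) }

// CloseConn: an explicit local close always leaves the entry marked stale.
func (w *WorkerRelay) CloseConn() {
	w.relayConnStale.Store(true)
	// ... close the underlying relayed net.Conn here ...
}

// onRelayClientDisconnected: the relay server dropped this client, so any
// entry it held for the peer is gone on the remote side as well.
func (w *WorkerRelay) onRelayClientDisconnected() {
	w.relayConnStale.Store(true)
}

type Conn struct {
	currentConnPriority int
	workerRelay         *WorkerRelay
}

// onWGDisconnected (Relay path): mark the entry stale before CloseConn so a
// remote offer racing the teardown still sees the flag in OnNewOffer.
func (c *Conn) onWGDisconnected() {
	if c.currentConnPriority == connPriorityRelay {
		c.workerRelay.MarkStale()
	}
	c.workerRelay.CloseConn()
}
```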
When WireGuard handshake times out while running over ICE, onWGDisconnected calls workerICE.Close(). Close() sets w.agent=nil synchronously before pion's ICE library fires the asynchronous ConnectionStateClosed event. By the time onConnectionStateChange runs closeAgent(), the `w.agent == agent` guard fails (w.agent is already nil) and the session ID is not rotated.

Without session ID rotation, the next ICE offer carries the same local session ID. The remote peer in OnNewOffer compares remoteSessionID against the incoming offer's sessionID, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: 30s reconnect loop after a NAT/conntrack event with the same offer session ID logged ~20+ times in a row, WG handshake never recovering despite "ICE connection succeeded" on every cycle.

Mirror the existing Relay-path behavior: explicitly rotate the session ID before closing the ICE worker so the remote peer always recreates its agent after a WG timeout on ICE.
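A minimal sketch of this fix under assumed, simplified WorkerICE/Conn types; the real worker_ice.go locking and session-ID generation will differ, and only the rotate-before-Close ordering mirrors the commit:

```go
// Sketch only: rotate the session ID explicitly instead of relying on pion's
// asynchronous ConnectionStateClosed callback.
package peer

import (
	"crypto/rand"
	"encoding/binary"
	"sync"
)

type WorkerICE struct {
	mu              sync.Mutex
	sessionID       uint64
	remoteSessionID uint64
}

// ResetSessionID rotates the local session ID and forgets the remote one, so
// the next offer/answer exchange cannot be mistaken for the previous session.
func (w *WorkerICE) ResetSessionID() {
	w.mu.Lock()
	defer w.mu.Unlock()
	var b [8]byte
	_, _ = rand.Read(b[:])
	w.sessionID = binary.LittleEndian.Uint64(b[:])
	w.remoteSessionID = 0
}

// Close is a stand-in for the real worker teardown (stop the agent, nil out
// w.agent, etc.).
func (w *WorkerICE) Close() {}

type Conn struct {
	workerICE *WorkerICE
}

// onWGDisconnected (ICE path): rotate before Close() so the next offer always
// carries a fresh session ID, regardless of the async Closed-event race.
func (c *Conn) onWGDisconnected() {
	c.workerICE.ResetSessionID()
	c.workerICE.Close()
}
```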
b0b9b85 to
c0b3e4e
Compare



Summary
Fix peer reconnection loops after a network event (PPPoE reconnect, NAT/conntrack flush, IP rotation) leaves transport state stale on either the relay or ICE path. Both manifest as the WireGuard tunnel going silent while the client UI reports `Connected`, with the connection never recovering on its own.

This PR addresses two distinct but related failure modes observed in production on a peer behind PPPoE NAT (Raspberry Pi) talking to a peer with a public IP (Oracle Cloud).
Problem 1 - Relay path: stale `ErrConnAlreadyExists` reuse

When the peer is running over relay and a network event invalidates the existing relay session, `WorkerRelay.OnNewOffer` calls `relayManager.OpenConn`, which returns `ErrConnAlreadyExists` because the relay client still holds a map entry for the peer key. The previous behavior was to silently bail out and reuse that entry - which is dead.

Fix (commits 1–4):
- On `ErrConnAlreadyExists`, close the existing relay conn and reopen it.
- Route `CloseConnByPeerKey` to the correct relay client (home vs. foreign server) - addresses CodeRabbit feedback (see the sketch below).
- Gate the teardown behind a `relayConnStale` `atomic.Bool` flag so we only tear down when something has signaled that the entry is no longer backed by a live peer session. Without this gate, rapid successive offers from the remote peer cause an infinite tear-down/rebuild loop (observed: 3608 cycles in ~1 hour, ~9 Mbit/s of constant traffic).
- The flag is set by `conn.onWGDisconnected` (Relay path), `WorkerRelay.CloseConn`, `WorkerRelay.onRelayClientDisconnected`.
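A rough sketch of the home-vs-foreign routing, assuming a Manager that tracks a home relay client plus foreign relay clients keyed by server address; this layout is modelled on the review note about shared/relay/client/manager.go, not the actual struct:

```go
// Sketch only: route CloseConnByPeerKey to the relay client that holds the entry.
package client

import "sync"

// RelayClient is a stand-in for the real relay client holding per-peer conns.
type RelayClient struct{}

// CloseConnByPeerKey closes and removes the conn stored under peerKey.
func (c *RelayClient) CloseConnByPeerKey(peerKey string) error {
	// ... look up, close and delete the relayed conn ...
	return nil
}

type relayTrack struct{ relayClient *RelayClient }

type Manager struct {
	mu           sync.Mutex
	relayClient  *RelayClient           // connection to the home relay server
	relayClients map[string]*relayTrack // foreign relay servers, keyed by address
}

// CloseConnByPeerKey routes the close to the relay client that actually holds
// the stale entry: the foreign relay tracked under serverAddress if present,
// otherwise the home relay client.
func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if t, ok := m.relayClients[serverAddress]; ok && t.relayClient != nil {
		return t.relayClient.CloseConnByPeerKey(peerKey)
	}
	if m.relayClient == nil {
		return nil // not connected to a home relay; nothing to close
	}
	return m.relayClient.CloseConnByPeerKey(peerKey)
}
```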
Problem 2 - ICE path: session ID not rotated on WG timeout

When WireGuard handshake times out while running over ICE, `onWGDisconnected` calls `workerICE.Close()`. `Close()` sets `w.agent = nil` synchronously before pion's ICE library fires the asynchronous `ConnectionStateClosed` event. By the time `onConnectionStateChange` runs `closeAgent()`, the `w.agent == agent` guard fails (`w.agent` is already `nil`) and the session ID is not rotated.

Without rotation, the next ICE offer carries the same local session ID. The remote peer in `OnNewOffer` compares `remoteSessionID` against the incoming offer's `SessionID`, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state (see the sketch below).

Observed in production: 30s reconnect loop with the same offer session ID logged 70+ times in a row, "ICE connection succeeded" on every cycle, WG handshake never recovering. UI shows `Connected, P2P` while the last successful WG handshake is 40+ minutes old.
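For illustration, a minimal sketch of the remote-side check that the rotation defeats; the Offer/WorkerICE shapes here are assumptions, not the real worker_ice.go:

```go
// Sketch only: an unchanged session ID short-circuits agent recreation.
package peer

// Offer carries the sender's ICE session ID (heavily simplified).
type Offer struct {
	SessionID uint64
}

type WorkerICE struct {
	remoteSessionID uint64
	agent           *struct{} // stand-in for the pion ICE agent
}

// OnNewOffer reuses the existing agent when the offer carries the session ID
// already on record; only a changed (rotated) session ID forces recreation.
func (w *WorkerICE) OnNewOffer(offer Offer) {
	if w.agent != nil && offer.SessionID == w.remoteSessionID {
		// Same session ID as last time: keep the current agent and its
		// candidates. After a NAT event those candidates are stale, which is
		// the failure mode described above.
		return
	}
	w.remoteSessionID = offer.SessionID
	w.agent = new(struct{}) // recreate the agent and gather fresh candidates
}
```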
Fix (commit 5): Rotate the session ID explicitly in `onWGDisconnected` for the ICE case (mirroring the existing Relay-path behavior we added in commit 1) so the remote peer always recreates its ICE agent after a WG timeout on ICE.
Test plan

- `go build ./...` passes
- Manual reconnect exercised with `netbird down`/`netbird up`
- Real PPPoE reconnect + IP rotation reproduction (waiting for the next nightly PPPoE cycle)

Notes
The two problems share the same trigger (a network event invalidates per-peer state without invalidating the local control plane) but live in different transports, so the fixes are orthogonal. Each is necessary on its own: the relay fix does not help when the stale state lives in the ICE session ID, and the ICE fix does not help when a dead relay map entry keeps being reused.
Both fixes are strictly defensive - they can only fire on `ErrConnAlreadyExists` (relay) or after a WG handshake timeout (ICE), so there's no behavior change on the happy path.