fix: close stale relay connection on ErrConnAlreadyExists to recover … #5866
fpenezic wants to merge 5 commits into netbirdio:main from
Conversation
📝 Walkthrough
Adds WorkerICE.ResetSessionID(), calls it when a relay-backed WireGuard connection disconnects, and introduces CloseConnByPeerKey(...) in the relay client/manager plus eviction-and-retry logic when OpenConn returns ErrConnAlreadyExists.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Conn
    participant WorkerRelay
    participant RelayClient
    participant WorkerICE
    Conn->>WorkerRelay: onWGDisconnected (currentConnPriority == Relay)
    WorkerRelay->>RelayClient: request CloseConnByPeerKey(peerKey)
    RelayClient-->>WorkerRelay: removed/closed ack
    WorkerRelay->>WorkerICE: ResetSessionID()
    WorkerICE-->>WorkerRelay: sessionID refreshed, remoteSessionID cleared
    WorkerRelay->>Conn: continue disconnect handling
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.
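A rough sketch of the manager-side change this prompt describes (struct and field names such as relayClientsMutex are inferred from the prompt, not taken from the actual code):

```go
// CloseConnByPeerKey closes the tracked conn entry for peerKey, preferring the
// relay client that serves the given server address when one is tracked.
func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) {
	m.relayClientsMutex.Lock()
	entry, ok := m.relayClients[serverAddress]
	m.relayClientsMutex.Unlock()

	if ok && entry.relayClient != nil {
		entry.relayClient.CloseConnByPeerKey(peerKey)
		return
	}

	// Fall back to the default relay client when the server address is not
	// tracked separately, preserving the existing nil check.
	if m.relayClient != nil {
		m.relayClient.CloseConnByPeerKey(peerKey)
	}
}
```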
📒 Files selected for processing (5)
- client/internal/peer/conn.go
- client/internal/peer/worker_ice.go
- client/internal/peer/worker_relay.go
- shared/relay/client/client.go
- shared/relay/client/manager.go
fpenezic force-pushed from a9783e0 to 7d55ac2
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/peer/worker_relay.go`:
- Around line 66-75: The code currently treats relayClient.ErrConnAlreadyExists
as proof the existing relay is stale and unconditionally calls
w.relayManager.CloseConnByPeerKey; instead, check the existing connection's
health/state before closing or prevent concurrent OnNewOffer for the same peer.
Modify the branch handling relayClient.ErrConnAlreadyExists to first query the
existing connection via w.relayManager (e.g., a method like GetConnByPeerKey or
HasActiveConn) and only call w.relayManager.CloseConnByPeerKey when that
connection reports closed/stale (or fail a liveness check); otherwise reuse the
active connection or bail out/serialize the new offer (e.g., with a per-peer
lock) rather than tearing down a healthy relay, keeping the subsequent OpenConn
call only when a close was actually performed.
📒 Files selected for processing (5)
- client/internal/peer/conn.go
- client/internal/peer/worker_ice.go
- client/internal/peer/worker_relay.go
- shared/relay/client/client.go
- shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (3)
- client/internal/peer/conn.go
- shared/relay/client/manager.go
- shared/relay/client/client.go
```diff
 	if errors.Is(err, relayClient.ErrConnAlreadyExists) {
-		w.log.Debugf("handled offer by reusing existing relay connection")
+		w.log.Infof("relay conn already exists, closing stale conn and retrying")
+		w.relayManager.CloseConnByPeerKey(srv, w.config.Key)
+		relayedConn, err = w.relayManager.OpenConn(w.peerCtx, srv, w.config.Key)
+		if err != nil {
+			w.log.Errorf("failed to reopen connection via Relay after closing stale: %s", err)
+			return
+		}
 	} else {
 		w.log.Errorf("failed to open connection via Relay: %s", err)
```
🧩 Analysis chain
(Collapsed: a series of fd/rg repository searches for ErrConnAlreadyExists, OpenConn, CloseConnByPeerKey, OnNewOffer, onRelayConnectionIsReady, connContainer, and related call sites in client/internal/peer and shared/relay/client; raw script outputs omitted.)
ErrConnAlreadyExists does not prove a relay connection is stale.
This error only indicates an entry exists in the relay client's connection map; it carries no information about connection health or staleness. With the unconditional CloseConnByPeerKey call, a healthy relay connection from a prior handshake can be forcibly torn down when a new offer arrives and encounters the same peerID already in the map. Add a disconnected/stale state check before closing, or ensure OnNewOffer cannot run while a live relay connection for this peer exists.
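For reference, the check behind this error is roughly a map-key collision with no liveness information attached; the snippet below is illustrative only (the connection-map field names are assumptions, not the actual client internals):

```go
// Inside the relay client's OpenConn path (illustrative sketch):
c.mu.Lock()
defer c.mu.Unlock()
if _, ok := c.conns[peerID]; ok {
	// The entry may belong to a live pipe or to a dead one left behind by a
	// missed close event; the error does not distinguish the two.
	return nil, ErrConnAlreadyExists
}
```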
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@client/internal/peer/worker_relay.go` around lines 66 - 75, The code
currently treats relayClient.ErrConnAlreadyExists as proof the existing relay is
stale and unconditionally calls w.relayManager.CloseConnByPeerKey; instead,
check the existing connection's health/state before closing or prevent
concurrent OnNewOffer for the same peer. Modify the branch handling
relayClient.ErrConnAlreadyExists to first query the existing connection via
w.relayManager (e.g., a method like GetConnByPeerKey or HasActiveConn) and only
call w.relayManager.CloseConnByPeerKey when that connection reports closed/stale
(or fail a liveness check); otherwise reuse the active connection or bail
out/serialize the new offer (e.g., with a per-peer lock) rather than tearing
down a healthy relay, keeping the subsequent OpenConn call only when a close was
actually performed.
fpenezic replied:
Thanks for flagging. The behavior is intentional - reasoning:
WorkerRelay.OnNewOffer is only invoked when the remote peer sends a fresh offer through the Guard component. Guard only sends offers after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). If we're receiving a new offer, the remote peer already believes the existing connection is broken.
Given that:

- Remote peer has effectively invalidated the connection by initiating re-handshake. Reusing our side's stale conn entry (the original bug) leaves WG packets going into a dead pipe. Tearing it down and opening a fresh one matches the remote peer's state.
- `ErrConnAlreadyExists` after an offer is the pathological case we're fixing. Under normal operation, `OpenConn` doesn't collide - each offer/answer cycle happens on a fresh session. The only time we hit `ErrConnAlreadyExists` is when the map wasn't cleaned up (e.g. remote NAT IP change without a relay-level close event reaching us).
- Even in the theoretical race (both peers send offers simultaneously and one just established a healthy conn that we then close), the cost is one extra round-trip: the next offer will re-establish. Compare to the current failure mode: indefinite tunnel death until manual restart.
- A liveness check on the relay conn isn't straightforward - the connection is a proxied UDP pipe with no heartbeat of its own. The authoritative liveness signal is the WG handshake, which `WGWatcher` already owns. Adding a separate liveness probe on the relay conn would duplicate that logic.
If the concern is the simultaneous-offer race, a cleaner guard would be per-peer serialization of OnNewOffer - but that's orthogonal and the current race-recovery cost (one extra offer cycle) seems acceptable vs. the bug being fixed.
Can add a code comment explaining this if that would help future readers.
CodeRabbit replied:
@fpenezic, thank you for the thorough explanation — the reasoning is solid and I withdraw the concern.
The key invariant I missed is that OnNewOffer is gated by Guard's disconnect detection, so by the time a new offer arrives with a colliding peerID, the existing relay conn is stale by contract, not just by inference. The simultaneous-offer edge case has a bounded, self-healing cost (one extra offer cycle) which is clearly preferable to indefinite tunnel death.
A brief inline comment explaining the Guard invariant would definitely help future readers avoid the same confusion — something like:
```go
// ErrConnAlreadyExists here means the conn map was not cleaned up after the
// previous session (e.g. remote NAT IP change with no relay-level close event).
// Guard only triggers OnNewOffer after detecting a disconnect, so the existing
// entry is stale by contract. Close it and open a fresh conn to match the
// remote peer's state.
if errors.Is(err, relayClient.ErrConnAlreadyExists) {
```
✏️ Learnings added
Learnt from: fpenezic
Repo: netbirdio/netbird PR: 5866
File: client/internal/peer/worker_relay.go:66-75
Timestamp: 2026-04-21T06:14:04.853Z
Learning: In netbirdio/netbird (`client/internal/peer/worker_relay.go`), `WorkerRelay.OnNewOffer` is only invoked by the Guard component after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). Therefore, when `OpenConn` returns `ErrConnAlreadyExists` inside `OnNewOffer`, the existing relay conn entry is stale by contract — Guard serializes offers post-disconnect, so the collision cannot be from a racing healthy connection. Unconditionally calling `CloseConnByPeerKey` and retrying `OpenConn` is the correct and intentional behavior. Do not flag this as a potential teardown of a healthy relay connection.
fpenezic replied:
Reopening this - my "Guard invariant" argument was wrong. OnNewOffer fires on normal offer/answer exchanges too, not only after disconnect detection. Extended testing showed an infinite reconnect loop (3600+ iterations/hour) in multi-peer scenarios where your original concern manifested exactly as you described - healthy connections being torn down.
Converting to draft while I add proper staleness detection (cooldown or liveness check). Thanks for catching this early - the "bounded, self-healing cost" claim I made was also wrong, the cost compounds into a persistent loop.
CodeRabbit replied:
@fpenezic, thanks for following up — glad you caught this before it landed. Removing the incorrect learning now.
For staleness detection, a few concrete approaches to consider:
Option A — Explicit stale flag (cleanest)
Add an atomic.Bool relayConnStale field to WorkerRelay. WGWatcher/Guard sets it true when it detects a disconnect before queuing the new offer. OnNewOffer checks the flag and only calls CloseConnByPeerKey when it is set, then clears it. This makes the "stale by contract" invariant explicit in code rather than assumed.
Option B — Health probe on the existing conn
Before closing, attempt a zero-byte write or read with a short deadline on the existing relayedConn. If it fails immediately, it's stale. Adds a small RTT delay on the unhappy path but avoids tearing down healthy connections.
Option C — Session generation counter
Increment a generation counter each time Guard detects a disconnect. Pass it through the offer. OnNewOffer closes the existing conn only when the generation in the offer is newer than what was used to open the current conn. Requires a protocol change but is the most robust.
Option A is probably the least invasive — it keeps all the new plumbing inside WorkerRelay/Conn without touching the relay client or the wire protocol.
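To make Option A concrete, a rough sketch of the flag and the disconnect-side setter (field and helper names here are illustrative placeholders, assuming sync/atomic):

```go
type WorkerRelay struct {
	// ...existing fields...

	// relayConnStale records that the tracked relay conn for this peer is
	// known to be dead; checked (and cleared) by OnNewOffer before it evicts
	// the existing entry.
	relayConnStale atomic.Bool
}

// markRelayConnStale is called from the disconnect-detection path (e.g. when
// the WG handshake watcher fires) before a new offer is queued for this peer.
func (w *WorkerRelay) markRelayConnStale() {
	w.relayConnStale.Store(true)
}
```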
✏️ Learnings added
Learnt from: fpenezic
Repo: netbirdio/netbird PR: 5866
File: client/internal/peer/worker_relay.go:71-80
Timestamp: 2026-04-22T22:05:59.868Z
Learning: In netbirdio/netbird (`client/internal/peer/worker_relay.go`), `WorkerRelay.OnNewOffer` fires on ALL offer/answer exchanges (normal re-keying included), NOT only after Guard detects a disconnect. Unconditionally calling `CloseConnByPeerKey` when `OpenConn` returns `ErrConnAlreadyExists` inside `OnNewOffer` causes healthy relay connections to be torn down, producing an infinite reconnect loop (observed at 3600+ iterations/hour in multi-peer scenarios). A proper staleness check (e.g., an explicit stale flag set by WGWatcher before queuing the offer, or a health probe on the existing conn) is required before closing the existing connection.
⛔ Learnings removed
Learnt from: fpenezic
Repo: netbirdio/netbird PR: 5866
File: client/internal/peer/worker_relay.go:66-75
Timestamp: 2026-04-21T06:14:08.311Z
Learning: In netbirdio/netbird (`client/internal/peer/worker_relay.go`), `WorkerRelay.OnNewOffer` is only invoked by the Guard component after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). Therefore, when `OpenConn` returns `ErrConnAlreadyExists` inside `OnNewOffer`, the existing relay conn entry is stale by contract — Guard serializes offers post-disconnect, so the collision cannot be from a racing healthy connection. Unconditionally calling `CloseConnByPeerKey` and retrying `OpenConn` is the correct and intentional behavior. Do not flag this as a potential teardown of a healthy relay connection.
fpenezic replied:
Thanks for the concrete options - went through all three.
Option A (explicit stale flag) - cleanest surface, but has a gap: if the remote peer experiences a NAT IP change and re-sends an offer before our side detects the disconnect locally (e.g. our WG handshake watcher hasn't timed out yet), the flag will be false on our end and we'd skip the close - leaving us with the original bug. The flag correctly covers the case where we initiated the recovery (onWGDisconnected with priority == Relay), but not the case where the remote side's state changed and ours is lagging behind.
Option B (health probe) - I don't think this works for our setup. The relay conn is a proxied UDP pipe; a zero-byte read blocks until a packet arrives, and a write to the local socket will always succeed regardless of whether the relay server still has a live pipe to the remote peer. There's no application-layer liveness signal on the relay conn itself - the authoritative signal is the WG handshake, which is what the original bug is about. Happy to be corrected if I'm missing a cheap probe.
Option C (generation counter) - agreed this is the most robust, but it's a wire-protocol change that would need coordination across signal/management and older clients. Too broad for this PR.
Proposal - hybrid A + per-peer serialization:
- Explicit `relayConnStale atomic.Bool` on `WorkerRelay`, set in two places:
  - `Conn.onWGDisconnected` when `currentConnPriority == Relay` (we detected the disconnect)
  - Defensively after any explicit `CloseConn()` path
- `OnNewOffer` only takes the close-and-retry path when the flag is set, then clears it.
- Per-peer mutex on `OnNewOffer` so rapid back-to-back offers from the remote can't overlap with an in-flight close/reopen - the second offer waits for the first to finish, sees the fresh conn, and `OpenConn` returns an existing-but-healthy conn which we can reuse.
- Short cooldown (e.g. 5s) on the flag to prevent pathological loops if the flag somehow gets stuck in `true` after a successful reopen.
This trades off completeness (we no longer auto-recover from a remote-initiated state change we haven't noticed yet) for correctness (we stop tearing down healthy connections). The original bug was narrower than I framed it - it's specifically about our side holding a stale entry after remote went through a transport-layer change we should have observed but didn't. The flag captures exactly that window.
Does this match what you had in mind for Option A, or were you thinking of a different signal source for setting the flag?
CodeRabbit replied:
@fpenezic, the analysis is sharp and the proposal is sound. A few thoughts:
On the gap in Option A: You're right that it's real, but it's acceptable. In the remote-initiated NAT change where our WGWatcher hasn't fired yet, the worst case is now "delayed recovery (WGWatcher timeout, typically ~3 min)" instead of "indefinite tunnel death." WGWatcher will eventually fire, set the flag, and the next offer cycle recovers. That's a meaningful improvement over the status quo.
Additional stale flag signal: onRelayClientDisconnected → onRelayDisconnected is another natural place to set the flag. When the relay server does send a close event (the happy path), that definitively marks the conn as stale — no WGWatcher timeout needed. This closes part of the gap without adding complexity:
```go
// in onRelayClientDisconnected or wherever onRelayDisconnected is handled
w.relayConnStale.Store(true)
go w.conn.onRelayDisconnected()
```

This way the flag is set by three events: our WGWatcher fires, the relay server sends a close, or CloseConn() is called directly.
On the cooldown: If the flag is always cleared atomically inside OnNewOffer immediately after a successful reopen, there's no stuck-flag scenario — the only way it stays true is if OpenConn fails, in which case the conn didn't get re-established anyway. I'd skip the 5s cooldown and rely on compareAndSwap-style clear-on-success instead.
On the per-peer mutex: Worth checking whether OnNewOffer is already serialized by its caller. If the signal handler dispatches offers for a given peer on a single goroutine, the mutex is redundant. If offers from different signal messages can race (e.g., separate goroutines), the mutex is necessary. Do you know the call chain? If it's already serialized, you can drop step 3 entirely and simplify.
Overall the hybrid A approach is the right call — it's correct (no healthy-conn teardown) and recovers from the common case (we detect the disconnect first) without the protocol complexity of Option C.
fpenezic replied:
Verified the call chain - AsyncOfferListener already serializes per-peer (one goroutine at a time; newer offers during processing just update latest and run after current completes). So step 3 is redundant as you suspected. Dropping the mutex.
Final plan:
- `relayConnStale atomic.Bool` on `WorkerRelay`
- Set in three places:
  - `Conn.onWGDisconnected` (priority == Relay branch) - local WG watcher fired
  - `onRelayClientDisconnected` - relay server sent close event
  - `WorkerRelay.CloseConn()` - any explicit teardown
- `OnNewOffer`: close-and-retry only when flag is set, clear on successful reopen (compareAndSwap)
- No cooldown, no mutex.
The tradeoff: remote-initiated NAT change where our WG watcher hasn't fired yet means delayed recovery (~WG handshake timeout) instead of tunnel death. Net improvement over current master. Pushing shortly.
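For reference, the OnNewOffer gating in that plan could look roughly like this (a sketch of the plan above, not the final code; log messages and exact flow are placeholders):

```go
if errors.Is(err, relayClient.ErrConnAlreadyExists) {
	if !w.relayConnStale.Load() {
		// No staleness signal recorded (WG watcher, relay close event, or
		// explicit CloseConn): the existing conn is presumed healthy, so bail
		// out instead of tearing it down.
		w.log.Debugf("relay conn already exists and is not marked stale, keeping it")
		return
	}
	w.log.Infof("stale relay conn detected, closing and retrying")
	w.relayManager.CloseConnByPeerKey(srv, w.config.Key)
	relayedConn, err = w.relayManager.OpenConn(w.peerCtx, srv, w.config.Key)
	if err != nil {
		w.log.Errorf("failed to reopen connection via Relay after closing stale conn: %s", err)
		return
	}
	// Clear the flag only after the reopen succeeded.
	w.relayConnStale.CompareAndSwap(true, false)
}
```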
CodeRabbit replied:
@fpenezic, the plan is solid. A few notes for the implementation:
On the CAS: Since AsyncOfferListener already serializes OnNewOffer per-peer, there's no concurrent accessor on relayConnStale from that direction — a plain Load() / Store(false) is race-free and equivalent to CAS here. The only concurrent writers are onRelayClientDisconnected (goroutine) and CloseConn() (called from various places), but those are write-only (always Store(true)), so a torn read isn't possible. No need for the overhead of a CAS loop.
One edge case to double-check: onRelayClientDisconnected spawns go w.conn.onRelayDisconnected() — so the flag set and the offer dispatch are on separate goroutines. Since AsyncOfferListener already handles the "offer arrives before flag is set" race by queuing and re-running, this shouldn't be an issue, but worth a quick trace to confirm the ordering is guaranteed on that path.
CloseConn() as a signal source: Note that CloseConn() is also called inside onRelayConnectionIsReady if AddCloseListener fails. That path will correctly mark the conn stale before returning, which is the right behavior.
Overall the plan is correct and the tradeoff (delayed recovery on remote-side NAT change we haven't observed yet, vs. infinite reconnect loop) is clearly the right call. Looking forward to the push.
Converting to draft - I discovered a regression during extended testing.

Bug: the unconditional close-and-retry on ErrConnAlreadyExists tears down healthy relay connections, producing an infinite close → reopen → new offer → ErrConnAlreadyExists loop (3600+ iterations/hour in a multi-peer setup).

Root cause: my earlier reasoning that ErrConnAlreadyExists implies the existing entry is stale was wrong - OnNewOffer fires on every offer/answer exchange (normal re-keying included), not only after a detected disconnect, so the existing connection is usually healthy.

Plan: add a safeguard so the close-and-retry path only runs when the existing connection is actually stale, e.g. an explicit stale flag set by the disconnect handlers before the offer is processed, or a health probe on the existing conn (rough sketch below).

Will push a follow-up commit once I have verified the fix in the same reproduction environment (PPPoE reconnect + IP rotation).
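For illustration, a minimal Go sketch of the kind of gate described in the plan; the relayConnStale flag, the MarkStale helper, and the simplified OpenConn/CloseConnByPeerKey signatures are assumptions for this sketch, not the actual netbird code:

```go
// Sketch only: a stale-flag gate for the close-and-retry path.
package peer

import (
	"errors"
	"sync/atomic"
)

// ErrConnAlreadyExists stands in for the relay client's sentinel error.
var ErrConnAlreadyExists = errors.New("connection already exists")

// relayManager is a narrowed stand-in for the real relay manager.
type relayManager interface {
	OpenConn(peerKey string) error
	CloseConnByPeerKey(peerKey string) error
}

type WorkerRelay struct {
	relayManager relayManager
	// relayConnStale is set by disconnect signals (WG timeout while on relay,
	// relay server close, explicit CloseConn) and cleared after a reopen.
	relayConnStale atomic.Bool
}

// MarkStale records that the existing relay map entry is no longer backed by
// a live peer session.
func (w *WorkerRelay) MarkStale() { w.relayConnStale.Store(true) }

// OnNewOffer evicts and retries only when the entry was flagged stale; a
// healthy existing connection (e.g. a normal re-keying offer) is reused.
func (w *WorkerRelay) OnNewOffer(peerKey string) error {
	err := w.relayManager.OpenConn(peerKey)
	if !errors.Is(err, ErrConnAlreadyExists) {
		return err // nil on success, or an unrelated error
	}
	if !w.relayConnStale.Load() {
		return nil // duplicate of a live connection: keep it
	}
	if err := w.relayManager.CloseConnByPeerKey(peerKey); err != nil {
		return err
	}
	w.relayConnStale.Store(false)
	return w.relayManager.OpenConn(peerKey)
}
```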
…tunnel after NAT IP change
The previous fix unconditionally closed and reopened the existing relay conn whenever OnNewOffer hit ErrConnAlreadyExists. That caused an infinite tear-down/rebuild loop when the remote peer sent rapid successive offers (e.g. during reconnection): close → reopen → remote sees teardown → sends new offer → ErrConnAlreadyExists → close → loop.

Introduce a relayConnStale atomic flag that is set only when an event indicates the existing entry is no longer backed by a live peer session: local WG handshake timeout, relay server close, or explicit CloseConn. OnNewOffer only tears down and reopens when the flag is set; otherwise it reuses the existing healthy connection.

Signal sources for the flag:
- conn.onWGDisconnected (Relay path) → MarkStale before CloseConn
- WorkerRelay.CloseConn → marks stale on close
- WorkerRelay.onRelayClientDisconnected → marks stale on server close
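A minimal sketch of the three signal sources; WorkerRelay here repeats the stale flag from the earlier sketch so the snippet is self-contained, and Conn, currentConnPriority, and connPriorityRelay are simplified assumptions. Only the mark-before-close ordering is the point:

```go
// Sketch only: call sites that flag the relay entry as stale.
package peer

import "sync/atomic"

const connPriorityRelay = 1

type WorkerRelay struct {
	relayConnStale atomic.Bool
}

func (w *WorkerRelay) MarkStale() { w.relayConnStale.Store(true) }

// CloseConn: an explicit local close always leaves the entry marked stale.
func (w *WorkerRelay) CloseConn() {
	w.relayConnStale.Store(true)
	// ... close the underlying relayed net.Conn here ...
}

// onRelayClientDisconnected: the relay server dropped this client, so any
// entry it held for the peer is gone on the remote side as well.
func (w *WorkerRelay) onRelayClientDisconnected() {
	w.relayConnStale.Store(true)
}

type Conn struct {
	currentConnPriority int
	workerRelay         *WorkerRelay
}

// onWGDisconnected (Relay path): mark the entry stale before CloseConn so a
// remote offer racing the teardown still sees the flag in OnNewOffer.
func (c *Conn) onWGDisconnected() {
	if c.currentConnPriority == connPriorityRelay {
		c.workerRelay.MarkStale()
	}
	c.workerRelay.CloseConn()
}
```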
When WireGuard handshake times out while running over ICE, onWGDisconnected calls workerICE.Close(). Close() sets w.agent=nil synchronously before pion's ICE library fires the asynchronous ConnectionStateClosed event. By the time onConnectionStateChange runs closeAgent(), the `w.agent == agent` guard fails (w.agent is already nil) and the session ID is not rotated.

Without session ID rotation, the next ICE offer carries the same local session ID. The remote peer in OnNewOffer compares remoteSessionID against the incoming offer's sessionID, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: 30s reconnect loop after a NAT/conntrack event with the same offer session ID logged ~20+ times in a row, WG handshake never recovering despite "ICE connection succeeded" on every cycle.

Mirror the existing Relay-path behavior: explicitly rotate the session ID before closing the ICE worker so the remote peer always recreates its agent after a WG timeout on ICE.
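A minimal sketch of this fix under assumed, simplified WorkerICE/Conn types; the real worker_ice.go locking and session-ID generation will differ, and only the rotate-before-Close ordering mirrors the commit:

```go
// Sketch only: rotate the session ID explicitly instead of relying on pion's
// asynchronous ConnectionStateClosed callback.
package peer

import (
	"crypto/rand"
	"encoding/binary"
	"sync"
)

type WorkerICE struct {
	mu              sync.Mutex
	sessionID       uint64
	remoteSessionID uint64
}

// ResetSessionID rotates the local session ID and forgets the remote one, so
// the next offer/answer exchange cannot be mistaken for the previous session.
func (w *WorkerICE) ResetSessionID() {
	w.mu.Lock()
	defer w.mu.Unlock()
	var b [8]byte
	_, _ = rand.Read(b[:])
	w.sessionID = binary.LittleEndian.Uint64(b[:])
	w.remoteSessionID = 0
}

// Close is a stand-in for the real worker teardown (stop the agent, nil out
// w.agent, etc.).
func (w *WorkerICE) Close() {}

type Conn struct {
	workerICE *WorkerICE
}

// onWGDisconnected (ICE path): rotate before Close() so the next offer always
// carries a fresh session ID, regardless of the async Closed-event race.
func (c *Conn) onWGDisconnected() {
	c.workerICE.ResetSessionID()
	c.workerICE.Close()
}
```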
b0b9b85 to
c0b3e4e
Compare



Summary
Fix peer reconnection loops after a network event (PPPoE reconnect, NAT/conntrack flush, IP rotation) leaves transport state stale on either the relay or ICE path. Both manifest as the WireGuard tunnel going silent while the client UI reports `Connected`, with the connection never recovering on its own.

This PR addresses two distinct but related failure modes observed in production on a peer behind PPPoE NAT (Raspberry Pi) talking to a peer with a public IP (Oracle Cloud).
Problem 1 - Relay path: stale `ErrConnAlreadyExists` reuse

When the peer is running over relay and a network event invalidates the existing relay session, `WorkerRelay.OnNewOffer` calls `relayManager.OpenConn`, which returns `ErrConnAlreadyExists` because the relay client still holds a map entry for the peer key. The previous behavior was to silently bail out and reuse that entry - which is dead.

Fix (commits 1–4):
- On `ErrConnAlreadyExists`, close the existing relay conn and reopen it.
- Route `CloseConnByPeerKey` to the correct relay client (home vs. foreign server) - addresses CodeRabbit feedback (see the sketch below).
- Gate the teardown behind a `relayConnStale` `atomic.Bool` flag so we only tear down when something has signaled that the entry is no longer backed by a live peer session. Without this gate, rapid successive offers from the remote peer cause an infinite tear-down/rebuild loop (observed: 3608 cycles in ~1 hour, ~9 Mbit/s of constant traffic).
- The flag is set by `conn.onWGDisconnected` (Relay path), `WorkerRelay.CloseConn`, `WorkerRelay.onRelayClientDisconnected`.
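A rough sketch of the home-vs-foreign routing, assuming a Manager that tracks a home relay client plus foreign relay clients keyed by server address; this layout is modelled on the review note about shared/relay/client/manager.go, not the actual struct:

```go
// Sketch only: route CloseConnByPeerKey to the relay client that holds the entry.
package client

import "sync"

// RelayClient is a stand-in for the real relay client holding per-peer conns.
type RelayClient struct{}

// CloseConnByPeerKey closes and removes the conn stored under peerKey.
func (c *RelayClient) CloseConnByPeerKey(peerKey string) error {
	// ... look up, close and delete the relayed conn ...
	return nil
}

type relayTrack struct{ relayClient *RelayClient }

type Manager struct {
	mu           sync.Mutex
	relayClient  *RelayClient           // connection to the home relay server
	relayClients map[string]*relayTrack // foreign relay servers, keyed by address
}

// CloseConnByPeerKey routes the close to the relay client that actually holds
// the stale entry: the foreign relay tracked under serverAddress if present,
// otherwise the home relay client.
func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if t, ok := m.relayClients[serverAddress]; ok && t.relayClient != nil {
		return t.relayClient.CloseConnByPeerKey(peerKey)
	}
	if m.relayClient == nil {
		return nil // not connected to a home relay; nothing to close
	}
	return m.relayClient.CloseConnByPeerKey(peerKey)
}
```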
Problem 2 - ICE path: session ID not rotated on WG timeout

When WireGuard handshake times out while running over ICE, `onWGDisconnected` calls `workerICE.Close()`. `Close()` sets `w.agent = nil` synchronously before pion's ICE library fires the asynchronous `ConnectionStateClosed` event. By the time `onConnectionStateChange` runs `closeAgent()`, the `w.agent == agent` guard fails (`w.agent` is already `nil`) and the session ID is not rotated.

Without rotation, the next ICE offer carries the same local session ID. The remote peer in `OnNewOffer` compares `remoteSessionID` against the incoming offer's `SessionID`, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state (see the sketch below).

Observed in production: 30s reconnect loop with the same offer session ID logged 70+ times in a row, "ICE connection succeeded" on every cycle, WG handshake never recovering. UI shows `Connected, P2P` while the last successful WG handshake is 40+ minutes old.
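For illustration, a minimal sketch of the remote-side check that the rotation defeats; the Offer/WorkerICE shapes here are assumptions, not the real worker_ice.go:

```go
// Sketch only: an unchanged session ID short-circuits agent recreation.
package peer

// Offer carries the sender's ICE session ID (heavily simplified).
type Offer struct {
	SessionID uint64
}

type WorkerICE struct {
	remoteSessionID uint64
	agent           *struct{} // stand-in for the pion ICE agent
}

// OnNewOffer reuses the existing agent when the offer carries the session ID
// already on record; only a changed (rotated) session ID forces recreation.
func (w *WorkerICE) OnNewOffer(offer Offer) {
	if w.agent != nil && offer.SessionID == w.remoteSessionID {
		// Same session ID as last time: keep the current agent and its
		// candidates. After a NAT event those candidates are stale, which is
		// the failure mode described above.
		return
	}
	w.remoteSessionID = offer.SessionID
	w.agent = new(struct{}) // recreate the agent and gather fresh candidates
}
```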
Fix (commit 5): Rotate the session ID explicitly in `onWGDisconnected` for the ICE case (mirroring the existing Relay-path behavior we added in commit 1) so the remote peer always recreates its ICE agent after a WG timeout on ICE.
Test plan

- `go build ./...` passes
- Manual reconnect exercised with `netbird down`/`netbird up`
- Real PPPoE reconnect + IP rotation reproduction (waiting for the next nightly PPPoE cycle)

Notes
The two problems share the same trigger (a network event invalidates per-peer state without invalidating the local control plane) but live in different transports, so the fixes are orthogonal. Each is necessary on its own: the relay fix does not help when the stale state lives in the ICE session ID, and the ICE fix does not help when a dead relay map entry keeps being reused.
Both fixes are strictly defensive - they can only fire on `ErrConnAlreadyExists` (relay) or after a WG handshake timeout (ICE), so there's no behavior change on the happy path.