fix: close stale relay connection on ErrConnAlreadyExists to recover … #5866
Draft: fpenezic wants to merge 5 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery
Commits
46a5a7d fix: close stale relay connection on ErrConnAlreadyExists to recover … (fpenezic)
0f1fc4f fix: route CloseConnByPeerKey to correct relay client for foreign ser… (fpenezic)
159183b docs: explain Guard invariant for ErrConnAlreadyExists branch (fpenezic)
691331a fix: gate ErrConnAlreadyExists close-and-retry behind stale flag (fpenezic)
c0b3e4e fix: rotate ICE session ID on WG timeout in ICE mode (fpenezic)
Conversations
coderabbitai (bot) commented:
`ErrConnAlreadyExists` does not prove a relay connection is stale. This error only indicates that an entry exists in the relay client's connection map; it carries no information about connection health or staleness. With the unconditional `CloseConnByPeerKey` call, a healthy relay connection from a prior handshake can be forcibly torn down when a new offer arrives and encounters the same peer ID already in the map. Add a disconnected/stale state check before closing, or ensure `OnNewOffer` cannot run while a live relay connection for this peer exists.
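For readers skimming the diff, the branch under discussion looks roughly like this (a sketch: only `OpenConn`, `CloseConnByPeerKey`, and `ErrConnAlreadyExists` are names from this thread; the `relayOpener` stand-in and everything around it are illustrative):

```go
package relaysketch

import (
	"context"
	"errors"
	"log"
	"net"
)

// ErrConnAlreadyExists mirrors the sentinel error discussed above; the
// real one lives in netbird's relay client package.
var ErrConnAlreadyExists = errors.New("connection already exists")

// relayOpener is an illustrative stand-in for the relay manager surface.
type relayOpener interface {
	OpenConn(ctx context.Context, peerKey string) (net.Conn, error)
	CloseConnByPeerKey(peerKey string) error
}

// openWithRetry shows the unconditional close-and-retry branch this
// review flags: on ErrConnAlreadyExists, tear the existing entry down
// and reopen, even if the existing conn is still healthy.
func openWithRetry(ctx context.Context, m relayOpener, peerKey string) (net.Conn, error) {
	conn, err := m.OpenConn(ctx, peerKey)
	if !errors.Is(err, ErrConnAlreadyExists) {
		return conn, err
	}
	if cerr := m.CloseConnByPeerKey(peerKey); cerr != nil {
		log.Printf("close existing relay conn: %v", cerr)
	}
	return m.OpenConn(ctx, peerKey) // retry against a clean map entry
}
```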
fpenezic replied:
Thanks for flagging. The behavior is intentional - reasoning:

`WorkerRelay.OnNewOffer` is only invoked when the remote peer sends a fresh offer through the Guard component. Guard only sends offers after it has detected a disconnect (WG handshake timeout, ICE disconnect, etc.). If we're receiving a new offer, the remote peer already believes the existing connection is broken. Given that:

- The remote peer has effectively invalidated the connection by initiating a re-handshake. Reusing our side's stale conn entry (the original bug) leaves WG packets going into a dead pipe. Tearing it down and opening a fresh one matches the remote peer's state.
- Hitting `ErrConnAlreadyExists` after an offer is the pathological case we're fixing. Under normal operation, `OpenConn` doesn't collide - each offer/answer cycle happens on a fresh session. The only time we hit `ErrConnAlreadyExists` is when the map wasn't cleaned up (e.g. a remote NAT IP change without a relay-level close event reaching us).
- Even in the theoretical race (both peers send offers simultaneously and one has just established a healthy conn that we then close), the cost is one extra round trip: the next offer will re-establish. Compare that to the current failure mode: indefinite tunnel death until manual restart.

A liveness check on the relay conn isn't straightforward - the connection is a proxied UDP pipe with no heartbeat of its own. The authoritative liveness signal is the WG handshake, which `WGWatcher` already owns. Adding a separate liveness probe on the relay conn would duplicate that logic.

If the concern is the simultaneous-offer race, a cleaner guard would be per-peer serialization of `OnNewOffer` - but that's orthogonal, and the current race-recovery cost (one extra offer cycle) seems acceptable vs. the bug being fixed.

Can add a code comment explaining this if that would help future readers.
coderabbitai (bot) replied:
@fpenezic, thank you for the thorough explanation — the reasoning is solid and I withdraw the concern.

The key invariant I missed is that `OnNewOffer` is gated by Guard's disconnect detection, so by the time a new offer arrives with a colliding peer ID, the existing relay conn is stale by contract, not just by inference. The simultaneous-offer edge case has a bounded, self-healing cost (one extra offer cycle), which is clearly preferable to indefinite tunnel death.

A brief inline comment explaining the Guard invariant would definitely help future readers avoid the same confusion — something like the sketch below.
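A possible wording (the snippet originally attached here wasn't preserved, so this is a reconstruction from the reasoning above):

```go
// Guard invariant: OnNewOffer only runs for offers that Guard emits
// after it has detected a disconnect (WG handshake timeout, ICE
// disconnect, ...). By the time an offer collides with an existing
// entry in the relay conn map (ErrConnAlreadyExists), that conn is
// stale by contract - the remote peer has already abandoned it by
// re-offering - so closing it and reopening is safe. Worst case, a
// simultaneous-offer race costs one extra offer cycle.
```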
fpenezic replied:
Reopening this - my "Guard invariant" argument was wrong. `OnNewOffer` fires on normal offer/answer exchanges too, not only after disconnect detection. Extended testing showed an infinite reconnect loop (3600+ iterations/hour) in multi-peer scenarios where your original concern manifested exactly as you described - healthy connections being torn down.

Converting to draft while I add proper staleness detection (cooldown or liveness check). Thanks for catching this early - the "bounded, self-healing cost" claim I made was also wrong; the cost compounds into a persistent loop.
coderabbitai (bot) replied:
@fpenezic, thanks for following up — glad you caught this before it landed. Removing the incorrect learning now.

For staleness detection, a few concrete approaches to consider:

Option A — Explicit stale flag (cleanest)
Add an `atomic.Bool relayConnStale` field to `WorkerRelay`. WGWatcher/Guard sets it `true` when it detects a disconnect, before queuing the new offer. `OnNewOffer` checks the flag and only calls `CloseConnByPeerKey` when it is set, then clears it. This makes the "stale by contract" invariant explicit in code rather than assumed.

Option B — Health probe on the existing conn
Before closing, attempt a zero-byte write or read with a short deadline on the existing `relayedConn`. If it fails immediately, it's stale. Adds a small RTT delay on the unhappy path but avoids tearing down healthy connections.

Option C — Session generation counter
Increment a generation counter each time Guard detects a disconnect and pass it through the offer. `OnNewOffer` closes the existing conn only when the generation in the offer is newer than the one used to open the current conn. Requires a protocol change but is the most robust.

Option A is probably the least invasive — it keeps all the new plumbing inside `WorkerRelay`/`Conn` without touching the relay client or the wire protocol.
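A minimal sketch of Option A, using the names from this thread (`WorkerRelay`, `relayConnStale`, `OnNewOffer`, `CloseConnByPeerKey`); the surrounding wiring is illustrative, not the actual netbird code:

```go
package relaysketch

import "sync/atomic"

// WorkerRelay is reduced to the one field this sketch needs.
type WorkerRelay struct {
	relayConnStale atomic.Bool // set by WGWatcher/Guard on disconnect
}

// markRelayConnStale is the hook WGWatcher/Guard would call when it
// detects a disconnect, before queuing the new offer.
func (w *WorkerRelay) markRelayConnStale() {
	w.relayConnStale.Store(true)
}

// shouldCloseExisting gates the close-and-retry path in OnNewOffer:
// CompareAndSwap both tests and clears the flag in one step, so a
// healthy conn (flag unset) is never torn down.
func (w *WorkerRelay) shouldCloseExisting() bool {
	return w.relayConnStale.CompareAndSwap(true, false)
}
```

Usage inside `OnNewOffer` would be: call `CloseConnByPeerKey` only when `shouldCloseExisting()` returns true, then proceed with `OpenConn` as before.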
fpenezic replied:
Thanks for the concrete options - went through all three.

Option A (explicit stale flag) - cleanest surface, but has a gap: if the remote peer experiences a NAT IP change and re-sends an offer before our side detects the disconnect locally (e.g. our WG handshake watcher hasn't timed out yet), the flag will be false on our end and we'd skip the close - leaving us with the original bug. The flag correctly covers the case where we initiated the recovery (`onWGDisconnected` with `priority == Relay`), but not the case where the remote side's state changed and ours is lagging behind.

Option B (health probe) - I don't think this works for our setup. The relay conn is a proxied UDP pipe; a zero-byte read blocks until a packet arrives, and a write to the local socket will always succeed regardless of whether the relay server still has a live pipe to the remote peer. There's no application-layer liveness signal on the relay conn itself - the authoritative signal is the WG handshake, which is what the original bug is about. Happy to be corrected if I'm missing a cheap probe.

Option C (generation counter) - agreed this is the most robust, but it's a wire-protocol change that would need coordination across signal/management and older clients. Too broad for this PR.

Proposal - hybrid A + per-peer serialization:

- Add a `relayConnStale atomic.Bool` on `WorkerRelay`, set in two places: `Conn.onWGDisconnected` when `currentConnPriority == Relay` (we detected the disconnect), and the explicit `CloseConn()` path.
- `OnNewOffer` only takes the close-and-retry path when the flag is set, then clears it.
- A per-peer mutex around `OnNewOffer` so rapid back-to-back offers from the remote can't overlap with an in-flight close/reopen - the second offer waits for the first to finish, sees the fresh conn, and `OpenConn` returns an existing-but-healthy conn which we can reuse.
- A short (5s) cooldown before the flag can be set `true` again after a successful reopen.

This trades off completeness (we no longer auto-recover from a remote-initiated state change we haven't noticed yet) for correctness (we stop tearing down healthy connections). The original bug was narrower than I framed it - it's specifically about our side holding a stale entry after the remote went through a transport-layer change we should have observed but didn't. The flag captures exactly that window.

Does this match what you had in mind for Option A, or were you thinking of a different signal source for setting the flag?
coderabbitai (bot) replied:
@fpenezic, the analysis is sharp and the proposal is sound. A few thoughts:

On the gap in Option A: you're right that it's real, but it's acceptable. In the remote-initiated NAT change where our WGWatcher hasn't fired yet, the worst case is now "delayed recovery (WGWatcher timeout, typically ~3 min)" instead of "indefinite tunnel death." WGWatcher will eventually fire, set the flag, and the next offer cycle recovers. That's a meaningful improvement over the status quo.

Additional stale-flag signal: `onRelayClientDisconnected` → `onRelayDisconnected` is another natural place to set the flag. When the relay server does send a close event (the happy path), that definitively marks the conn as stale — no WGWatcher timeout needed. This closes part of the gap without adding complexity:
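(The snippet originally attached here wasn't preserved; a sketch of the idea, extending the `WorkerRelay` sketch above — `onRelayClientDisconnected` and the `go w.conn.onRelayDisconnected()` dispatch are names from this thread, the rest is illustrative:)

```go
// Relay server sent a close event for this peer: the conn is stale by
// definition, no WGWatcher timeout required.
func (w *WorkerRelay) onRelayClientDisconnected() {
	w.relayConnStale.Store(true)
	go w.conn.onRelayDisconnected() // existing dispatch, per the thread
}
```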
This way the flag is set by three events: our WGWatcher fires, the relay server sends a close, or `CloseConn()` is called directly.

On the cooldown: if the flag is always cleared atomically inside `OnNewOffer` immediately after a successful reopen, there's no stuck-flag scenario — the only way it stays `true` is if `OpenConn` fails, in which case the conn didn't get re-established anyway. I'd skip the 5s cooldown and rely on compareAndSwap-style clear-on-success instead.

On the per-peer mutex: worth checking whether `OnNewOffer` is already serialized by its caller. If the signal handler dispatches offers for a given peer on a single goroutine, the mutex is redundant. If offers from different signal messages can race (e.g., separate goroutines), the mutex is necessary. Do you know the call chain? If it's already serialized, you can drop step 3 entirely and simplify.

Overall the hybrid A approach is the right call — it's correct (no healthy-conn teardown) and recovers from the common case (we detect the disconnect first) without the protocol complexity of Option C.
fpenezic replied:
Verified the call chain - `AsyncOfferListener` already serializes per-peer (one goroutine at a time; newer offers during processing just update `latest` and run after the current one completes). So step 3 is redundant as you suspected. Dropping the mutex.

Final plan (sketched below):

- `relayConnStale atomic.Bool` on `WorkerRelay`, set by three events:
  - `Conn.onWGDisconnected` (`priority == Relay` branch) - local WG watcher fired
  - `onRelayClientDisconnected` - relay server sent a close event
  - `WorkerRelay.CloseConn()` - any explicit teardown
- `OnNewOffer`: close-and-retry only when the flag is set, clear on successful reopen (compareAndSwap)

The tradeoff: a remote-initiated NAT change where our WG watcher hasn't fired yet means delayed recovery (~WG handshake timeout) instead of tunnel death. Net improvement over current master. Pushing shortly.
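A compact sketch of that plan (`relayConnStale`, `OnNewOffer`, `OpenConn`, and `CloseConnByPeerKey` are names from the thread; the struct fields and the `relayDialer` stand-in are illustrative):

```go
package relaysketch

import (
	"context"
	"log"
	"net"
	"sync/atomic"
)

type relayDialer interface {
	OpenConn(ctx context.Context, peerKey string) (net.Conn, error)
	CloseConnByPeerKey(peerKey string) error
}

type WorkerRelay struct {
	ctx            context.Context
	peerKey        string
	relay          relayDialer
	relayConnStale atomic.Bool // Store(true) by the three signals above
}

func (w *WorkerRelay) OnNewOffer() {
	if w.relayConnStale.Load() {
		// A disconnect was positively observed (WG watcher fired, relay
		// sent a close event, or explicit CloseConn): drop the stale
		// map entry so OpenConn gets a clean slot.
		if err := w.relay.CloseConnByPeerKey(w.peerKey); err != nil {
			log.Printf("close stale relay conn: %v", err)
		}
	}
	conn, err := w.relay.OpenConn(w.ctx, w.peerKey)
	if err != nil {
		// Leave the flag untouched: if it was set, the next offer
		// retries the close-and-reopen path.
		log.Printf("open relay conn: %v", err)
		return
	}
	// Clear only after a successful reopen. A plain Store is race-free
	// here because AsyncOfferListener serializes OnNewOffer per peer
	// and the other writers only ever Store(true).
	w.relayConnStale.Store(false)
	_ = conn // hand off to the WG proxy / conn-ready callback
}
```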
coderabbitai (bot) replied:
@fpenezic, the plan is solid. A few notes for the implementation:

On the CAS: since `AsyncOfferListener` already serializes `OnNewOffer` per peer, there's no concurrent accessor on `relayConnStale` from that direction — a plain `Load()`/`Store(false)` is race-free and equivalent to CAS here. The only concurrent writers are `onRelayClientDisconnected` (goroutine) and `CloseConn()` (called from various places), but those are write-only (always `Store(true)`), so a torn read isn't possible. No need for the overhead of a CAS loop.

One edge case to double-check: `onRelayClientDisconnected` spawns `go w.conn.onRelayDisconnected()` — so the flag set and the offer dispatch are on separate goroutines. Since `AsyncOfferListener` already handles the "offer arrives before flag is set" race by queuing and re-running, this shouldn't be an issue, but worth a quick trace to confirm the ordering is guaranteed on that path.

`CloseConn()` as a signal source: note that `CloseConn()` is also called inside `onRelayConnectionIsReady` if `AddCloseListener` fails. That path will correctly mark the conn stale before returning, which is the right behavior.

Overall the plan is correct and the tradeoff (delayed recovery on a remote-side NAT change we haven't observed yet, vs. an infinite reconnect loop) is clearly the right call. Looking forward to the push.