fix(dns): skip probe for NetBird overlay addresses at engine startup #5625
sbeaudry-qms wants to merge 3 commits into netbirdio:main
Conversation
Nameservers configured with overlay IPs (100.64.0.0/10 CGNAT range) are only reachable once the WireGuard tunnel is established. Probing them at engine startup always fails with a false-positive warning even though they become available moments later. Skip the startup probe for any nameserver in the NetBird CGNAT range. They continue to be exercised by real DNS queries and the waitUntilResponse retry loop if an actual failure occurs.
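As a quick aside on the mechanism, here is a standalone illustration of the range check involved. This is not the PR's code verbatim (the actual helper appears in the review diffs below), just the `net/netip` primitives it builds on:

```go
package main

import (
	"fmt"
	"net/netip"
)

func main() {
	// The CGNAT block NetBird assigns overlay peer IPs from.
	cgnat := netip.MustParsePrefix("100.64.0.0/10")
	fmt.Println(cgnat.Contains(netip.MustParseAddr("100.92.3.7"))) // true: overlay range
	fmt.Println(cgnat.Contains(netip.MustParseAddr("10.0.0.53")))  // false: ordinary private range
}
```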
📝 Walkthrough

Adds NetBird CGNAT overlay detection (100.64.0.0/10) to skip probing those addresses, makes probing context-aware by passing base/external contexts, and introduces sync.WaitGroup and mutex synchronization for coordinated shutdown. Tests were added to validate overlay skipping and probe behavior across scenarios.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

Important: Merge conflicts detected (Beta)
Actionable comments posted: 1
🧹 Nitpick comments (1)
client/internal/dns/upstream.go (1)
445-458: The `externalCtx` parameter is unused in this PR.

The `testNameserver` signature was updated to accept an `externalCtx` for cancellation coordination, but all call sites pass `nil`:

- Line 319: `u.testNameserver(u.ctx, nil, upstream, ...)`
- Line 380: `u.testNameserver(u.ctx, nil, upstream, ...)`

If this is preparation for future functionality, consider adding a brief comment. Otherwise, the simpler single-context signature would suffice for now.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 445 - 458, The externalCtx parameter in upstreamResolverBase.testNameserver is unused (all callers pass nil); remove the unused externalCtx parameter from the testNameserver signature and update all callers (e.g., the u.testNameserver(...) invocations that currently pass nil) to call the two-argument version (baseCtx, server, timeout) so the function uses only the merged context created with context.WithTimeout; alternatively, if you intended to support an external cancel context, add a brief comment in testNameserver explaining why externalCtx is accepted and wire real non-nil contexts from callers—prefer removing the unused parameter to keep the API minimal.
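For readers weighing the two options, here is a minimal sketch of the cancellation-merging pattern that a `(baseCtx, externalCtx)` signature implies. `probeCtx` and its wiring are hypothetical, not the upstream.go API:

```go
package main

import (
	"context"
	"time"
)

// probeCtx derives the probe context from baseCtx with a timeout, and
// additionally cancels it if the optional externalCtx is done first.
func probeCtx(baseCtx, externalCtx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithTimeout(baseCtx, timeout)
	if externalCtx == nil { // current call sites pass nil, so this is the only path exercised
		return ctx, cancel
	}
	go func() {
		select {
		case <-externalCtx.Done():
			cancel() // propagate external cancellation into the probe
		case <-ctx.Done(): // probe finished or timed out on its own
		}
	}()
	return ctx, cancel
}

func main() {
	ctx, cancel := probeCtx(context.Background(), nil, 50*time.Millisecond)
	defer cancel()
	<-ctx.Done() // expires via the timeout
}
```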
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 129-136: The Stop() method currently holds u.mutex then calls
u.wg.Wait(), causing a deadlock with goroutines spawned by disable() that call
waitUntilResponse() and need u.mutex; fix by releasing the lock before waiting
on the WaitGroup: move u.wg.Wait() to happen before acquiring u.mutex (or simply
unlock immediately after calling u.cancel() and call u.mutex.Lock()/Unlock()
only for protecting shared state modifications), ensuring
upstreamResolverBase.Stop() calls u.cancel() first, then waits on u.wg, and only
locks u.mutex when reading/modifying u.upstreamServers or other shared
fields—this removes the lock while waiting and prevents the deadlock between
Stop(), waitUntilResponse(), and the goroutine from disable().
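For reference, a runnable sketch of the shutdown ordering this comment recommends; `resolverSketch` is a stand-in using the field names from the comment, not the real `upstreamResolverBase`:

```go
package main

import (
	"context"
	"sync"
)

// resolverSketch mirrors only the fields named in the comment above.
type resolverSketch struct {
	cancel          context.CancelFunc
	wg              sync.WaitGroup
	mutex           sync.Mutex
	upstreamServers []string
}

// Stop demonstrates the recommended ordering: cancel first, wait on the
// WaitGroup with no lock held (so workers that need the mutex can exit),
// then lock only around the shared-state mutation.
func (u *resolverSketch) Stop() {
	u.cancel()  // unblock goroutines selecting on the context
	u.wg.Wait() // no lock held here, so waitUntilResponse-style workers can finish

	u.mutex.Lock()
	defer u.mutex.Unlock()
	u.upstreamServers = nil
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	r := &resolverSketch{cancel: cancel}
	r.Stop()
}
```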
---
Nitpick comments:
In `@client/internal/dns/upstream.go`:
- Around line 445-458: The externalCtx parameter in
upstreamResolverBase.testNameserver is unused (all callers pass nil); remove the
unused externalCtx parameter from the testNameserver signature and update all
callers (e.g., the u.testNameserver(...) invocations that currently pass nil) to
call the two-argument version (baseCtx, server, timeout) so the function uses
only the merged context created with context.WithTimeout; alternatively, if you
intended to support an external cancel context, add a brief comment in
testNameserver explaining why externalCtx is accepted and wire real non-nil
contexts from callers—prefer removing the unused parameter to keep the API
minimal.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 34bf3e38-39f8-44f2-a134-6b6863e5e8d3
📒 Files selected for processing (2)
- client/internal/dns/upstream.go
- client/internal/dns/upstream_test.go
…it() in Stop() to prevent deadlock
Regarding the CodeRabbit nitpick about the externalCtx parameter: it was pre-existing, not introduced by this PR, so I'm not addressing it here. Cheers,
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 305-310: The code currently treats NetBird overlay addresses as
successful probes by setting success = true inside the
isNetBirdOverlayAddr(upstream) branch; remove the
mu.Lock()/success=true/mu.Unlock() lines so skipped overlay servers are simply
continued without marking success. Update the same change at the other
occurrence noted (the similar block around lines 340-341) so success is only set
when an actual probe succeeds and remains protected by mu where real success
assignments occur.
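A self-contained sketch of the success-flag semantics this comment asks for; `probe`, the upstream list, and the mutex wiring here are stand-ins, not the PR's code:

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

var netbirdCGNATRange = netip.MustParsePrefix("100.64.0.0/10")

// probe stands in for the real nameserver check; failing unconditionally
// keeps the example deterministic.
func probe(upstream netip.AddrPort) error { return fmt.Errorf("unreachable: %s", upstream) }

func main() {
	upstreams := []netip.AddrPort{
		netip.MustParseAddrPort("100.80.0.1:53"), // overlay IP: skipped, not counted as success
		netip.MustParseAddrPort("9.9.9.9:53"),    // probed for real
	}

	var mu sync.Mutex
	success := false
	for _, upstream := range upstreams {
		if netbirdCGNATRange.Contains(upstream.Addr()) {
			continue // skip without touching success, as the comment asks
		}
		if err := probe(upstream); err == nil {
			mu.Lock()
			success = true // only a genuine probe result marks the group healthy
			mu.Unlock()
		}
	}
	fmt.Println("success:", success) // false: the only real upstream failed
}
```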
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bcce1491-fad9-417b-aa19-423b8367575a
📒 Files selected for processing (1)
client/internal/dns/upstream.go
…tead of success=true for skipped overlay servers
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream_test.go`:
- Around line 598-612: The tests use context.Background() for
upstreamResolverBase so when ProbeAvailability triggers deactivate() the
background context leaves waitUntilResponse() running and leaks goroutines; fix
both tests by creating the resolver with a cancellable context via
context.WithCancel (capture cancel) and register t.Cleanup(func(){ cancel();
resolver.Stop() }) or at minimum t.Cleanup(resolver.Stop) so the retry goroutine
started by waitUntilResponse() is stopped; update the tests that construct
upstreamResolverBase and the mixed-overlay failure test to use
context.WithCancel and t.Cleanup(resolver.Stop).
In `@client/internal/dns/upstream.go`:
- Around line 48-59: The current isNetBirdOverlayAddr/netbirdCGNATRange check
incorrectly treats every 100.64.0.0/10 resolver as a NetBird peer; change
isNetBirdOverlayAddr (and any callers) to only skip probing when the address is
both in netbirdCGNATRange AND is positively associated with a NetBird peer/route
(e.g., consult the local peer-to-IP mapping or routing table used by the engine
before returning true), otherwise return false so
DefaultServer.ProbeAvailability and engine startup will probe it normally;
update callers at the other noted sites (around lines ~303-310 and ~339-343) to
use the revised predicate; also add a regression test that creates a non-NetBird
nameserver in 100.64.0.0/10 and asserts it is probed and that groups are
disabled/emits the expected startup warning when no real upstream responds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ef0deda3-7d57-4bd0-b618-5f4f91f1341d
📒 Files selected for processing (2)
- client/internal/dns/upstream.go
- client/internal/dns/upstream_test.go
```go
deactivated := false
resolver := &upstreamResolverBase{
	ctx:             context.Background(),
	upstreamClient:  trackingClient,
	upstreamServers: []netip.AddrPort{lanUpstream},
	upstreamTimeout: UpstreamTimeout,
	deactivate:      func(error) { deactivated = true },
	reactivate:      func() {},
}

resolver.ProbeAvailability()

assert.True(t, probed, "LAN address should have been probed")
assert.True(t, deactivated, "resolver should have been deactivated when LAN probe fails")
```
These failure-path tests leave the retry loop running.
Both resolvers use context.Background(). After ProbeAvailability() fails, disable() starts waitUntilResponse(), and that goroutine never exits because the mock never recovers and nothing cancels the context. That leaks work into the rest of the suite.
Use context.WithCancel here, set cancel, and register t.Cleanup(resolver.Stop) in both tests.
🧹 Suggested cleanup

```diff
-	resolver := &upstreamResolverBase{
-		ctx:             context.Background(),
+	ctx, cancel := context.WithCancel(context.Background())
+	resolver := &upstreamResolverBase{
+		ctx:             ctx,
+		cancel:          cancel,
 		upstreamClient:  trackingClient,
 		upstreamServers: []netip.AddrPort{lanUpstream},
 		upstreamTimeout: UpstreamTimeout,
 		deactivate:      func(error) { deactivated = true },
 		reactivate:      func() {},
 	}
+	t.Cleanup(resolver.Stop)
```

Apply the same pattern to the mixed-overlay failure test below.
Also applies to: 694-706
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@client/internal/dns/upstream_test.go` around lines 598 - 612, The tests use
context.Background() for upstreamResolverBase so when ProbeAvailability triggers
deactivate() the background context leaves waitUntilResponse() running and leaks
goroutines; fix both tests by creating the resolver with a cancellable context
via context.WithCancel (capture cancel) and register t.Cleanup(func(){ cancel();
resolver.Stop() }) or at minimum t.Cleanup(resolver.Stop) so the retry goroutine
started by waitUntilResponse() is stopped; update the tests that construct
upstreamResolverBase and the mixed-overlay failure test to use
context.WithCancel and t.Cleanup(resolver.Stop).
```go
// netbirdCGNATRange is the CGNAT range (100.64.0.0/10) used by NetBird for
// overlay peer IP addresses. Addresses in this range are only reachable once
// the WireGuard tunnel to the peer is fully established.
var netbirdCGNATRange = netip.MustParsePrefix("100.64.0.0/10")

// isNetBirdOverlayAddr returns true if the given address is within the NetBird
// CGNAT range (100.64.0.0/10). Such addresses are only reachable after the
// WireGuard tunnel is established, so probing them at engine startup will
// always produce a false-positive "unable to reach" warning.
func isNetBirdOverlayAddr(addr netip.AddrPort) bool {
	return netbirdCGNATRange.Contains(addr.Addr())
}
```
Do not treat every 100.64.0.0/10 resolver as a NetBird peer.
100.64.0.0/10 is shared CGNAT space, not NetBird-exclusive. With the current check, any resolver in that block is skipped, so probedAny stays false and the group never disables or emits the startup warning even if that address is just a local/ISP resolver that's actually down. DefaultServer.ProbeAvailability() and engine startup both rely on this path to disable groups when no real upstream responds.
At minimum, only skip when you can positively map the address to a NetBird peer/route; otherwise keep the normal probe behavior. Please also add a regression case for a non-NetBird 100.64/10 nameserver.
💡 Safer gating direction

```diff
-func isNetBirdOverlayAddr(addr netip.AddrPort) bool {
-	return netbirdCGNATRange.Contains(addr.Addr())
+func (u *upstreamResolverBase) shouldSkipStartupProbe(addr netip.AddrPort) bool {
+	if !netbirdCGNATRange.Contains(addr.Addr()) {
+		return false
+	}
+	return findPeerForIP(addr.Addr(), u.statusRecorder) != nil
 }
@@
-	if isNetBirdOverlayAddr(upstream) {
+	if u.shouldSkipStartupProbe(upstream) {
 		log.Debugf("skipping probe for NetBird overlay nameserver %s (tunnel not yet established)", upstream)
 		continue
 	}
```

Also applies to: 303-310, 339-343
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@client/internal/dns/upstream.go` around lines 48 - 59, The current
isNetBirdOverlayAddr/netbirdCGNATRange check incorrectly treats every
100.64.0.0/10 resolver as a NetBird peer; change isNetBirdOverlayAddr (and any
callers) to only skip probing when the address is both in netbirdCGNATRange AND
is positively associated with a NetBird peer/route (e.g., consult the local
peer-to-IP mapping or routing table used by the engine before returning true),
otherwise return false so DefaultServer.ProbeAvailability and engine startup
will probe it normally; update callers at the other noted sites (around lines
~303-310 and ~339-343) to use the revised predicate; also add a regression test
that creates a non-NetBird nameserver in 100.64.0.0/10 and asserts it is probed
and that groups are disabled/emits the expected startup warning when no real
upstream responds.
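A hedged sketch of the requested regression coverage; `shouldSkipStartupProbe` and the `knownPeers` map are illustrative stand-ins for the suggested peer-aware predicate, not code that exists in the PR:

```go
package main

import (
	"net/netip"
	"testing"
)

var netbirdCGNATRange = netip.MustParsePrefix("100.64.0.0/10")

// shouldSkipStartupProbe mirrors the gating suggested above: skip only when
// the address is in the CGNAT range AND maps to a known NetBird peer. The
// knownPeers set stands in for the engine's peer-to-IP mapping.
func shouldSkipStartupProbe(addr netip.AddrPort, knownPeers map[netip.Addr]bool) bool {
	return netbirdCGNATRange.Contains(addr.Addr()) && knownPeers[addr.Addr()]
}

func TestNonNetBirdCGNATResolverIsStillProbed(t *testing.T) {
	knownPeers := map[netip.Addr]bool{
		netip.MustParseAddr("100.80.0.1"): true, // an actual NetBird peer
	}

	netbirdDNS := netip.MustParseAddrPort("100.80.0.1:53") // overlay peer: skip at startup
	ispCGNAT := netip.MustParseAddrPort("100.99.0.53:53")  // ISP CGNAT resolver: must be probed

	if !shouldSkipStartupProbe(netbirdDNS, knownPeers) {
		t.Error("NetBird overlay nameserver should be skipped at startup")
	}
	if shouldSkipStartupProbe(ispCGNAT, knownPeers) {
		t.Error("non-NetBird CGNAT nameserver must still be probed")
	}
}
```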
Superseded by #5971, which takes a different approach: rather than skipping probes for overlay IPs, it drops the synthetic probe entirely and derives group health from real query outcomes. Thanks for putting this together; the problem statement in your description informed the design.



Describe your changes
When a nameserver is configured with a NetBird overlay IP (in the `100.64.0.0/10` CGNAT range), `ProbeAvailability()` is called at engine startup before the WireGuard tunnel to that peer is established. The probe always fails at this point, not because the nameserver is down, but because the tunnel isn't up yet. This produces a spurious "Unable to reach one or more DNS servers" warning in `netbird status -d` even though DNS resolves correctly once the connection settles.

This change skips the startup probe for any nameserver whose IP falls within the NetBird CGNAT range. A debug log message is emitted instead. These nameservers continue to be exercised normally by real DNS queries, and the `waitUntilResponse` retry loop still handles genuine failures.

Non-overlay nameservers (LAN IPs, public DNS) are unaffected and continue to be probed as before.
Issue ticket number and link
N/A — no existing issue. This addresses a known annoyance discussed in the NetBird community forum: https://forum.netbird.io/t/unable-to-reach-one-or-more-dns-servers/204
Stack
Checklist
Documentation
This is a bug fix for a false-positive warning with no user-facing behaviour change. DNS resolution is unaffected.
Summary by CodeRabbit
Bug Fixes
Tests