Skip to content

fix(dns): skip probe for NetBird overlay addresses at engine startup#5625

Closed
sbeaudry-qms wants to merge 3 commits intonetbirdio:mainfrom
sbeaudry-qms:fix/dns-skip-overlay-probe
Closed

fix(dns): skip probe for NetBird overlay addresses at engine startup#5625
sbeaudry-qms wants to merge 3 commits intonetbirdio:mainfrom
sbeaudry-qms:fix/dns-skip-overlay-probe

Conversation

@sbeaudry-qms
Copy link
Copy Markdown

@sbeaudry-qms sbeaudry-qms commented Mar 18, 2026

Describe your changes

When a nameserver is configured with a NetBird overlay IP (in the 100.64.0.0/10 CGNAT range), ProbeAvailability() is called at engine startup before the WireGuard tunnel to that peer is established. The probe always fails at this point — not because the nameserver is down, but because the tunnel isn't up yet. This produces a spurious "Unable to reach one or more DNS servers" warning in netbird status -d even though DNS resolves correctly once the connection settles.

This change skips the startup probe for any nameserver whose IP falls within the NetBird CGNAT range. A debug log message is emitted instead. These nameservers continue to be exercised normally by real DNS queries, and the waitUntilResponse retry loop still handles genuine failures.

Non-overlay nameservers (LAN IPs, public DNS) are unaffected and continue to be probed as before.

Issue ticket number and link

N/A — no existing issue. This addresses a known annoyance discussed in the NetBird community forum: https://forum.netbird.io/t/unable-to-reach-one-or-more-dns-servers/204

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

Documentation

  • Documentation is not needed for this change (explain why)

This is a bug fix for a false-positive warning with no user-facing behaviour change. DNS resolution is unaffected.

Summary by CodeRabbit

  • Bug Fixes

    • Skip CGNAT overlay (100.64.0.0/10) addresses during startup probes to avoid false-positive DNS failures; skipping is logged and treated as passed.
    • Improve shutdown coordination and concurrency safety for probing and disabling to ensure clean stop behavior.
    • Aggregate probe errors and strengthen context-aware probing/cancellation for more reliable failure reporting.
  • Tests

    • Added comprehensive tests covering overlay detection and probing behavior across LAN, public, mixed, and overlay-only scenarios.

Nameservers configured with overlay IPs (100.64.0.0/10 CGNAT range) are
only reachable once the WireGuard tunnel is established. Probing them at
engine startup always fails with a false-positive warning even though they
become available moments later.

Skip the startup probe for any nameserver in the NetBird CGNAT range.
They continue to be exercised by real DNS queries and the waitUntilResponse
retry loop if an actual failure occurs.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 18, 2026

📝 Walkthrough

Walkthrough

Adds NetBird CGNAT overlay detection (100.64.0.0/10) to skip probing those addresses, makes probing context-aware by passing base/external contexts, and introduces sync.WaitGroup and mutex synchronization for coordinated shutdown. Tests added to validate overlay skipping and probe behavior across scenarios.

Changes

Cohort / File(s) Summary
DNS Upstream Resolver Core
client/internal/dns/upstream.go
Adds isNetBirdOverlayAddr (100.64.0.0/10) and skips probing overlay addresses; updates testNameserver signature to accept baseCtx and externalCtx and propagates contexts; introduces wg (sync.WaitGroup) and mutex-protected disabled coordination; aggregates probe errors and adjusts disable/waitUntilResponse coordination.
DNS Upstream Tests
client/internal/dns/upstream_test.go
Adds unit tests for overlay detection and ProbeAvailability behavior (overlay skip, LAN probing, mixed cases, overlay-only behavior); introduces tracking helpers and visitor wrappers to record probe activity per upstream address.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • lixmal
  • pappz

Poem

🐇 I sniff the nets where overlays hide,

100.64 tucked safe inside,
I skip those hops and mind my pace,
with contexts held and waitgroups' grace,
Hoppy probes—DNS with pride! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: skipping DNS probes for NetBird overlay addresses at startup.
Description check ✅ Passed The description covers the problem context, solution approach, testing, and documentation status, matching the template requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Important

Merge conflicts detected (Beta)

  • Resolve merge conflict in branch fix/dns-skip-overlay-probe
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
client/internal/dns/upstream.go (1)

445-458: The externalCtx parameter is unused in this PR.

The testNameserver signature was updated to accept an externalCtx for cancellation coordination, but all call sites pass nil:

  • Line 319: u.testNameserver(u.ctx, nil, upstream, ...)
  • Line 380: u.testNameserver(u.ctx, nil, upstream, ...)

If this is preparation for future functionality, consider adding a brief comment. Otherwise, the simpler single-context signature would suffice for now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 445 - 458, The externalCtx
parameter in upstreamResolverBase.testNameserver is unused (all callers pass
nil); remove the unused externalCtx parameter from the testNameserver signature
and update all callers (e.g., the u.testNameserver(...) invocations that
currently pass nil) to call the two-argument version (baseCtx, server, timeout)
so the function uses only the merged context created with context.WithTimeout;
alternatively, if you intended to support an external cancel context, add a
brief comment in testNameserver explaining why externalCtx is accepted and wire
real non-nil contexts from callers—prefer removing the unused parameter to keep
the API minimal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 129-136: The Stop() method currently holds u.mutex then calls
u.wg.Wait(), causing a deadlock with goroutines spawned by disable() that call
waitUntilResponse() and need u.mutex; fix by releasing the lock before waiting
on the WaitGroup: move u.wg.Wait() to happen before acquiring u.mutex (or simply
unlock immediately after calling u.cancel() and call u.mutex.Lock()/Unlock()
only for protecting shared state modifications), ensuring
upstreamResolverBase.Stop() calls u.cancel() first, then waits on u.wg, and only
locks u.mutex when reading/modifying u.upstreamServers or other shared
fields—this removes the lock while waiting and prevents the deadlock between
Stop(), waitUntilResponse(), and the goroutine from disable().

---

Nitpick comments:
In `@client/internal/dns/upstream.go`:
- Around line 445-458: The externalCtx parameter in
upstreamResolverBase.testNameserver is unused (all callers pass nil); remove the
unused externalCtx parameter from the testNameserver signature and update all
callers (e.g., the u.testNameserver(...) invocations that currently pass nil) to
call the two-argument version (baseCtx, server, timeout) so the function uses
only the merged context created with context.WithTimeout; alternatively, if you
intended to support an external cancel context, add a brief comment in
testNameserver explaining why externalCtx is accepted and wire real non-nil
contexts from callers—prefer removing the unused parameter to keep the API
minimal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34bf3e38-39f8-44f2-a134-6b6863e5e8d3

📥 Commits

Reviewing files that changed from the base of the PR and between a1858a9 and 434ad13.

📒 Files selected for processing (2)
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_test.go

Comment thread client/internal/dns/upstream.go
@sbeaudry-qms
Copy link
Copy Markdown
Author

Regarding the CodeRabbit nitpick of the externalCtx parameter, it was pre-existing, not introduced by this PR, so I'm not addressing it.

Cheers,
...Steve...

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 305-310: The code currently treats NetBird overlay addresses as
successful probes by setting success = true inside the
isNetBirdOverlayAddr(upstream) branch; remove the
mu.Lock()/success=true/mu.Unlock() lines so skipped overlay servers are simply
continued without marking success. Update the same change at the other
occurrence noted (the similar block around lines 340-341) so success is only set
when an actual probe succeeds and remains protected by mu where real success
assignments occur.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bcce1491-fad9-417b-aa19-423b8367575a

📥 Commits

Reviewing files that changed from the base of the PR and between 434ad13 and dc9aaa6.

📒 Files selected for processing (1)
  • client/internal/dns/upstream.go

Comment thread client/internal/dns/upstream.go
…tead of success=true for skipped overlay servers
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream_test.go`:
- Around line 598-612: The tests use context.Background() for
upstreamResolverBase so when ProbeAvailability triggers deactivate() the
background context leaves waitUntilResponse() running and leaks goroutines; fix
both tests by creating the resolver with a cancellable context via
context.WithCancel (capture cancel) and register t.Cleanup(func(){ cancel();
resolver.Stop() }) or at minimum t.Cleanup(resolver.Stop) so the retry goroutine
started by waitUntilResponse() is stopped; update the tests that construct
upstreamResolverBase and the mixed-overlay failure test to use
context.WithCancel and t.Cleanup(resolver.Stop).

In `@client/internal/dns/upstream.go`:
- Around line 48-59: The current isNetBirdOverlayAddr/netbirdCGNATRange check
incorrectly treats every 100.64.0.0/10 resolver as a NetBird peer; change
isNetBirdOverlayAddr (and any callers) to only skip probing when the address is
both in netbirdCGNATRange AND is positively associated with a NetBird peer/route
(e.g., consult the local peer-to-IP mapping or routing table used by the engine
before returning true), otherwise return false so
DefaultServer.ProbeAvailability and engine startup will probe it normally;
update callers at the other noted sites (around lines ~303-310 and ~339-343) to
use the revised predicate; also add a regression test that creates a non-NetBird
nameserver in 100.64.0.0/10 and asserts it is probed and that groups are
disabled/emits the expected startup warning when no real upstream responds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef0deda3-7d57-4bd0-b618-5f4f91f1341d

📥 Commits

Reviewing files that changed from the base of the PR and between dc9aaa6 and 931dbae.

📒 Files selected for processing (2)
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_test.go

Comment on lines +598 to +612
deactivated := false
resolver := &upstreamResolverBase{
ctx: context.Background(),
upstreamClient: trackingClient,
upstreamServers: []netip.AddrPort{lanUpstream},
upstreamTimeout: UpstreamTimeout,
deactivate: func(error) { deactivated = true },
reactivate: func() {},
}

resolver.ProbeAvailability()

assert.True(t, probed, "LAN address should have been probed")
assert.True(t, deactivated, "resolver should have been deactivated when LAN probe fails")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

These failure-path tests leave the retry loop running.

Both resolvers use context.Background(). After ProbeAvailability() fails, disable() starts waitUntilResponse(), and that goroutine never exits because the mock never recovers and nothing cancels the context. That leaks work into the rest of the suite.

Use context.WithCancel here, set cancel, and register t.Cleanup(resolver.Stop) in both tests.

🧹 Suggested cleanup
-	resolver := &upstreamResolverBase{
-		ctx:             context.Background(),
+	ctx, cancel := context.WithCancel(context.Background())
+	resolver := &upstreamResolverBase{
+		ctx:             ctx,
+		cancel:          cancel,
 		upstreamClient:  trackingClient,
 		upstreamServers: []netip.AddrPort{lanUpstream},
 		upstreamTimeout: UpstreamTimeout,
 		deactivate:      func(error) { deactivated = true },
 		reactivate:      func() {},
 	}
+	t.Cleanup(resolver.Stop)

Apply the same pattern to the mixed-overlay failure test below.

Also applies to: 694-706

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream_test.go` around lines 598 - 612, The tests use
context.Background() for upstreamResolverBase so when ProbeAvailability triggers
deactivate() the background context leaves waitUntilResponse() running and leaks
goroutines; fix both tests by creating the resolver with a cancellable context
via context.WithCancel (capture cancel) and register t.Cleanup(func(){ cancel();
resolver.Stop() }) or at minimum t.Cleanup(resolver.Stop) so the retry goroutine
started by waitUntilResponse() is stopped; update the tests that construct
upstreamResolverBase and the mixed-overlay failure test to use
context.WithCancel and t.Cleanup(resolver.Stop).

Comment on lines +48 to +59
// netbirdCGNATRange is the CGNAT range (100.64.0.0/10) used by NetBird for
// overlay peer IP addresses. Addresses in this range are only reachable once
// the WireGuard tunnel to the peer is fully established.
var netbirdCGNATRange = netip.MustParsePrefix("100.64.0.0/10")

// isNetBirdOverlayAddr returns true if the given address is within the NetBird
// CGNAT range (100.64.0.0/10). Such addresses are only reachable after the
// WireGuard tunnel is established, so probing them at engine startup will
// always produce a false-positive "unable to reach" warning.
func isNetBirdOverlayAddr(addr netip.AddrPort) bool {
return netbirdCGNATRange.Contains(addr.Addr())
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not treat every 100.64.0.0/10 resolver as a NetBird peer.

100.64.0.0/10 is shared CGNAT space, not NetBird-exclusive. With the current check, any resolver in that block is skipped, so probedAny stays false and the group never disables or emits the startup warning even if that address is just a local/ISP resolver that's actually down. DefaultServer.ProbeAvailability() and engine startup both rely on this path to disable groups when no real upstream responds.

At minimum, only skip when you can positively map the address to a NetBird peer/route; otherwise keep the normal probe behavior. Please also add a regression case for a non-NetBird 100.64/10 nameserver.

💡 Safer gating direction
-func isNetBirdOverlayAddr(addr netip.AddrPort) bool {
-	return netbirdCGNATRange.Contains(addr.Addr())
+func (u *upstreamResolverBase) shouldSkipStartupProbe(addr netip.AddrPort) bool {
+	if !netbirdCGNATRange.Contains(addr.Addr()) {
+		return false
+	}
+	return findPeerForIP(addr.Addr(), u.statusRecorder) != nil
 }
@@
-		if isNetBirdOverlayAddr(upstream) {
+		if u.shouldSkipStartupProbe(upstream) {
 			log.Debugf("skipping probe for NetBird overlay nameserver %s (tunnel not yet established)", upstream)
 			continue
 		}

Also applies to: 303-310, 339-343

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 48 - 59, The current
isNetBirdOverlayAddr/netbirdCGNATRange check incorrectly treats every
100.64.0.0/10 resolver as a NetBird peer; change isNetBirdOverlayAddr (and any
callers) to only skip probing when the address is both in netbirdCGNATRange AND
is positively associated with a NetBird peer/route (e.g., consult the local
peer-to-IP mapping or routing table used by the engine before returning true),
otherwise return false so DefaultServer.ProbeAvailability and engine startup
will probe it normally; update callers at the other noted sites (around lines
~303-310 and ~339-343) to use the revised predicate; also add a regression test
that creates a non-NetBird nameserver in 100.64.0.0/10 and asserts it is probed
and that groups are disabled/emits the expected startup warning when no real
upstream responds.

@lixmal
Copy link
Copy Markdown
Collaborator

lixmal commented Apr 23, 2026

Superseded by #5971, which takes a different approach: rather than skipping probes for overlay IPs, it drops the synthetic probe entirely and derives group health from real query outcomes. Thanks for putting this together — the problem statement in your description informed the design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants