fix(dns): skip probe for NetBird overlay addresses at engine startup by sbeaudry-qms · Pull Request #5625 · netbirdio/netbird

sbeaudry-qms · 2026-03-18T18:31:08Z

Describe your changes

When a nameserver is configured with a NetBird overlay IP (in the 100.64.0.0/10 CGNAT range), ProbeAvailability() is called at engine startup before the WireGuard tunnel to that peer is established. The probe always fails at this point — not because the nameserver is down, but because the tunnel isn't up yet. This produces a spurious "Unable to reach one or more DNS servers" warning in netbird status -d even though DNS resolves correctly once the connection settles.

This change skips the startup probe for any nameserver whose IP falls within the NetBird CGNAT range. A debug log message is emitted instead. These nameservers continue to be exercised normally by real DNS queries, and the waitUntilResponse retry loop still handles genuine failures.

Non-overlay nameservers (LAN IPs, public DNS) are unaffected and continue to be probed as before.

Issue ticket number and link

N/A — no existing issue. This addresses a known annoyance discussed in the NetBird community forum: https://forum.netbird.io/t/unable-to-reach-one-or-more-dns-servers/204

Stack

Checklist

Documentation

Documentation is not needed for this change (explain why)

This is a bug fix for a false-positive warning with no user-facing behaviour change. DNS resolution is unaffected.

Summary by CodeRabbit

Bug Fixes
- Skip CGNAT overlay (100.64.0.0/10) addresses during startup probes to avoid false-positive DNS failures; skipping is logged and treated as passed.
- Improve shutdown coordination and concurrency safety for probing and disabling to ensure clean stop behavior.
- Aggregate probe errors and strengthen context-aware probing/cancellation for more reliable failure reporting.
Tests
- Added comprehensive tests covering overlay detection and probing behavior across LAN, public, mixed, and overlay-only scenarios.

Nameservers configured with overlay IPs (100.64.0.0/10 CGNAT range) are only reachable once the WireGuard tunnel is established. Probing them at engine startup always fails with a false-positive warning even though they become available moments later. Skip the startup probe for any nameserver in the NetBird CGNAT range. They continue to be exercised by real DNS queries and the waitUntilResponse retry loop if an actual failure occurs.

coderabbitai · 2026-03-18T18:31:30Z

📝 Walkthrough

Walkthrough

Adds NetBird CGNAT overlay detection (100.64.0.0/10) to skip probing those addresses, makes probing context-aware by passing base/external contexts, and introduces sync.WaitGroup and mutex synchronization for coordinated shutdown. Tests added to validate overlay skipping and probe behavior across scenarios.

Changes

Cohort / File(s)	Summary
DNS Upstream Resolver Core `client/internal/dns/upstream.go`	Adds `isNetBirdOverlayAddr` (100.64.0.0/10) and skips probing overlay addresses; updates `testNameserver` signature to accept `baseCtx` and `externalCtx` and propagates contexts; introduces `wg` (sync.WaitGroup) and mutex-protected `disabled` coordination; aggregates probe errors and adjusts disable/waitUntilResponse coordination.
DNS Upstream Tests `client/internal/dns/upstream_test.go`	Adds unit tests for overlay detection and ProbeAvailability behavior (overlay skip, LAN probing, mixed cases, overlay-only behavior); introduces tracking helpers and visitor wrappers to record probe activity per upstream address.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

[client] Fix DNS probe thread safety and avoid blocking engine sync #5576: Modifies the same DNS upstream resolver code, including adding a wg field and changing testNameserver to accept dual contexts.
[client] stop upstream retry loop immediately on context cancellation #5403: Alters upstream resolver cancellation/backoff and disable/locking logic related to wait loops and goroutine coordination.
[client] Add env var to skip DNS probing #5270: Previously adjusted ProbeAvailability behavior in the same module (client/internal/dns), related to probing skip conditions.

Suggested reviewers

lixmal
pappz

Poem

🐇 I sniff the nets where overlays hide,

100.64 tucked safe inside,
I skip those hops and mind my pace,
with contexts held and waitgroups' grace,
Hoppy probes—DNS with pride! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: skipping DNS probes for NetBird overlay addresses at startup.
Description check	✅ Passed	The description covers the problem context, solution approach, testing, and documentation status, matching the template requirements.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Important

Merge conflicts detected (Beta)

Resolve merge conflict in branch fix/dns-skip-overlay-probe

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

client/internal/dns/upstream.go (1)
445-458: The externalCtx parameter is unused in this PR.

The testNameserver signature was updated to accept an externalCtx for cancellation coordination, but all call sites pass nil:

Line 319: u.testNameserver(u.ctx, nil, upstream, ...)

Line 380: u.testNameserver(u.ctx, nil, upstream, ...)

If this is preparation for future functionality, consider adding a brief comment. Otherwise, the simpler single-context signature would suffice for now.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 445 - 458, The externalCtx
parameter in upstreamResolverBase.testNameserver is unused (all callers pass
nil); remove the unused externalCtx parameter from the testNameserver signature
and update all callers (e.g., the u.testNameserver(...) invocations that
currently pass nil) to call the two-argument version (baseCtx, server, timeout)
so the function uses only the merged context created with context.WithTimeout;
alternatively, if you intended to support an external cancel context, add a
brief comment in testNameserver explaining why externalCtx is accepted and wire
real non-nil contexts from callers—prefer removing the unused parameter to keep
the API minimal.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 129-136: The Stop() method currently holds u.mutex then calls
u.wg.Wait(), causing a deadlock with goroutines spawned by disable() that call
waitUntilResponse() and need u.mutex; fix by releasing the lock before waiting
on the WaitGroup: move u.wg.Wait() to happen before acquiring u.mutex (or simply
unlock immediately after calling u.cancel() and call u.mutex.Lock()/Unlock()
only for protecting shared state modifications), ensuring
upstreamResolverBase.Stop() calls u.cancel() first, then waits on u.wg, and only
locks u.mutex when reading/modifying u.upstreamServers or other shared
fields—this removes the lock while waiting and prevents the deadlock between
Stop(), waitUntilResponse(), and the goroutine from disable().

---

Nitpick comments:
In `@client/internal/dns/upstream.go`:
- Around line 445-458: The externalCtx parameter in
upstreamResolverBase.testNameserver is unused (all callers pass nil); remove the
unused externalCtx parameter from the testNameserver signature and update all
callers (e.g., the u.testNameserver(...) invocations that currently pass nil) to
call the two-argument version (baseCtx, server, timeout) so the function uses
only the merged context created with context.WithTimeout; alternatively, if you
intended to support an external cancel context, add a brief comment in
testNameserver explaining why externalCtx is accepted and wire real non-nil
contexts from callers—prefer removing the unused parameter to keep the API
minimal.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34bf3e38-39f8-44f2-a134-6b6863e5e8d3

📥 Commits

Reviewing files that changed from the base of the PR and between a1858a9 and 434ad13.

📒 Files selected for processing (2)

client/internal/dns/upstream.go
client/internal/dns/upstream_test.go

…it() in Stop() to prevent deadlock

sbeaudry-qms · 2026-03-18T19:07:29Z

Regarding the CodeRabbit nitpick of the externalCtx parameter, it was pre-existing, not introduced by this PR, so I'm not addressing it.

Cheers,
...Steve...

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 305-310: The code currently treats NetBird overlay addresses as
successful probes by setting success = true inside the
isNetBirdOverlayAddr(upstream) branch; remove the
mu.Lock()/success=true/mu.Unlock() lines so skipped overlay servers are simply
continued without marking success. Update the same change at the other
occurrence noted (the similar block around lines 340-341) so success is only set
when an actual probe succeeds and remains protected by mu where real success
assignments occur.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bcce1491-fad9-417b-aa19-423b8367575a

📥 Commits

Reviewing files that changed from the base of the PR and between 434ad13 and dc9aaa6.

📒 Files selected for processing (1)

client/internal/dns/upstream.go

…tead of success=true for skipped overlay servers

sonarqubecloud · 2026-03-18T19:26:34Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream_test.go`:
- Around line 598-612: The tests use context.Background() for
upstreamResolverBase so when ProbeAvailability triggers deactivate() the
background context leaves waitUntilResponse() running and leaks goroutines; fix
both tests by creating the resolver with a cancellable context via
context.WithCancel (capture cancel) and register t.Cleanup(func(){ cancel();
resolver.Stop() }) or at minimum t.Cleanup(resolver.Stop) so the retry goroutine
started by waitUntilResponse() is stopped; update the tests that construct
upstreamResolverBase and the mixed-overlay failure test to use
context.WithCancel and t.Cleanup(resolver.Stop).

In `@client/internal/dns/upstream.go`:
- Around line 48-59: The current isNetBirdOverlayAddr/netbirdCGNATRange check
incorrectly treats every 100.64.0.0/10 resolver as a NetBird peer; change
isNetBirdOverlayAddr (and any callers) to only skip probing when the address is
both in netbirdCGNATRange AND is positively associated with a NetBird peer/route
(e.g., consult the local peer-to-IP mapping or routing table used by the engine
before returning true), otherwise return false so
DefaultServer.ProbeAvailability and engine startup will probe it normally;
update callers at the other noted sites (around lines ~303-310 and ~339-343) to
use the revised predicate; also add a regression test that creates a non-NetBird
nameserver in 100.64.0.0/10 and asserts it is probed and that groups are
disabled/emits the expected startup warning when no real upstream responds.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef0deda3-7d57-4bd0-b618-5f4f91f1341d

📥 Commits

Reviewing files that changed from the base of the PR and between dc9aaa6 and 931dbae.

📒 Files selected for processing (2)

client/internal/dns/upstream.go
client/internal/dns/upstream_test.go

coderabbitai · 2026-03-18T19:36:56Z

+	deactivated := false
+	resolver := &upstreamResolverBase{
+		ctx:             context.Background(),
+		upstreamClient:  trackingClient,
+		upstreamServers: []netip.AddrPort{lanUpstream},
+		upstreamTimeout: UpstreamTimeout,
+		deactivate:      func(error) { deactivated = true },
+		reactivate:      func() {},
+	}
+
+	resolver.ProbeAvailability()
+
+	assert.True(t, probed, "LAN address should have been probed")
+	assert.True(t, deactivated, "resolver should have been deactivated when LAN probe fails")
+}


⚠️ Potential issue | 🟡 Minor

These failure-path tests leave the retry loop running.

Both resolvers use context.Background(). After ProbeAvailability() fails, disable() starts waitUntilResponse(), and that goroutine never exits because the mock never recovers and nothing cancels the context. That leaks work into the rest of the suite.

Use context.WithCancel here, set cancel, and register t.Cleanup(resolver.Stop) in both tests.

🧹 Suggested cleanup

- resolver := &upstreamResolverBase{ - ctx: context.Background(), + ctx, cancel := context.WithCancel(context.Background()) + resolver := &upstreamResolverBase{ + ctx: ctx, + cancel: cancel, upstreamClient: trackingClient, upstreamServers: []netip.AddrPort{lanUpstream}, upstreamTimeout: UpstreamTimeout, deactivate: func(error) { deactivated = true }, reactivate: func() {}, } + t.Cleanup(resolver.Stop)

Apply the same pattern to the mixed-overlay failure test below.

Also applies to: 694-706

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream_test.go` around lines 598 - 612, The tests use context.Background() for upstreamResolverBase so when ProbeAvailability triggers deactivate() the background context leaves waitUntilResponse() running and leaks goroutines; fix both tests by creating the resolver with a cancellable context via context.WithCancel (capture cancel) and register t.Cleanup(func(){ cancel(); resolver.Stop() }) or at minimum t.Cleanup(resolver.Stop) so the retry goroutine started by waitUntilResponse() is stopped; update the tests that construct upstreamResolverBase and the mixed-overlay failure test to use context.WithCancel and t.Cleanup(resolver.Stop).

coderabbitai · 2026-03-18T19:36:56Z

+// netbirdCGNATRange is the CGNAT range (100.64.0.0/10) used by NetBird for
+// overlay peer IP addresses. Addresses in this range are only reachable once
+// the WireGuard tunnel to the peer is fully established.
+var netbirdCGNATRange = netip.MustParsePrefix("100.64.0.0/10")
+
+// isNetBirdOverlayAddr returns true if the given address is within the NetBird
+// CGNAT range (100.64.0.0/10). Such addresses are only reachable after the
+// WireGuard tunnel is established, so probing them at engine startup will
+// always produce a false-positive "unable to reach" warning.
+func isNetBirdOverlayAddr(addr netip.AddrPort) bool {
+	return netbirdCGNATRange.Contains(addr.Addr())
+}


⚠️ Potential issue | 🟠 Major

Do not treat every 100.64.0.0/10 resolver as a NetBird peer.

100.64.0.0/10 is shared CGNAT space, not NetBird-exclusive. With the current check, any resolver in that block is skipped, so probedAny stays false and the group never disables or emits the startup warning even if that address is just a local/ISP resolver that's actually down. DefaultServer.ProbeAvailability() and engine startup both rely on this path to disable groups when no real upstream responds.

At minimum, only skip when you can positively map the address to a NetBird peer/route; otherwise keep the normal probe behavior. Please also add a regression case for a non-NetBird 100.64/10 nameserver.

💡 Safer gating direction

-func isNetBirdOverlayAddr(addr netip.AddrPort) bool { - return netbirdCGNATRange.Contains(addr.Addr()) +func (u *upstreamResolverBase) shouldSkipStartupProbe(addr netip.AddrPort) bool { + if !netbirdCGNATRange.Contains(addr.Addr()) { + return false + } + return findPeerForIP(addr.Addr(), u.statusRecorder) != nil } @@ - if isNetBirdOverlayAddr(upstream) { + if u.shouldSkipStartupProbe(upstream) { log.Debugf("skipping probe for NetBird overlay nameserver %s (tunnel not yet established)", upstream) continue }

Also applies to: 303-310, 339-343

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 48 - 59, The current isNetBirdOverlayAddr/netbirdCGNATRange check incorrectly treats every 100.64.0.0/10 resolver as a NetBird peer; change isNetBirdOverlayAddr (and any callers) to only skip probing when the address is both in netbirdCGNATRange AND is positively associated with a NetBird peer/route (e.g., consult the local peer-to-IP mapping or routing table used by the engine before returning true), otherwise return false so DefaultServer.ProbeAvailability and engine startup will probe it normally; update callers at the other noted sites (around lines ~303-310 and ~339-343) to use the revised predicate; also add a regression test that creates a non-NetBird nameserver in 100.64.0.0/10 and asserts it is probed and that groups are disabled/emits the expected startup warning when no real upstream responds.

lixmal · 2026-04-23T11:34:15Z

Superseded by #5971, which takes a different approach: rather than skipping probes for overlay IPs, it drops the synthetic probe entirely and derives group health from real query outcomes. Thanks for putting this together — the problem statement in your description informed the design.

coderabbitai Bot reviewed Mar 18, 2026

View reviewed changes

Comment thread client/internal/dns/upstream.go

fix(dns): Addressing CodeRabbit suggestion. Remove mutex around wg.Wa…

dc9aaa6

…it() in Stop() to prevent deadlock

coderabbitai Bot reviewed Mar 18, 2026

View reviewed changes

Comment thread client/internal/dns/upstream.go

fix(dns): Following CodeRabbit's suggestion: Using probedAny flag ins…

931dbae

…tead of success=true for skipped overlay servers

coderabbitai Bot reviewed Mar 18, 2026

View reviewed changes

lixmal closed this Apr 23, 2026

lixmal mentioned this pull request Apr 23, 2026

[client] Drop DNS probes for passive health projection #5971

Open

7 tasks

Uh oh!

Conversation

sbeaudry-qms commented Mar 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes

Issue ticket number and link

Stack

Checklist

Documentation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Merge conflicts detected (Beta)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sbeaudry-qms commented Mar 18, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Mar 18, 2026

Quality Gate passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

lixmal commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sbeaudry-qms commented Mar 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 18, 2026 •

edited

Loading