Conversation
When an exit-node peer's network-map installs a 0.0.0.0/0 default route
on the overlay interface before that peer's WireGuard key material is
active, any UDP socket dialing an off-link address is routed into wt0
and the kernel returns ENOKEY.
Two places needed fixing:
1. The mgmt cache refresh path. It reactively refreshes the
control-plane FQDNs advertised by the mgmt (api/signal/stun/turn/
the Relay pool root) after the daemon has installed its own
resolv.conf pointing at the overlay listener. Previously the
refresh dial followed the chain's upstream handler, which followed
the overlay default route and deadlocked on ENOKEY.
2. Foreign relay FQDN resolution. When a remote peer is homed on a
different relay instance than us, we need to resolve a streamline-*
subdomain that is not in the cache. That lookup went through the
same overlay-routed upstream and failed identically, deadlocking
the exit-node test whenever the relay LB put the two peers on
different instances.
Fix both by giving the mgmt cache a dedicated net.Resolver that dials
the original pre-NetBird system nameservers through nbnet.NewDialer.
The dialer marks the socket as control-plane (SO_MARK on Linux,
IP_BOUND_IF on darwin, IP_UNICAST_IF on Windows); the routemanager's
policy rules keep those sockets on the underlay regardless of the
overlay default.
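For illustration, a minimal sketch of that resolver; `servers` is assumed to hold the captured pre-NetBird nameservers as `ip:53` strings, and the nbnet import path is assumed from the NetBird tree (the helper name is illustrative):

```go
package dnsbypass

import (
	"context"
	"fmt"
	"net"

	nbnet "github.com/netbirdio/netbird/util/net"
)

// newBypassResolver returns a resolver whose Dial ignores the address the
// Go resolver derived from the daemon's own resolv.conf and instead dials
// the captured pre-NetBird nameservers on a control-plane-marked socket.
func newBypassResolver(servers []string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // force the pure-Go resolver so our Dial is actually used
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			nbDialer := nbnet.NewDialer() // applies the platform socket mark
			var lastErr error
			for _, ns := range servers {
				conn, err := nbDialer.DialContext(ctx, network, ns)
				if err == nil {
					return conn, nil
				}
				lastErr = err
			}
			if lastErr == nil {
				return nil, fmt.Errorf("no bypass nameservers configured")
			}
			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
		},
	}
}
```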
Pool-root domains (the Relay entries in ServerDomains) now register
through a subdomain-matching wrapper so that instance subdomains like
streamline-de-fra1-0.relay.netbird.io also hit the mgmt cache handler.
On cache miss under a pool root, ServeDNS resolves the FQDN on demand
through the bypass resolver, caches the result, and returns it.
Pool-root membership is derived dynamically from mgmt-advertised
ServerDomains.Relay[] — no hardcoded domain lists, no protocol change.
No hardcoded fallback nameservers: if the host had no original system
resolver at all, the bypass resolver stays nil and the stale-while-
revalidate cache keeps serving. The general upstream forwarder and
the user DNS path are unchanged.
The subdomain-matching handler chain entry only helps when the OS DNS manager routes the query to the daemon's listener. On systems using systemd-resolved (Ubuntu 20.04 / 22.04, most modern Debian/RHEL-family distros), the daemon delegates a *specific set* of match domains to the wt0 link via D-Bus. Any FQDN outside that set is answered from the host's global upstream DNS, never touching the handler chain.

With only 'netbird.cloud' and the in-addr.arpa reverse zone advertised as match domains, a relay instance FQDN like streamline-de-fra1-0.relay.netbird.io is resolved globally. When an exit-node default route is active on wt0 but the peer is not yet live, that global lookup times out on the local stub (127.0.0.53) because systemd-resolved's uplink DNS is unreachable through the overlay, and the foreign-relay dial fails exactly the way the ENOKEY path did on file-manager hosts.

Bump the pool-root domains (ServerDomains.Relay entries) into extraDomains so applyHostConfig adds them as match-only domains alongside 'netbird.cloud'. systemd-resolved then delegates *.relay.netbird.io to the wt0 link, queries hit the daemon's DNS listener, and the on-demand resolve path via the bypass resolver runs as intended.

Tracking map mgmtPoolRoots isolates this refcount from the RegisterHandler path, so updates across successive mgmt syncs increment/decrement only the changed set.
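A sketch of that reconciliation, with field shapes assumed from the prose rather than from the actual code: `extraDomains` as a refcount map, `mgmtPoolRoots` as the set of refs this path owns.

```go
package dnssrv

// Sketch only: the real server type carries many more fields and locking.
type server struct {
	extraDomains  map[string]int      // refcounted match-only domains for the OS DNS manager
	mgmtPoolRoots map[string]struct{} // which refs the pool-root path owns
}

func (s *server) reconcileMgmtPoolRoots(current map[string]struct{}) {
	// Take a ref for roots that appeared in this mgmt sync.
	for root := range current {
		if _, tracked := s.mgmtPoolRoots[root]; !tracked {
			s.mgmtPoolRoots[root] = struct{}{}
			s.extraDomains[root]++
		}
	}
	// Release the ref for roots that disappeared since the last sync.
	for root := range s.mgmtPoolRoots {
		if _, still := current[root]; still {
			continue
		}
		delete(s.mgmtPoolRoots, root)
		s.extraDomains[root]--
		if s.extraDomains[root] <= 0 {
			delete(s.extraDomains, root)
		}
	}
}
```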
📝 Walkthrough

Adds a bypass DNS resolver factory and integrates it into the management resolver and DNS server: on cache misses for relay pool-root subdomains, the mgmt resolver can perform singleflight on-demand bypass lookups and cache A/AAAA results, and the server config tracks/registers pool-root domains separately.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client as Client
    participant Handler as DNS Handler
    participant Mgmt as Mgmt Cache Resolver
    participant Bypass as Bypass Resolver
    participant NS as Remote Nameserver
    Client->>Handler: DNS query (pool-root subdomain)
    Handler->>Mgmt: ServeDNS (cache miss)
    Mgmt->>Mgmt: Detect pool-root subdomain
    Mgmt->>Bypass: lookupRecords (singleflight)
    Bypass->>NS: Dial <ip>:53 (try list sequentially)
    alt Dial success
        NS-->>Bypass: TCP/UDP conn established
        Bypass->>NS: DNS query
        NS-->>Bypass: DNS response
        Bypass-->>Mgmt: A/AAAA records
        Mgmt->>Mgmt: Cache records
        Mgmt-->>Handler: DNS response
    else All dials fail
        Bypass-->>Mgmt: Dial error
        Mgmt-->>Handler: Defer to next handler / fallback
    end
    Handler-->>Client: DNS response or fallback
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
client/internal/dns/mgmt/bypass_resolver.go (1)
39-53: Consider a per-server timeout to prevent one slow nameserver from exhausting the parent context deadline.

The `net.Resolver` with `PreferGo: true` calls `Dial` once per server without built-in per-server timeout protection. Sequential iteration means an unreachable or blackholed nameserver can consume the entire `ctx` deadline before the next one is attempted, defeating the point of failover. A short per-server timeout (e.g., 2–3 seconds) derived from the parent context preserves failover semantics while keeping one slow resolver from blocking the others. Additionally, `nbnet.NewDialer()` can be hoisted outside the closure, since `net.Dialer` is safe for concurrent reuse and this avoids redundant allocations on every call.

♻️ Suggested refinement
```diff
-	return &net.Resolver{
-		PreferGo: true,
-		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
-			nbDialer := nbnet.NewDialer()
-			var lastErr error
-			for _, ns := range servers {
-				conn, err := nbDialer.DialContext(ctx, network, ns)
-				if err == nil {
-					return conn, nil
-				}
-				lastErr = err
-			}
-			if lastErr == nil {
-				return nil, fmt.Errorf("no bypass nameservers configured")
-			}
-			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
-		},
-	}
+	nbDialer := nbnet.NewDialer()
+	const perServerTimeout = 2 * time.Second
+	return &net.Resolver{
+		PreferGo: true,
+		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
+			var lastErr error
+			for _, ns := range servers {
+				dialCtx, cancel := context.WithTimeout(ctx, perServerTimeout)
+				conn, err := nbDialer.DialContext(dialCtx, network, ns)
+				cancel()
+				if err == nil {
+					return conn, nil
+				}
+				lastErr = err
+				if ctx.Err() != nil {
+					break
+				}
+			}
+			if lastErr == nil {
+				return nil, fmt.Errorf("no bypass nameservers configured")
+			}
+			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
+		},
+	}
```

(Requires `time` import.)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 227-268: resolveOnDemand currently allows N concurrent upstream
lookups because the TOCTOU cache write only dedups writes; wrap the
lookupRecords call with the Resolver's refreshGroup singleflight by calling
m.refreshGroup.Do with a namespaced key (e.g.
"ondemand:"+question.Name+":"+strconv.Itoa(int(question.Qtype))) so concurrent
resolveOnDemand callers collapse into one lookup; inside the Do function perform
lookupRecords(ctx, d, question) and return the records or error, then outside
the Do retrieve the shared result (type-assert to the records slice), handle
errors exactly as before, and perform the single cache write into m.records (use
the same exists check + mutex) and reply to the client—this prevents parallel
upstream queries while preserving existing caching and error paths for
resolveOnDemand, lookupRecords, m.records, and refreshGroup.
In `@client/internal/dns/server.go`:
- Around line 635-654: Canonicalize pool root entries before building
poolRootSet so they match the form returned by
s.mgmtCacheResolver.GetCachedDomains(): call the same normalization used
elsewhere (e.g., toZone(d) / dns.Fqdn(strings.ToLower(...))) when iterating
s.mgmtCacheResolver.GetPoolRootDomains() and use that value as the map key; then
the subsequent loop that checks if _, isPool := poolRootSet[d]; isPool will
correctly dedupe and avoid appending pool roots into exactDomains, preserving
the earlier subdomainMatchHandler registration (affecting poolRootSet,
GetPoolRootDomains, GetCachedDomains, toZone, registerHandler,
subdomainMatchHandler, PriorityMgmtCache).
GetPoolRootDomains returns what the extractor stored; GetCachedDomains strips the trailing dot from question names. Today those happen to produce the same domain.Domain string because both sources run through domain.FromString, but relying on that parity is fragile. Run both sides of the poolRootSet membership check through toZone so the comparison is independent of each source's canonical form. Prevents a future extractor or cache-entry path from silently routing a pool-root into exactDomains, where a second registerHandler call with the bare resolver would clobber the subdomainMatchHandler registration (AddHandler replaces entries with the same pattern + priority).
resolveOnDemand deduplicated the cache write but not the upstream lookup — a burst of concurrent queries for a freshly-learned pool-root subdomain (A + AAAA from a single getaddrinfo, or multiple peer workers dialing the same foreign relay) would each hit the bypass resolver independently. Wrap the lookupRecords call in refreshGroup.Do with an "ondemand:"-prefixed key so concurrent callers collapse into one request; the prefix namespaces this key off scheduleRefresh's keyspace so the two paths don't collide. Error and empty-result paths are preserved; the cache write still runs under the existing TOCTOU-safe exists check + mutex so a winning refresh isn't clobbered.
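Shape of that dedup as a standalone sketch: `refreshGroup` is the singleflight group named above, and `lookupRecords` is passed in as a stand-in for the real resolver method.

```go
package mgmtcache

import (
	"context"
	"strconv"

	"github.com/miekg/dns"
	"golang.org/x/sync/singleflight"
)

type resolver struct {
	refreshGroup singleflight.Group
}

// dedupedLookup collapses concurrent on-demand lookups for the same
// (name, qtype) into a single upstream call. The "ondemand:" prefix keeps
// these keys out of the scheduled-refresh keyspace.
func (m *resolver) dedupedLookup(ctx context.Context, q dns.Question,
	lookupRecords func(context.Context, dns.Question) ([]dns.RR, error)) ([]dns.RR, error) {

	key := "ondemand:" + q.Name + ":" + strconv.Itoa(int(q.Qtype))
	v, err, _ := m.refreshGroup.Do(key, func() (any, error) {
		return lookupRecords(ctx, q)
	})
	if err != nil {
		return nil, err
	}
	return v.([]dns.RR), nil
}
```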
UpdateServerConfig's cognitive complexity was at 31; the SonarCloud threshold is 20. The deep nesting came from wrapping the whole body in the mgmtCacheResolver nil-check and stacking five for-loops + conditionals inside it. Invert to an early return when the resolver is nil, then split the remaining work into two focused helpers:

- registerMgmtCacheHandlers: handler-chain wiring (pool-root subdomain wrapper + exact-match registration) with the toZone-canonicalized membership check.
- reconcileMgmtPoolRoots: extraDomains refcounting against the current pool-root set, plus the mgmtPoolRoots tracking map.

Pure refactor — behaviour is identical.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 205-222: isUnderPoolRoot currently lowercases and trims trailing
dots but does not apply the same canonicalization used during handler
registration; update isUnderPoolRoot to call the shared toZone (which uses
nbdns.NormalizeZone to strip wildcard prefixes) on both fqdn and each relay root
before comparing so membership checks match registerMgmtCacheHandlers
canonicalization; locate Resolver.isUnderPoolRoot and replace the inline
normalization with calls to toZone (or inline the same NormalizeZone logic if
toZone is not directly accessible) ensuring you handle empty/invalid zones the
same way registerMgmtCacheHandlers does.
🧹 Nitpick comments (1)
client/internal/dns/server.go (1)
424-442: `mgmtPoolRoots` not cleared on Stop.

`Stop()` calls `maps.Clear(s.extraDomains)` but does not clear `s.mgmtPoolRoots`. If the server is ever restarted in-place (not typical, but the field outlives a Stop), the next `reconcileMgmtPoolRoots` will decrement counts in a freshly-cleared `extraDomains` map, producing negative values which then get deleted on the next tick — not a correctness bug today, but the two tracking maps intentionally mirror each other per the comment on lines 120-124, so keeping them in sync on Stop matches that intent.

🧹 Proposed cleanup
```diff
 	maps.Clear(s.extraDomains)
+	maps.Clear(s.mgmtPoolRoots)
```
isUnderPoolRoot lowercased and trimmed the trailing dot, but did not
strip a leading "*." wildcard the way server.toZone does via
nbdns.NormalizeZone. If a mgmt-advertised Relay URL ever comes through
as "*.relay.netbird.io", the handler-chain registration side strips
the wildcard (toZone) but the membership check here would keep it,
so HasSuffix(".*.relay.netbird.io") would never match legitimate
instance subdomains and on-demand resolves would not fire.
Today the extractor lowercases + IDNA-normalizes URLs and rejects the
wildcard form, so the divergence is latent. Close it anyway by running
both sides of the membership check through a shared
canonicalizePoolDomain helper that mirrors toZone's transformation
(modulo trailing-dot orientation, which is self-consistent within
this function). toZone itself lives in the parent dns package and
cannot be imported here without a cycle.
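A sketch of the shared helper under those constraints (the real toZone also applies IDNA normalization, skipped here):

```go
package mgmtcache

import "strings"

// canonicalizePoolDomain mirrors the toZone transformation described above:
// lowercase, drop the trailing dot, and strip a leading "*." wildcard label.
func canonicalizePoolDomain(s string) string {
	s = strings.TrimSuffix(strings.ToLower(s), ".")
	return strings.TrimPrefix(s, "*.")
}

// isUnderPoolRoot reports whether fqdn equals a pool root or is a subdomain
// of one, with both sides run through the same canonical form.
func isUnderPoolRoot(fqdn string, roots []string) bool {
	name := canonicalizePoolDomain(fqdn)
	for _, root := range roots {
		r := canonicalizePoolDomain(root)
		if name == r || strings.HasSuffix(name, "."+r) {
			return true
		}
	}
	return false
}
```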
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)
466-518: Avoid double-reading `bypassResolver`; pass it into `osLookup`.

`lookupBoth`/`lookupRecords` read `m.bypassResolver` to pick a branch, and then `osLookup` (lines 569-574) reads it again under its own RLock. Two small downsides:

- If `SetBypassResolver(nil)` ever races between the outer read and the inner read, the OS path silently runs against `net.DefaultResolver` without the caller arming the loop detector (`markRefreshing`) — the branch predicate and the resolver chosen can disagree. In practice `SetBypassResolver` is called once at setup, so this is theoretical, but the invariant is easy to break later.
- Redundant RLock on the hot path.

Thread the already-read resolver into `osLookup` so the decision and the dial agree, and drop the second lock:

♻️ Proposed refactor
```diff
-func (m *Resolver) osLookup(ctx context.Context, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
+func (m *Resolver) osLookup(ctx context.Context, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
+	m.mutex.RLock()
+	resolver := m.bypassResolver
+	m.mutex.RUnlock()
+	if resolver == nil {
+		resolver = net.DefaultResolver
+	}
+	return m.osLookupWith(ctx, resolver, d, dnsName, qtype)
+}
+
+func (m *Resolver) osLookupWith(ctx context.Context, resolver *net.Resolver, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
 	network := resutil.NetworkForQtype(qtype)
 	if network == "" {
 		return nil, fmt.Errorf("unsupported qtype %s", dns.TypeToString[qtype])
 	}
 	log.Infof("looking up IP for mgmt domain=%s type=%s", d.SafeString(), dns.TypeToString[qtype])
 	defer log.Infof("done looking up IP for mgmt domain=%s type=%s", d.SafeString(), dns.TypeToString[qtype])
-	m.mutex.RLock()
-	resolver := m.bypassResolver
-	m.mutex.RUnlock()
-	if resolver == nil {
-		resolver = net.DefaultResolver
-	}
 	result := resutil.LookupIP(ctx, resolver, network, d.PunycodeString(), qtype)
 	...
 }
```

Then in `lookupBoth`/`lookupRecords`, switch the `if bypass != nil` branches to call `osLookupWith(ctx, bypass, ...)` directly (and keep the existing `osLookup(...)` calls on the legacy OS-fallback path so they arm `markRefreshing` as today).



Describe your changes
Exit-Node Peer Routing Issue and Fix
Problem
When an exit-node peer installs a 0.0.0.0/0 default route on wt0 before its WireGuard peer entry is active, any socket dialing an off-link address via that route fails with kernel ENOKEY.
This deadlocks a specific path: when a remote peer is homed on a different relay instance than the local daemon, the daemon must dial a foreign relay FQDN (e.g. streamline-de-fra1-0.relay.netbird.io). That FQDN is not in the mgmt cache, so it falls through to the normal upstream handler — which takes the overlay-routed dial and fails. No DNS → no relay → no peer handshake → no DNS.
Deterministic, but only triggers when the relay load-balancer puts peers on different instances. Caused the intermittent exit-node e2e failures on rocky9 / debian-12.
Fix
mgmt.Resolver bypass mechanism: mgmt.Resolver gets a dedicated *net.Resolver whose Dial uses nbnet.NewDialer(). On Linux the socket gets SO_MARK=ControlPlaneMark, on darwin IP_BOUND_IF, on Windows IP_UNICAST_IF — the same bypass mechanism gRPC/STUN/TURN/relay-websocket already use. Policy routes keep it on the underlay interface.
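For the Linux leg specifically, the marking pattern looks roughly like this. The mark constant is a placeholder (NetBird defines its own), and `nbnet.NewDialer` wraps this plus the darwin/Windows variants.

```go
package dnsbypass

import (
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// controlPlaneMark is a placeholder fwmark value for illustration only.
const controlPlaneMark = 0x1BD00

// markedDialer returns a dialer whose sockets carry the control-plane
// fwmark, which the routemanager's policy rules match to keep the flow
// on the underlay despite the overlay default route.
func markedDialer() *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// Setting SO_MARK requires CAP_NET_ADMIN.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_MARK, controlPlaneMark)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
}
```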
Original nameservers: Dialer targets the original pre-NetBird system nameservers, captured from the registerFallback flow. No hardcoded fallback IPs; if the host had no original resolver, the bypass resolver stays nil and the stale-while-revalidate cache serves.
Pool-root domains: Pool-root domains (ServerDomains.Relay[], e.g. relay.netbird.io) are registered through a thin subdomainMatchHandler wrapper, so instance subdomains (streamline-*.relay.netbird.io) route to the mgmt cache.
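A sketch of the wrapper idea against the miekg/dns handler interface; the real chain entry has NetBird's own handler type and fallthrough semantics, so the fields and behaviour here are illustrative.

```go
package dnssrv

import (
	"strings"

	"github.com/miekg/dns"
)

// subdomainMatchHandler forwards queries for a pool root and anything
// beneath it to the wrapped handler. root is stored canonical and fully
// qualified, e.g. "relay.netbird.io.".
type subdomainMatchHandler struct {
	root string
	next dns.Handler // the mgmt cache resolver
}

func (h *subdomainMatchHandler) ServeDNS(w dns.ResponseWriter, r *dns.Msg) {
	if len(r.Question) == 0 {
		return
	}
	name := strings.ToLower(r.Question[0].Name)
	// Match the root itself and any instance subdomain beneath it,
	// e.g. streamline-de-fra1-0.relay.netbird.io.
	if name == h.root || strings.HasSuffix(name, "."+h.root) {
		h.next.ServeDNS(w, r)
		return
	}
	// Non-matching names would fall through to the next chain entry here.
}
```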
Cache miss resolution: On cache miss under a pool root, ServeDNS resolves on demand via the bypass resolver, caches the result, and returns it.
OS DNS manager integration: Pool-roots are also pushed into extraDomains so applyHostConfig advertises them as match domains to systemd-resolved / NetworkManager / scutil. Without this, those OS DNS managers would answer the query from the host's global upstream, skipping the handler chain entirely — which is exactly what happened on ubuntu-20.04.
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Bug Fixes