
[client] Fix mgmt cache bypass overlay #5987

Open
pappz wants to merge 6 commits into main from fix-mgmt-cache-bypass-overlay

Conversation

@pappz
Collaborator

@pappz pappz commented Apr 24, 2026

Describe your changes

Exit-Node Peer Routing Issue and Fix

Problem

When an exit-node peer installs a 0.0.0.0/0 default route on wt0 before its WireGuard peer entry is active, any socket dialing an off-link address via that route fails with a kernel ENOKEY error.

This deadlocks a specific path: when a remote peer is homed on a different relay instance than the local daemon, the daemon must dial a foreign relay FQDN (e.g. streamline-de-fra1-0.relay.netbird.io). That FQDN is not in the mgmt cache, so it falls through to the normal upstream handler, whose dial follows the overlay route and fails. No DNS → no relay → no peer handshake → no DNS.

The failure is deterministic, but it only triggers when the relay load balancer places the two peers on different instances; this caused the intermittent exit-node e2e failures on rocky9 / debian-12.

Fix

  • mgmt.Resolver bypass mechanism: mgmt.Resolver gets a dedicated *net.Resolver whose Dial uses nbnet.NewDialer(). On Linux the socket gets SO_MARK=ControlPlaneMark, on darwin IP_BOUND_IF, on Windows IP_UNICAST_IF — the same bypass mechanism gRPC/STUN/TURN/relay-websocket already use. Policy routes keep these sockets on the underlay interface (see the sketch after this list).

  • Original nameservers: Dialer targets the original pre-NetBird system nameservers, captured from the registerFallback flow. No hardcoded fallback IPs; if the host had no original resolver, the bypass resolver stays nil and the stale-while-revalidate cache serves.

  • Pool-root domains: Pool-root domains (ServerDomains.Relay[], e.g. relay.netbird.io) are registered through a thin subdomainMatchHandler wrapper, so instance subdomains (streamline-*.relay.netbird.io) route to the mgmt cache.

  • Cache miss resolution: On cache miss under a pool root, ServeDNS resolves on demand via the bypass resolver, caches the result, and returns it.

  • OS DNS manager integration: Pool-roots are also pushed into extraDomains so applyHostConfig advertises them as match domains to systemd-resolved / NetworkManager / scutil. Without this, those OS DNS managers would answer the query from the host's global upstream, skipping the handler chain entirely — which is exactly what happened on ubuntu-20.04.
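
For orientation, here is a minimal sketch of the bypass-resolver construction described in the first bullet. It follows the shape of NewBypassResolver in client/internal/dns/mgmt/bypass_resolver.go, but the package and function names are illustrative and the project's nbnet.NewDialer() (which applies the control-plane socket mark) is stood in for by a plain net.Dialer:

```go
// Sketch only: approximates the bypass resolver described above.
package mgmtsketch

import (
	"context"
	"fmt"
	"net"
	"net/netip"
)

// newBypassResolver builds a resolver whose Dial tries each original system
// nameserver on port 53 in order; it returns nil when no usable server exists,
// so the caller keeps serving from the stale-while-revalidate cache.
func newBypassResolver(nameservers []netip.Addr) *net.Resolver {
	var servers []string
	for _, ns := range nameservers {
		if !ns.IsValid() || ns.IsLoopback() || ns.IsUnspecified() {
			continue // skip addresses that cannot act as upstreams
		}
		servers = append(servers, net.JoinHostPort(ns.String(), "53"))
	}
	if len(servers) == 0 {
		return nil
	}
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := &net.Dialer{} // real code: nbnet.NewDialer(), which marks the socket as control-plane
			var lastErr error
			for _, srv := range servers {
				conn, err := d.DialContext(ctx, network, srv)
				if err == nil {
					return conn, nil
				}
				lastErr = err
			}
			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
		},
	}
}
```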

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features

    • On-demand resolution for instance subdomains under relay domains with deduplicated lookups and cached A/AAAA answers for faster responses.
    • Configurable bypass resolver to route DNS queries directly to specified upstream nameservers with robust dialing and failover behavior.
  • Bug Fixes

    • Improved handling and tracking of relay “pool-root” domains during config updates to avoid stale registrations.
    • More reliable nameserver fallback, tracking, and failover when upstreams change or are absent.

pappz added 2 commits April 24, 2026 17:40
When an exit-node peer's network-map installs a 0.0.0.0/0 default route
on the overlay interface before that peer's WireGuard key material is
active, any UDP socket dialing an off-link address is routed into wt0
and the kernel returns ENOKEY.

Two places needed fixing:

 1. The mgmt cache refresh path. It reactively refreshes the
    control-plane FQDNs advertised by the mgmt (api/signal/stun/turn/
    the Relay pool root) after the daemon has installed its own
    resolv.conf pointing at the overlay listener. Previously the
    refresh dial followed the chain's upstream handler, which followed
    the overlay default route and deadlocked on ENOKEY.

 2. Foreign relay FQDN resolution. When a remote peer is homed on a
    different relay instance than us, we need to resolve a streamline-*
    subdomain that is not in the cache. That lookup went through the
    same overlay-routed upstream and failed identically, deadlocking
    the exit-node test whenever the relay LB put the two peers on
    different instances.

Fix both by giving the mgmt cache a dedicated net.Resolver that dials
the original pre-NetBird system nameservers through nbnet.NewDialer.
The dialer marks the socket as control-plane (SO_MARK on Linux,
IP_BOUND_IF on darwin, IP_UNICAST_IF on Windows); the routemanager's
policy rules keep those sockets on the underlay regardless of the
overlay default.

Pool-root domains (the Relay entries in ServerDomains) now register
through a subdomain-matching wrapper so that instance subdomains like
streamline-de-fra1-0.relay.netbird.io also hit the mgmt cache handler.
On cache miss under a pool root, ServeDNS resolves the FQDN on demand
through the bypass resolver, caches the result, and returns it.

Pool-root membership is derived dynamically from mgmt-advertised
ServerDomains.Relay[] — no hardcoded domain lists, no protocol change.
No hardcoded fallback nameservers: if the host had no original system
resolver at all, the bypass resolver stays nil and the stale-while-
revalidate cache keeps serving. The general upstream forwarder and
the user DNS path are unchanged.
The subdomain-matching handler chain entry only helps when the OS DNS
manager routes the query to the daemon's listener. On systems using
systemd-resolved (Ubuntu 20.04 / 22.04, most modern Debian/RHEL-family
distros), the daemon delegates a *specific set* of match domains to
the wt0 link via D-Bus. Any FQDN outside that set is answered from
the host's global upstream DNS, never touching the handler chain.

With only 'netbird.cloud' and the in-addr.arpa reverse zone advertised
as match domains, a relay instance FQDN like
streamline-de-fra1-0.relay.netbird.io is resolved globally. When an
exit-node default route is active on wt0 but the peer is not yet
live, that global lookup times out on the local stub (127.0.0.53)
because systemd-resolved's uplink DNS is unreachable through the
overlay, and the foreign-relay dial fails exactly the way the
ENOKEY path did on file-manager hosts.

Bump the pool-root domains (ServerDomains.Relay entries) into
extraDomains so applyHostConfig adds them as match-only domains
alongside 'netbird.cloud'. systemd-resolved then delegates
*.relay.netbird.io to the wt0 link, queries hit the daemon's DNS
listener, and the on-demand resolve path via the bypass resolver
runs as intended.

Tracking map mgmtPoolRoots isolates this refcount from the
RegisterHandler path so updates across successive mgmt syncs
increment/decrement only the changed set.
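
To make the handler-chain piece concrete, here is a minimal sketch of a subdomain-matching wrapper in the spirit of subdomainMatchHandler. The field names and the fall-through behaviour are assumptions; the real wrapper lives in client/internal/dns/server.go and defers to the rest of the handler chain instead of failing the query:

```go
// Sketch only: a subdomain-matching wrapper around the mgmt cache handler.
package dnssketch

import (
	"strings"

	"github.com/miekg/dns"
)

// subdomainMatchHandler forwards a query to the wrapped handler when the
// question name equals the pool root or is any subdomain of it.
type subdomainMatchHandler struct {
	root    string      // canonical pool root, e.g. "relay.netbird.io."
	wrapped dns.Handler // the mgmt cache resolver
}

func (h *subdomainMatchHandler) ServeDNS(w dns.ResponseWriter, r *dns.Msg) {
	if len(r.Question) == 0 {
		dns.HandleFailed(w, r)
		return
	}
	name := strings.ToLower(r.Question[0].Name) // FQDN form, trailing dot included
	if name == h.root || strings.HasSuffix(name, "."+h.root) {
		// streamline-*.relay.netbird.io lands here: cache hit or on-demand
		// resolution through the bypass resolver.
		h.wrapped.ServeDNS(w, r)
		return
	}
	// Outside the pool root: this sketch simply fails the query; the real
	// handler chain would pass it on to the next registered handler.
	dns.HandleFailed(w, r)
}
```
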
@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough


Adds a bypass DNS resolver factory and integrates it into the management resolver and DNS server: on cache misses for relay pool-root subdomains, the mgmt resolver can perform singleflight-deduplicated on-demand bypass lookups and cache A/AAAA results, while the server config tracks and registers pool-root domains separately.

Changes

  • Bypass Resolver Implementation (client/internal/dns/mgmt/bypass_resolver.go): New NewBypassResolver(nameservers []netip.Addr) *net.Resolver. Filters invalid/loopback/unspecified IPs, returns nil if none; builds net.Resolver{PreferGo:true} with a custom Dial that sequentially dials <ip>:53 via the client dialer and returns the first successful connection or a wrapped last error.

  • Mgmt Resolver Configuration (client/internal/dns/mgmt/mgmt.go): Introduces Resolver.bypassResolver, public SetBypassResolver, and GetPoolRootDomains. osLookup, lookupBoth, and lookupRecords prefer the bypass resolver when set. ServeDNS detects relay pool-root subdomain queries, runs singleflight-deduped on-demand resolution, caches A/AAAA records, and writes responses on success.

  • DNS Server Integration (client/internal/dns/server.go): UpdateServerConfig registers pool-root domains using subdomain-matching wrappers and exact-match handlers for others; tracks mgmtPoolRoots counts to reconcile extraDomains. registerFallback records selected fallback nameservers and installs/clears the mgmt bypass resolver via SetBypassResolver(...) (see the sketch below).
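
A rough sketch of the registerFallback wiring summarized in the last item; the struct and interface here are stand-ins rather than the real server types:

```go
// Sketch only: how registerFallback can install or clear the bypass resolver.
package serversketch

import (
	"net"
	"net/netip"
)

// mgmtResolver is a stand-in for the mgmt cache resolver's public surface.
type mgmtResolver interface {
	SetBypassResolver(r *net.Resolver)
}

// server is a stand-in for the DNS server fields touched by registerFallback.
type server struct {
	fallbackNameservers []netip.Addr
	mgmtCacheResolver   mgmtResolver
}

// registerFallback records the selected fallback nameservers and installs
// (or clears) the mgmt bypass resolver, mirroring the summary above.
func (s *server) registerFallback(nameservers []netip.Addr, newBypass func([]netip.Addr) *net.Resolver) {
	s.fallbackNameservers = nameservers
	if s.mgmtCacheResolver == nil {
		return
	}
	// newBypass returns nil when no usable nameserver exists; passing nil
	// clears the bypass and leaves the stale-while-revalidate cache in charge.
	s.mgmtCacheResolver.SetBypassResolver(newBypass(nameservers))
}
```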

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client as Client
    participant Handler as DNS Handler
    participant Mgmt as Mgmt Cache Resolver
    participant Bypass as Bypass Resolver
    participant NS as Remote Nameserver

    Client->>Handler: DNS query (pool-root subdomain)
    Handler->>Mgmt: ServeDNS (cache miss)
    Mgmt->>Mgmt: Detect pool-root subdomain
    Mgmt->>Bypass: lookupRecords (singleflight)
    Bypass->>NS: Dial <ip>:53 (try list sequentially)
    alt Dial success
        NS-->>Bypass: TCP/UDP conn established
        Bypass->>NS: DNS query
        NS-->>Bypass: DNS response
        Bypass-->>Mgmt: A/AAAA records
        Mgmt->>Mgmt: Cache records
        Mgmt-->>Handler: DNS response
    else All dials fail
        Bypass-->>Mgmt: Dial error
        Mgmt-->>Handler: Defer to next handler / fallback
    end
    Handler-->>Client: DNS response or fallback

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • mlsmaycon

Poem

🐰 I hop through nameservers, one by one,
I try each gate until a query's done.
I cache the answers, swift and spry,
Subdomains found beneath the sky.
Small paws, neat routes — DNS done! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title clearly and concisely summarizes the main change: fixing a mgmt cache bypass overlay issue that addresses DNS resolution deadlocks in exit-node scenarios.
  • Description check ✅ Passed: The description is comprehensive and well-structured, covering the problem, solution, and implementation details across multiple subsections, with the bug fix checkbox marked.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@pappz pappz changed the title from "Fix mgmt cache bypass overlay" to "[client] Fix mgmt cache bypass overlay" on Apr 24, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
client/internal/dns/mgmt/bypass_resolver.go (1)

39-53: Consider a per-server timeout to prevent one slow nameserver from exhausting the parent context deadline.

The net.Resolver with PreferGo: true calls Dial once per server without built-in per-server timeout protection. Sequential iteration means an unreachable or blackholed nameserver can consume the entire ctx deadline before the next one is attempted, defeating the point of failover. A short per-server timeout (e.g., 2–3 seconds) derived from the parent context preserves failover semantics while keeping one slow resolver from blocking the others. Additionally, nbnet.NewDialer() can be hoisted outside the closure since net.Dialer is safe for concurrent reuse and this avoids redundant allocations on every call.

♻️ Suggested refinement
-	return &net.Resolver{
-		PreferGo: true,
-		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
-			nbDialer := nbnet.NewDialer()
-			var lastErr error
-			for _, ns := range servers {
-				conn, err := nbDialer.DialContext(ctx, network, ns)
-				if err == nil {
-					return conn, nil
-				}
-				lastErr = err
-			}
-			if lastErr == nil {
-				return nil, fmt.Errorf("no bypass nameservers configured")
-			}
-			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
-		},
-	}
+	nbDialer := nbnet.NewDialer()
+	const perServerTimeout = 2 * time.Second
+	return &net.Resolver{
+		PreferGo: true,
+		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
+			var lastErr error
+			for _, ns := range servers {
+				dialCtx, cancel := context.WithTimeout(ctx, perServerTimeout)
+				conn, err := nbDialer.DialContext(dialCtx, network, ns)
+				cancel()
+				if err == nil {
+					return conn, nil
+				}
+				lastErr = err
+				if ctx.Err() != nil {
+					break
+				}
+			}
+			if lastErr == nil {
+				return nil, fmt.Errorf("no bypass nameservers configured")
+			}
+			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
+		},
+	}

(Requires "time" import.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/bypass_resolver.go` around lines 39 - 53, The Dial
closure currently creates a new nbnet.Dialer per call and iterates servers
without per-server timeouts, allowing a slow nameserver to exhaust the parent
ctx; hoist nbnet.NewDialer() outside the Dial func for reuse and, inside the
Dial implementation (the anonymous func assigned to Dial), for each ns create a
child context with a short timeout (e.g., 2–3s) via context.WithTimeout(ctx,
timeout), use that child context when calling nbDialer.DialContext, ensure you
call the cancel func to avoid leaks, collect the last error as before, and
return a wrapped error if all servers fail; keep the existing messages ("no
bypass nameservers configured" and "dial bypass nameservers: %w") and preserve
the servers iteration and lastErr handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 227-268: resolveOnDemand currently allows N concurrent upstream
lookups because the TOCTOU cache write only dedups writes; wrap the
lookupRecords call with the Resolver's refreshGroup singleflight by calling
m.refreshGroup.Do with a namespaced key (e.g.
"ondemand:"+question.Name+":"+strconv.Itoa(int(question.Qtype))) so concurrent
resolveOnDemand callers collapse into one lookup; inside the Do function perform
lookupRecords(ctx, d, question) and return the records or error, then outside
the Do retrieve the shared result (type-assert to the records slice), handle
errors exactly as before, and perform the single cache write into m.records (use
the same exists check + mutex) and reply to the client—this prevents parallel
upstream queries while preserving existing caching and error paths for
resolveOnDemand, lookupRecords, m.records, and refreshGroup.

In `@client/internal/dns/server.go`:
- Around line 635-654: Canonicalize pool root entries before building
poolRootSet so they match the form returned by
s.mgmtCacheResolver.GetCachedDomains(): call the same normalization used
elsewhere (e.g., toZone(d) / dns.Fqdn(strings.ToLower(...))) when iterating
s.mgmtCacheResolver.GetPoolRootDomains() and use that value as the map key; then
the subsequent loop that checks if _, isPool := poolRootSet[d]; isPool will
correctly dedupe and avoid appending pool roots into exactDomains, preserving
the earlier subdomainMatchHandler registration (affecting poolRootSet,
GetPoolRootDomains, GetCachedDomains, toZone, registerHandler,
subdomainMatchHandler, PriorityMgmtCache).

---

Nitpick comments:
In `@client/internal/dns/mgmt/bypass_resolver.go`:
- Around line 39-53: The Dial closure currently creates a new nbnet.Dialer per
call and iterates servers without per-server timeouts, allowing a slow
nameserver to exhaust the parent ctx; hoist nbnet.NewDialer() outside the Dial
func for reuse and, inside the Dial implementation (the anonymous func assigned
to Dial), for each ns create a child context with a short timeout (e.g., 2–3s)
via context.WithTimeout(ctx, timeout), use that child context when calling
nbDialer.DialContext, ensure you call the cancel func to avoid leaks, collect
the last error as before, and return a wrapped error if all servers fail; keep
the existing messages ("no bypass nameservers configured" and "dial bypass
nameservers: %w") and preserve the servers iteration and lastErr handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b3835e7e-375f-4ce6-ab1f-b093ea24c535

📥 Commits

Reviewing files that changed from the base of the PR and between d6f08e4 and 14be474.

📒 Files selected for processing (3)
  • client/internal/dns/mgmt/bypass_resolver.go
  • client/internal/dns/mgmt/mgmt.go
  • client/internal/dns/server.go

Comment thread client/internal/dns/mgmt/mgmt.go
Comment thread client/internal/dns/server.go Outdated
pappz added 3 commits April 24, 2026 22:33
GetPoolRootDomains returns what the extractor stored; GetCachedDomains
strips the trailing dot from question names. Today those happen to
produce the same domain.Domain string because both sources run through
domain.FromString, but relying on that parity is fragile.

Run both sides of the poolRootSet membership check through toZone so
the comparison is independent of each source's canonical form.
Prevents a future extractor or cache-entry path from silently routing
a pool-root into exactDomains, where a second registerHandler call
with the bare resolver would clobber the subdomainMatchHandler
registration (AddHandler replaces entries with the same pattern +
priority).
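
A minimal sketch of the idea, assuming a toZone-like canonicalization (the real toZone also runs nbdns.NormalizeZone and IDNA handling, which are omitted here):

```go
// Sketch only: canonicalize both sides of the pool-root membership check.
package zonesketch

import (
	"strings"

	"github.com/miekg/dns"
)

// toZone approximates the canonical form: lowercase, strip a leading "*."
// wildcard, ensure a trailing dot.
func toZone(d string) string {
	return dns.Fqdn(strings.TrimPrefix(strings.ToLower(d), "*."))
}

// isPoolRoot runs both the extractor-provided roots and the candidate domain
// through the same canonical form, so neither source can silently diverge.
func isPoolRoot(poolRoots []string, candidate string) bool {
	set := make(map[string]struct{}, len(poolRoots))
	for _, r := range poolRoots {
		set[toZone(r)] = struct{}{}
	}
	_, ok := set[toZone(candidate)]
	return ok
}
```
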
resolveOnDemand deduplicated the cache write but not the upstream
lookup — a burst of concurrent queries for a freshly-learned pool-root
subdomain (A + AAAA from a single getaddrinfo, or multiple peer
workers dialing the same foreign relay) each hit the bypass resolver
independently.

Wrap the lookupRecords call in refreshGroup.Do with an 'ondemand:'-
prefixed key so concurrent callers collapse into one request; the
prefix namespaces this key off scheduleRefresh's keyspace so the two
paths don't collide.

Error and empty-result paths preserved; the cache write still runs
under the existing TOCTOU-safe exists check + mutex so a winning
refresh isn't clobbered.
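
A sketch of the collapse using golang.org/x/sync/singleflight; the surrounding type is a simplified stand-in for the mgmt Resolver, and the caching step is left out:

```go
// Sketch only: singleflight-deduplicated on-demand resolution.
package mgmtsketch

import (
	"context"
	"strconv"

	"github.com/miekg/dns"
	"golang.org/x/sync/singleflight"
)

// onDemandResolver stands in for the parts of the Resolver that resolveOnDemand touches.
type onDemandResolver struct {
	refreshGroup singleflight.Group
	lookup       func(ctx context.Context, q dns.Question) ([]dns.RR, error)
}

// resolveOnDemand collapses concurrent lookups for the same name/qtype into a
// single upstream query. The "ondemand:" prefix keeps these keys out of
// scheduleRefresh's keyspace, which shares the same singleflight group.
func (r *onDemandResolver) resolveOnDemand(ctx context.Context, q dns.Question) ([]dns.RR, error) {
	key := "ondemand:" + q.Name + ":" + strconv.Itoa(int(q.Qtype))
	v, err, _ := r.refreshGroup.Do(key, func() (interface{}, error) {
		return r.lookup(ctx, q)
	})
	if err != nil {
		return nil, err
	}
	records, _ := v.([]dns.RR)
	return records, nil
}
```
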
UpdateServerConfig was at 31; SonarCloud threshold is 20. The deep
nesting came from wrapping the whole body in the mgmtCacheResolver
nil-check and stacking five for-loops + conditionals inside it.

Invert to an early return when the resolver is nil, then split the
remaining work into two focused helpers:

- registerMgmtCacheHandlers: handler-chain wiring (pool-root
  subdomain wrapper + exact-match registration) with the toZone-
  canonicalized membership check.
- reconcileMgmtPoolRoots: extraDomains refcounting against the
  current pool-root set, plus the mgmtPoolRoots tracking map.

Pure refactor — behaviour is identical.
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 205-222: isUnderPoolRoot currently lowercases and trims trailing
dots but does not apply the same canonicalization used during handler
registration; update isUnderPoolRoot to call the shared toZone (which uses
nbdns.NormalizeZone to strip wildcard prefixes) on both fqdn and each relay root
before comparing so membership checks match registerMgmtCacheHandlers
canonicalization; locate Resolver.isUnderPoolRoot and replace the inline
normalization with calls to toZone (or inline the same NormalizeZone logic if
toZone is not directly accessible) ensuring you handle empty/invalid zones the
same way registerMgmtCacheHandlers does.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 28d56ddb-fe6c-4d39-a186-ff0b8c377058

📥 Commits

Reviewing files that changed from the base of the PR and between 14be474 and b0b52b6.

📒 Files selected for processing (2)
  • client/internal/dns/mgmt/mgmt.go
  • client/internal/dns/server.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/server.go

Comment thread client/internal/dns/mgmt/mgmt.go
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
client/internal/dns/server.go (1)

424-442: mgmtPoolRoots not cleared on Stop.

Stop() calls maps.Clear(s.extraDomains) but does not clear s.mgmtPoolRoots. If the server is ever restarted in-place (not typical but the field outlives a Stop), the next reconcileMgmtPoolRoots will decrement counts in a freshly-cleared extraDomains map, producing negative values which then get deleted on the next tick — not a correctness bug today, but the two tracking maps intentionally mirror each other per the comment on line 120-124, so keeping them in sync on Stop matches that intent.

🧹 Proposed cleanup
 	maps.Clear(s.extraDomains)
+	maps.Clear(s.mgmtPoolRoots)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server.go` around lines 424 - 442, The Stop() method
clears s.extraDomains but forgets to clear s.mgmtPoolRoots, leaving the two
tracking maps out of sync; update Stop() to also clear s.mgmtPoolRoots (same
place you call maps.Clear(s.extraDomains)) so that s.mgmtPoolRoots is reset
before any potential restart and before reconcileMgmtPoolRoots runs, ensuring
both maps mirror each other as intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/server.go`:
- Around line 424-442: The Stop() method clears s.extraDomains but forgets to
clear s.mgmtPoolRoots, leaving the two tracking maps out of sync; update Stop()
to also clear s.mgmtPoolRoots (same place you call maps.Clear(s.extraDomains))
so that s.mgmtPoolRoots is reset before any potential restart and before
reconcileMgmtPoolRoots runs, ensuring both maps mirror each other as intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6b74075e-7950-4395-a322-ec6a272842a7

📥 Commits

Reviewing files that changed from the base of the PR and between b0b52b6 and 12d0eda.

📒 Files selected for processing (1)
  • client/internal/dns/server.go

isUnderPoolRoot lowercased and trimmed the trailing dot, but did not
strip a leading "*." wildcard the way server.toZone does via
nbdns.NormalizeZone. If a mgmt-advertised Relay URL ever comes through
as "*.relay.netbird.io", the handler-chain registration side strips
the wildcard (toZone) but the membership check here would keep it,
so HasSuffix(".*.relay.netbird.io") would never match legitimate
instance subdomains and on-demand resolves would not fire.

Today the extractor lowercases + IDNA-normalizes URLs and rejects the
wildcard form, so the divergence is latent. Close it anyway by running
both sides of the membership check through a shared
canonicalizePoolDomain helper that mirrors toZone's transformation
(modulo trailing-dot orientation, which is self-consistent within
this function). toZone itself lives in the parent dns package and
cannot be imported here without a cycle.
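
A minimal sketch of the shared helper and the membership check, using the names from this commit (the real code lives in client/internal/dns/mgmt/mgmt.go and may differ in detail):

```go
// Sketch only: shared canonicalization plus the pool-root suffix check.
package mgmtsketch

import "strings"

// canonicalizePoolDomain mirrors toZone's transformation without importing
// the parent dns package: lowercase, strip the trailing dot, strip a
// leading "*." wildcard.
func canonicalizePoolDomain(d string) string {
	d = strings.ToLower(strings.TrimSuffix(d, "."))
	return strings.TrimPrefix(d, "*.")
}

// isUnderPoolRoot reports whether fqdn is a relay pool root or one of its
// subdomains, with both sides in the same canonical form.
func isUnderPoolRoot(fqdn string, relayRoots []string) bool {
	name := canonicalizePoolDomain(fqdn)
	for _, root := range relayRoots {
		root = canonicalizePoolDomain(root)
		if root == "" {
			continue
		}
		if name == root || strings.HasSuffix(name, "."+root) {
			return true
		}
	}
	return false
}
```
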
@sonarqubecloud

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)

466-518: Avoid double-reading bypassResolver; pass it into osLookup.

lookupBoth / lookupRecords read m.bypassResolver to pick a branch, and then osLookup (lines 569-574) reads it again under its own RLock. Two small downsides:

  1. If SetBypassResolver(nil) ever races between the outer read and the inner read, the OS path silently runs against net.DefaultResolver without the caller arming the loop detector (markRefreshing) — the branch predicate and the resolver chosen can disagree. In practice SetBypassResolver is called once at setup so this is theoretical, but the invariant is easy to break later.
  2. Redundant RLock on the hot path.

Thread the already-read resolver into osLookup so the decision and the dial agree, and drop the second lock:

♻️ Proposed refactor
-func (m *Resolver) osLookup(ctx context.Context, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
+func (m *Resolver) osLookup(ctx context.Context, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
+	m.mutex.RLock()
+	resolver := m.bypassResolver
+	m.mutex.RUnlock()
+	if resolver == nil {
+		resolver = net.DefaultResolver
+	}
+	return m.osLookupWith(ctx, resolver, d, dnsName, qtype)
+}
+
+func (m *Resolver) osLookupWith(ctx context.Context, resolver *net.Resolver, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
 	network := resutil.NetworkForQtype(qtype)
 	if network == "" {
 		return nil, fmt.Errorf("unsupported qtype %s", dns.TypeToString[qtype])
 	}
 	log.Infof("looking up IP for mgmt domain=%s type=%s", d.SafeString(), dns.TypeToString[qtype])
 	defer log.Infof("done looking up IP for mgmt domain=%s type=%s", d.SafeString(), dns.TypeToString[qtype])
-	m.mutex.RLock()
-	resolver := m.bypassResolver
-	m.mutex.RUnlock()
-	if resolver == nil {
-		resolver = net.DefaultResolver
-	}
 	result := resutil.LookupIP(ctx, resolver, network, d.PunycodeString(), qtype)
 	...
 }

Then in lookupBoth / lookupRecords, switch the if bypass != nil branches to call osLookupWith(ctx, bypass, ...) directly (and keep the existing osLookup(...) calls on the legacy OS-fallback path so they arm markRefreshing as today).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/mgmt/mgmt.go` around lines 466 - 518, lookupBoth and
lookupRecords read m.bypassResolver then call osLookup which re-reads
m.bypassResolver under its own RLock, risking a race and extra locking; refactor
by adding an osLookupWith(ctx, bypassResolver, ...) (or a bypass parameter to
osLookup) that uses the already-read bypass resolver instead of re-locking,
update lookupBoth and lookupRecords to call osLookupWith when bypass != nil,
remove the redundant RLock/read inside osLookupWhenProvided (or branch inside
osLookup to skip the second lock when a non-nil bypass parameter is supplied),
and ensure callers that rely on the legacy behavior (the OS-fallback path that
needs markRefreshing) continue to call the original osLookup so markRefreshing
semantics are preserved; update references to osLookup, osLookupWith,
lookupBoth, lookupRecords and SetBypassResolver accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 466-518: lookupBoth and lookupRecords read m.bypassResolver then
call osLookup which re-reads m.bypassResolver under its own RLock, risking a
race and extra locking; refactor by adding an osLookupWith(ctx, bypassResolver,
...) (or a bypass parameter to osLookup) that uses the already-read bypass
resolver instead of re-locking, update lookupBoth and lookupRecords to call
osLookupWith when bypass != nil, remove the redundant RLock/read inside
osLookupWhenProvided (or branch inside osLookup to skip the second lock when a
non-nil bypass parameter is supplied), and ensure callers that rely on the
legacy behavior (the OS-fallback path that needs markRefreshing) continue to
call the original osLookup so markRefreshing semantics are preserved; update
references to osLookup, osLookupWith, lookupBoth, lookupRecords and
SetBypassResolver accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 623e1848-d1aa-425e-a017-e35467008070

📥 Commits

Reviewing files that changed from the base of the PR and between 12d0eda and 3cdfa11.

📒 Files selected for processing (1)
  • client/internal/dns/mgmt/mgmt.go
