Conversation
When an exit-node peer's network-map installs a 0.0.0.0/0 default route
on the overlay interface before that peer's WireGuard key material is
active, any UDP socket dialing an off-link address is routed into wt0
and the kernel returns ENOKEY.
Two places needed fixing:
1. The mgmt cache refresh path. It reactively refreshes the
control-plane FQDNs advertised by the mgmt (api/signal/stun/turn/
the Relay pool root) after the daemon has installed its own
resolv.conf pointing at the overlay listener. Previously the
refresh dial followed the chain's upstream handler, which followed
the overlay default route and deadlocked on ENOKEY.
2. Foreign relay FQDN resolution. When a remote peer is homed on a
different relay instance than us, we need to resolve a streamline-*
subdomain that is not in the cache. That lookup went through the
same overlay-routed upstream and failed identically, deadlocking
the exit-node test whenever the relay LB put the two peers on
different instances.
Fix both by giving the mgmt cache a dedicated net.Resolver that dials
the original pre-NetBird system nameservers through nbnet.NewDialer.
The dialer marks the socket as control-plane (SO_MARK on Linux,
IP_BOUND_IF on darwin, IP_UNICAST_IF on Windows); the routemanager's
policy rules keep those sockets on the underlay regardless of the
overlay default.
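For illustration, a minimal sketch of that resolver; `servers` is assumed to hold the captured pre-NetBird nameservers as `ip:53` strings, and the nbnet import path is assumed from the NetBird tree (the helper name is illustrative):

```go
package dnsbypass

import (
	"context"
	"fmt"
	"net"

	nbnet "github.com/netbirdio/netbird/util/net"
)

// newBypassResolver returns a resolver whose Dial ignores the address the
// Go resolver derived from the daemon's own resolv.conf and instead dials
// the captured pre-NetBird nameservers on a control-plane-marked socket.
func newBypassResolver(servers []string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // force the pure-Go resolver so our Dial is actually used
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			nbDialer := nbnet.NewDialer() // applies the platform socket mark
			var lastErr error
			for _, ns := range servers {
				conn, err := nbDialer.DialContext(ctx, network, ns)
				if err == nil {
					return conn, nil
				}
				lastErr = err
			}
			if lastErr == nil {
				return nil, fmt.Errorf("no bypass nameservers configured")
			}
			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
		},
	}
}
```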
Pool-root domains (the Relay entries in ServerDomains) now register
through a subdomain-matching wrapper so that instance subdomains like
streamline-de-fra1-0.relay.netbird.io also hit the mgmt cache handler.
On cache miss under a pool root, ServeDNS resolves the FQDN on demand
through the bypass resolver, caches the result, and returns it.
Pool-root membership is derived dynamically from mgmt-advertised
ServerDomains.Relay[] — no hardcoded domain lists, no protocol change.
No hardcoded fallback nameservers: if the host had no original system
resolver at all, the bypass resolver stays nil and the stale-while-
revalidate cache keeps serving. The general upstream forwarder and
the user DNS path are unchanged.
The subdomain-matching handler chain entry only helps when the OS DNS manager routes the query to the daemon's listener. On systems using systemd-resolved (Ubuntu 20.04 / 22.04, most modern Debian/RHEL-family distros), the daemon delegates a *specific set* of match domains to the wt0 link via D-Bus. Any FQDN outside that set is answered from the host's global upstream DNS, never touching the handler chain.

With only 'netbird.cloud' and the in-addr.arpa reverse zone advertised as match domains, a relay instance FQDN like streamline-de-fra1-0.relay.netbird.io is resolved globally. When an exit-node default route is active on wt0 but the peer is not yet live, that global lookup times out on the local stub (127.0.0.53) because systemd-resolved's uplink DNS is unreachable through the overlay, and the foreign-relay dial fails exactly the way the ENOKEY path did on file-manager hosts.

Bump the pool-root domains (ServerDomains.Relay entries) into extraDomains so applyHostConfig adds them as match-only domains alongside 'netbird.cloud'. systemd-resolved then delegates *.relay.netbird.io to the wt0 link, queries hit the daemon's DNS listener, and the on-demand resolve path via the bypass resolver runs as intended.

Tracking map mgmtPoolRoots isolates this refcount from the RegisterHandler path, so updates across successive mgmt syncs increment/decrement only the changed set.
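A sketch of that reconciliation, with field shapes assumed from the prose rather than from the actual code: `extraDomains` as a refcount map, `mgmtPoolRoots` as the set of refs this path owns.

```go
package dnssrv

// Sketch only: the real server type carries many more fields and locking.
type server struct {
	extraDomains  map[string]int      // refcounted match-only domains for the OS DNS manager
	mgmtPoolRoots map[string]struct{} // which refs the pool-root path owns
}

func (s *server) reconcileMgmtPoolRoots(current map[string]struct{}) {
	// Take a ref for roots that appeared in this mgmt sync.
	for root := range current {
		if _, tracked := s.mgmtPoolRoots[root]; !tracked {
			s.mgmtPoolRoots[root] = struct{}{}
			s.extraDomains[root]++
		}
	}
	// Release the ref for roots that disappeared since the last sync.
	for root := range s.mgmtPoolRoots {
		if _, still := current[root]; still {
			continue
		}
		delete(s.mgmtPoolRoots, root)
		s.extraDomains[root]--
		if s.extraDomains[root] <= 0 {
			delete(s.extraDomains, root)
		}
	}
}
```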
📝 Walkthrough

Adds a bypass DNS resolver factory and integrates it into the management resolver and DNS server: on cache misses for relay pool-root subdomains, the mgmt resolver can perform singleflight on-demand bypass lookups and cache A/AAAA results, and the server config tracks/registers pool-root domains separately.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client as Client
    participant Handler as DNS Handler
    participant Mgmt as Mgmt Cache Resolver
    participant Bypass as Bypass Resolver
    participant NS as Remote Nameserver
    Client->>Handler: DNS query (pool-root subdomain)
    Handler->>Mgmt: ServeDNS (cache miss)
    Mgmt->>Mgmt: Detect pool-root subdomain
    Mgmt->>Bypass: lookupRecords (singleflight)
    Bypass->>NS: Dial <ip>:53 (try list sequentially)
    alt Dial success
        NS-->>Bypass: TCP/UDP conn established
        Bypass->>NS: DNS query
        NS-->>Bypass: DNS response
        Bypass-->>Mgmt: A/AAAA records
        Mgmt->>Mgmt: Cache records
        Mgmt-->>Handler: DNS response
    else All dials fail
        Bypass-->>Mgmt: Dial error
        Mgmt-->>Handler: Defer to next handler / fallback
    end
    Handler-->>Client: DNS response or fallback
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
client/internal/dns/mgmt/bypass_resolver.go (1)
39-53: Consider a per-server timeout to prevent one slow nameserver from exhausting the parent context deadline.

The `net.Resolver` with `PreferGo: true` calls `Dial` once per server without built-in per-server timeout protection. Sequential iteration means an unreachable or blackholed nameserver can consume the entire `ctx` deadline before the next one is attempted, defeating the point of failover. A short per-server timeout (e.g., 2–3 seconds) derived from the parent context preserves failover semantics while keeping one slow resolver from blocking the others. Additionally, `nbnet.NewDialer()` can be hoisted outside the closure, since `net.Dialer` is safe for concurrent reuse and this avoids redundant allocations on every call.

♻️ Suggested refinement
```diff
-	return &net.Resolver{
-		PreferGo: true,
-		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
-			nbDialer := nbnet.NewDialer()
-			var lastErr error
-			for _, ns := range servers {
-				conn, err := nbDialer.DialContext(ctx, network, ns)
-				if err == nil {
-					return conn, nil
-				}
-				lastErr = err
-			}
-			if lastErr == nil {
-				return nil, fmt.Errorf("no bypass nameservers configured")
-			}
-			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
-		},
-	}
+	nbDialer := nbnet.NewDialer()
+	const perServerTimeout = 2 * time.Second
+	return &net.Resolver{
+		PreferGo: true,
+		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
+			var lastErr error
+			for _, ns := range servers {
+				dialCtx, cancel := context.WithTimeout(ctx, perServerTimeout)
+				conn, err := nbDialer.DialContext(dialCtx, network, ns)
+				cancel()
+				if err == nil {
+					return conn, nil
+				}
+				lastErr = err
+				if ctx.Err() != nil {
+					break
+				}
+			}
+			if lastErr == nil {
+				return nil, fmt.Errorf("no bypass nameservers configured")
+			}
+			return nil, fmt.Errorf("dial bypass nameservers: %w", lastErr)
+		},
+	}
```

(Requires `time` import.)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 227-268: resolveOnDemand currently allows N concurrent upstream
lookups because the TOCTOU cache write only dedups writes; wrap the
lookupRecords call with the Resolver's refreshGroup singleflight by calling
m.refreshGroup.Do with a namespaced key (e.g.
"ondemand:"+question.Name+":"+strconv.Itoa(int(question.Qtype))) so concurrent
resolveOnDemand callers collapse into one lookup; inside the Do function perform
lookupRecords(ctx, d, question) and return the records or error, then outside
the Do retrieve the shared result (type-assert to the records slice), handle
errors exactly as before, and perform the single cache write into m.records (use
the same exists check + mutex) and reply to the client—this prevents parallel
upstream queries while preserving existing caching and error paths for
resolveOnDemand, lookupRecords, m.records, and refreshGroup.
In `@client/internal/dns/server.go`:
- Around line 635-654: Canonicalize pool root entries before building
poolRootSet so they match the form returned by
s.mgmtCacheResolver.GetCachedDomains(): call the same normalization used
elsewhere (e.g., toZone(d) / dns.Fqdn(strings.ToLower(...))) when iterating
s.mgmtCacheResolver.GetPoolRootDomains() and use that value as the map key; then
the subsequent loop that checks if _, isPool := poolRootSet[d]; isPool will
correctly dedupe and avoid appending pool roots into exactDomains, preserving
the earlier subdomainMatchHandler registration (affecting poolRootSet,
GetPoolRootDomains, GetCachedDomains, toZone, registerHandler,
subdomainMatchHandler, PriorityMgmtCache).
GetPoolRootDomains returns what the extractor stored; GetCachedDomains strips the trailing dot from question names. Today those happen to produce the same domain.Domain string because both sources run through domain.FromString, but relying on that parity is fragile. Run both sides of the poolRootSet membership check through toZone so the comparison is independent of each source's canonical form. Prevents a future extractor or cache-entry path from silently routing a pool-root into exactDomains, where a second registerHandler call with the bare resolver would clobber the subdomainMatchHandler registration (AddHandler replaces entries with the same pattern + priority).
resolveOnDemand deduplicated the cache write but not the upstream lookup — a burst of concurrent queries for a freshly-learned pool-root subdomain (A + AAAA from a single getaddrinfo, or multiple peer workers dialing the same foreign relay) would each hit the bypass resolver independently. Wrap the lookupRecords call in refreshGroup.Do with an "ondemand:"-prefixed key so concurrent callers collapse into one request; the prefix namespaces this key off scheduleRefresh's keyspace so the two paths don't collide. Error and empty-result paths are preserved; the cache write still runs under the existing TOCTOU-safe exists check + mutex so a winning refresh isn't clobbered.
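Shape of that dedup as a standalone sketch: `refreshGroup` is the singleflight group named above, and `lookupRecords` is passed in as a stand-in for the real resolver method.

```go
package mgmtcache

import (
	"context"
	"strconv"

	"github.com/miekg/dns"
	"golang.org/x/sync/singleflight"
)

type resolver struct {
	refreshGroup singleflight.Group
}

// dedupedLookup collapses concurrent on-demand lookups for the same
// (name, qtype) into a single upstream call. The "ondemand:" prefix keeps
// these keys out of the scheduled-refresh keyspace.
func (m *resolver) dedupedLookup(ctx context.Context, q dns.Question,
	lookupRecords func(context.Context, dns.Question) ([]dns.RR, error)) ([]dns.RR, error) {

	key := "ondemand:" + q.Name + ":" + strconv.Itoa(int(q.Qtype))
	v, err, _ := m.refreshGroup.Do(key, func() (any, error) {
		return lookupRecords(ctx, q)
	})
	if err != nil {
		return nil, err
	}
	return v.([]dns.RR), nil
}
```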
UpdateServerConfig's cognitive complexity was at 31; the SonarCloud threshold is 20. The deep nesting came from wrapping the whole body in the mgmtCacheResolver nil-check and stacking five for-loops + conditionals inside it. Invert to an early return when the resolver is nil, then split the remaining work into two focused helpers:

- registerMgmtCacheHandlers: handler-chain wiring (pool-root subdomain wrapper + exact-match registration) with the toZone-canonicalized membership check.
- reconcileMgmtPoolRoots: extraDomains refcounting against the current pool-root set, plus the mgmtPoolRoots tracking map.

Pure refactor — behaviour is identical.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/mgmt/mgmt.go`:
- Around line 205-222: isUnderPoolRoot currently lowercases and trims trailing
dots but does not apply the same canonicalization used during handler
registration; update isUnderPoolRoot to call the shared toZone (which uses
nbdns.NormalizeZone to strip wildcard prefixes) on both fqdn and each relay root
before comparing so membership checks match registerMgmtCacheHandlers
canonicalization; locate Resolver.isUnderPoolRoot and replace the inline
normalization with calls to toZone (or inline the same NormalizeZone logic if
toZone is not directly accessible) ensuring you handle empty/invalid zones the
same way registerMgmtCacheHandlers does.
🧹 Nitpick comments (1)
client/internal/dns/server.go (1)
424-442: `mgmtPoolRoots` not cleared on Stop.

`Stop()` calls `maps.Clear(s.extraDomains)` but does not clear `s.mgmtPoolRoots`. If the server is ever restarted in-place (not typical, but the field outlives a Stop), the next `reconcileMgmtPoolRoots` will decrement counts in a freshly-cleared `extraDomains` map, producing negative values which then get deleted on the next tick — not a correctness bug today, but the two tracking maps intentionally mirror each other per the comment on lines 120-124, so keeping them in sync on Stop matches that intent.

🧹 Proposed cleanup
```diff
 	maps.Clear(s.extraDomains)
+	maps.Clear(s.mgmtPoolRoots)
```
isUnderPoolRoot lowercased and trimmed the trailing dot, but did not
strip a leading "*." wildcard the way server.toZone does via
nbdns.NormalizeZone. If a mgmt-advertised Relay URL ever comes through
as "*.relay.netbird.io", the handler-chain registration side strips
the wildcard (toZone) but the membership check here would keep it,
so HasSuffix(".*.relay.netbird.io") would never match legitimate
instance subdomains and on-demand resolves would not fire.
Today the extractor lowercases + IDNA-normalizes URLs and rejects the
wildcard form, so the divergence is latent. Close it anyway by running
both sides of the membership check through a shared
canonicalizePoolDomain helper that mirrors toZone's transformation
(modulo trailing-dot orientation, which is self-consistent within
this function). toZone itself lives in the parent dns package and
cannot be imported here without a cycle.
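A sketch of the shared helper under those constraints (the real toZone also applies IDNA normalization, skipped here):

```go
package mgmtcache

import "strings"

// canonicalizePoolDomain mirrors the toZone transformation described above:
// lowercase, drop the trailing dot, and strip a leading "*." wildcard label.
func canonicalizePoolDomain(s string) string {
	s = strings.TrimSuffix(strings.ToLower(s), ".")
	return strings.TrimPrefix(s, "*.")
}

// isUnderPoolRoot reports whether fqdn equals a pool root or is a subdomain
// of one, with both sides run through the same canonical form.
func isUnderPoolRoot(fqdn string, roots []string) bool {
	name := canonicalizePoolDomain(fqdn)
	for _, root := range roots {
		r := canonicalizePoolDomain(root)
		if name == r || strings.HasSuffix(name, "."+r) {
			return true
		}
	}
	return false
}
```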
🧹 Nitpick comments (1)
client/internal/dns/mgmt/mgmt.go (1)
466-518: Avoid double-reading `bypassResolver`; pass it into `osLookup`.

`lookupBoth`/`lookupRecords` read `m.bypassResolver` to pick a branch, and then `osLookup` (lines 569-574) reads it again under its own RLock. Two small downsides:

- If `SetBypassResolver(nil)` ever races between the outer read and the inner read, the OS path silently runs against `net.DefaultResolver` without the caller arming the loop detector (`markRefreshing`) — the branch predicate and the resolver chosen can disagree. In practice `SetBypassResolver` is called once at setup, so this is theoretical, but the invariant is easy to break later.
- Redundant RLock on the hot path.

Thread the already-read resolver into `osLookup` so the decision and the dial agree, and drop the second lock:

♻️ Proposed refactor
```diff
-func (m *Resolver) osLookup(ctx context.Context, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
+func (m *Resolver) osLookup(ctx context.Context, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
+	m.mutex.RLock()
+	resolver := m.bypassResolver
+	m.mutex.RUnlock()
+	if resolver == nil {
+		resolver = net.DefaultResolver
+	}
+	return m.osLookupWith(ctx, resolver, d, dnsName, qtype)
+}
+
+func (m *Resolver) osLookupWith(ctx context.Context, resolver *net.Resolver, d domain.Domain, dnsName string, qtype uint16) ([]dns.RR, error) {
 	network := resutil.NetworkForQtype(qtype)
 	if network == "" {
 		return nil, fmt.Errorf("unsupported qtype %s", dns.TypeToString[qtype])
 	}
 	log.Infof("looking up IP for mgmt domain=%s type=%s", d.SafeString(), dns.TypeToString[qtype])
 	defer log.Infof("done looking up IP for mgmt domain=%s type=%s", d.SafeString(), dns.TypeToString[qtype])
-	m.mutex.RLock()
-	resolver := m.bypassResolver
-	m.mutex.RUnlock()
-	if resolver == nil {
-		resolver = net.DefaultResolver
-	}
 	result := resutil.LookupIP(ctx, resolver, network, d.PunycodeString(), qtype)
 	...
 }
```

Then in `lookupBoth`/`lookupRecords`, switch the `if bypass != nil` branches to call `osLookupWith(ctx, bypass, ...)` directly (and keep the existing `osLookup(...)` calls on the legacy OS-fallback path so they arm `markRefreshing` as today).



Describe your changes
Exit-Node Peer Routing Issue and Fix
Problem
When an exit-node peer installs a 0.0.0.0/0 default route on wt0 before its WireGuard peer entry is active, any socket dialing an off-link address via that route fails with kernel ENOKEY.
This deadlocks a specific path: when a remote peer is homed on a different relay instance than the local daemon, the daemon must dial a foreign relay FQDN (e.g. streamline-de-fra1-0.relay.netbird.io). That FQDN is not in the mgmt cache, so it falls through to the normal upstream handler — which takes the overlay-routed dial and fails. No DNS → no relay → no peer handshake → no DNS.
Deterministic, but only triggers when the relay load-balancer puts peers on different instances. Caused the intermittent exit-node e2e failures on rocky9 / debian-12.
Fix
mgmt.Resolver bypass mechanism: mgmt.Resolver gets a dedicated *net.Resolver whose Dial uses nbnet.NewDialer(). On Linux the socket gets SO_MARK=ControlPlaneMark, on darwin IP_BOUND_IF, on Windows IP_UNICAST_IF — the same bypass mechanism gRPC/STUN/TURN/relay-websocket already use. Policy routes keep it on the underlay interface.
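For the Linux leg specifically, the marking pattern looks roughly like this. The mark constant is a placeholder (NetBird defines its own), and `nbnet.NewDialer` wraps this plus the darwin/Windows variants.

```go
package dnsbypass

import (
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// controlPlaneMark is a placeholder fwmark value for illustration only.
const controlPlaneMark = 0x1BD00

// markedDialer returns a dialer whose sockets carry the control-plane
// fwmark, which the routemanager's policy rules match to keep the flow
// on the underlay despite the overlay default route.
func markedDialer() *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// Setting SO_MARK requires CAP_NET_ADMIN.
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_MARK, controlPlaneMark)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
}
```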
Original nameservers: Dialer targets the original pre-NetBird system nameservers, captured from the registerFallback flow. No hardcoded fallback IPs; if the host had no original resolver, the bypass resolver stays nil and the stale-while-revalidate cache serves.
Pool-root domains: Pool-root domains (ServerDomains.Relay[], e.g. relay.netbird.io) are registered through a thin subdomainMatchHandler wrapper, so instance subdomains (streamline-*.relay.netbird.io) route to the mgmt cache.
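A sketch of the wrapper idea against the miekg/dns handler interface; the real chain entry has NetBird's own handler type and fallthrough semantics, so the fields and behaviour here are illustrative.

```go
package dnssrv

import (
	"strings"

	"github.com/miekg/dns"
)

// subdomainMatchHandler forwards queries for a pool root and anything
// beneath it to the wrapped handler. root is stored canonical and fully
// qualified, e.g. "relay.netbird.io.".
type subdomainMatchHandler struct {
	root string
	next dns.Handler // the mgmt cache resolver
}

func (h *subdomainMatchHandler) ServeDNS(w dns.ResponseWriter, r *dns.Msg) {
	if len(r.Question) == 0 {
		return
	}
	name := strings.ToLower(r.Question[0].Name)
	// Match the root itself and any instance subdomain beneath it,
	// e.g. streamline-de-fra1-0.relay.netbird.io.
	if name == h.root || strings.HasSuffix(name, "."+h.root) {
		h.next.ServeDNS(w, r)
		return
	}
	// Non-matching names would fall through to the next chain entry here.
}
```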
Cache miss resolution: On cache miss under a pool root, ServeDNS resolves on demand via the bypass resolver, caches the result, and returns it.
OS DNS manager integration: Pool-roots are also pushed into extraDomains so applyHostConfig advertises them as match domains to systemd-resolved / NetworkManager / scutil. Without this, those OS DNS managers would answer the query from the host's global upstream, skipping the handler chain entirely — which is exactly what happened on ubuntu-20.04.
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Bug Fixes