[client] Drop DNS probes for passive health projection #5971
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Removed probe-driven upstream availability APIs and eager probing; introduced per-domain concurrent NS-group "race" resolution with per-upstream health snapshots, a periodic NS-group health projection (grace/streak rules and event emission), route-source wiring (selected + active), and corresponding test and mock updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant DNSServer
    participant RouteSources as RouteSources(selected, active)
    participant NSGroupA as NSGroupRaceA
    participant NSGroupB as NSGroupRaceB
    participant Upstream as UpstreamResolver
    Client->>DNSServer: DNS Query
    DNSServer->>RouteSources: selected(), active()
    DNSServer->>NSGroupA: start race (uses selected/active)
    DNSServer->>NSGroupB: start race (uses selected/active)
    alt NSGroupA wins
        NSGroupA->>Upstream: query upstream(s)
        Upstream-->>NSGroupA: Response + protocol (record UpstreamHealth)
        NSGroupA-->>DNSServer: Response
        DNSServer->>NSGroupB: cancel
        DNSServer-->>Client: Answer
    else NSGroupB wins
        NSGroupB->>Upstream: query upstream(s)
        Upstream-->>NSGroupB: Response + protocol (record UpstreamHealth)
        NSGroupB-->>DNSServer: Response
        DNSServer->>NSGroupA: cancel
        DNSServer-->>Client: Answer
    else All fail
        DNSServer-->>Client: SERVFAIL
    end
    DNSServer->>DNSServer: periodic refreshHealth tick
    DNSServer->>DNSServer: collect UpstreamHealth snapshots
    DNSServer->>DNSServer: projectNSGroupHealth -> emit WARNING/RECOVERY events
```
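To make the race model above concrete, here is a minimal Go sketch of concurrent NS-group resolution with loser cancellation. The names (`queryGroup`, `raceGroups`) are illustrative assumptions, not the PR's actual API:

```go
package dnsrace

import (
	"context"
	"errors"

	"github.com/miekg/dns"
)

// queryGroup stands in for one NS group's resolution attempt.
type queryGroup func(ctx context.Context, r *dns.Msg) (*dns.Msg, error)

// raceGroups runs every group concurrently; the first successful
// response wins and the shared context cancels the in-flight losers.
func raceGroups(ctx context.Context, r *dns.Msg, groups []queryGroup) (*dns.Msg, error) {
	raceCtx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the losers once a winner returns

	results := make(chan *dns.Msg, len(groups)) // buffered so losers never block
	for _, g := range groups {
		go func(g queryGroup) {
			// Copy the request so concurrent workers never mutate a shared
			// *dns.Msg (e.g. EDNS0 UDP size adjustments).
			msg, err := g(raceCtx, r.Copy())
			if err != nil {
				msg = nil
			}
			results <- msg
		}(g)
	}

	for range groups {
		if msg := <-results; msg != nil {
			return msg, nil
		}
	}
	return nil, errors.New("all NS groups failed") // caller answers SERVFAIL
}
```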
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/upstream_ios.go (1)
48-79: ⚠️ Potential issue | 🟠 Major

Propagate `ctx` so iOS race losers can be cancelled.

Line 79 still passes `nil` into `ExchangeWithFallback`, so the new race cancellation cannot interrupt in-flight iOS upstream exchanges after another race wins.

🐛 Proposed fix

```diff
- return ExchangeWithFallback(nil, client, r, upstream)
+ return ExchangeWithFallback(ctx, client, r, upstream)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream_ios.go` around lines 48 - 79, The exchange method in upstreamResolverIOS currently passes nil as the context to ExchangeWithFallback which prevents cancellation of iOS in-flight upstream queries; update the call in exchange (the final return line) to pass the incoming ctx instead of nil so ExchangeWithFallback and any underlying network operations can observe cancellation; ensure you only change the argument to ExchangeWithFallback (in upstreamResolverIOS.exchange) from nil to ctx and verify that ExchangeWithFallback's signature accepts a context (update its signature if needed).
🧹 Nitpick comments (4)
client/internal/dns/server.go (2)
1125-1135: "routes" in the doc actually counts HA groups, not routes.

`len(snap.selected)` is the number of `HAUniqueID` keys in the `route.HAMap`, not the number of `Route` entries under them. In practice each HA group corresponds to one logical route so the behaviour is fine, but the comment's "+1s per 100 routes" reads as counting individual routes. Consider tightening the comment to "per 100 selected HA groups" to prevent confusion if someone later revisits scaling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/server.go` around lines 1125 - 1135, The comment for warningDelay is misleading: the integer passed (routeCount) represents selected HA groups (HAUniqueID keys), not individual Route entries; update the function comment above warningDelay (and any mention of "+1s per 100 routes") to say "per 100 selected HA groups" (or "per 100 HAUniqueID entries") and clarify that len(snap.selected) counts HA groups so readers won't confuse it with total Route entries; keep the rest of the docstring intact and reference the function name warningDelay and the routeCount parameter in the updated wording.
961-984: Minor inconsistency with `registerFallback` for the domain arg.

`registerFallback` (line 781) passes `domain.Domain(nbdns.RootZone)` explicitly, but here `nbdns.RootZone` is passed bare to `newUpstreamResolver`. It compiles today because the const is untyped, but the two call sites should match for grep-ability and to stay robust if the const ever gets a named type.

```diff
- nbdns.RootZone,
+ domain.Domain(nbdns.RootZone),
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/server.go` around lines 961 - 984, In addHostRootZone, newUpstreamResolver is called with nbdns.RootZone as a raw value which is inconsistent with registerFallback's domain.Domain(nbdns.RootZone) usage; to make the two call sites uniform and robust to future typing changes, change the argument passed to newUpstreamResolver in addHostRootZone to domain.Domain(nbdns.RootZone) (keeping the rest of the call and assignments to handler, handler.selectedRoutes, handler.addRace, and s.registerHandler unchanged).

client/internal/dns/upstream.go (1)
166-170: camelCase typo in identifier name.

`contextWithupstreamProtocolResult` should be `contextWithUpstreamProtocolResult` (capital U after "With") per Go naming conventions; this matches the style already used by `setUpstreamProtocol` just below.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 166 - 170, The function name contextWithupstreamProtocolResult uses incorrect camelCase; rename it to contextWithUpstreamProtocolResult and update its comment and all call sites to match; ensure the returned types and the use with upstreamProtocolKey{} and upstreamProtocolResult remain unchanged and that usages alongside setUpstreamProtocol are updated so compilation and references stay consistent.client/internal/dns/server_test.go (1)
998-1005: `ProbeAvailability` method is now dead code.

`handlerWithStop` no longer requires `ProbeAvailability` after this PR (it's removed from `server.go`). The method on `mockHandler` can be dropped to avoid giving the false impression that probing is still part of the handler contract.

```diff
 func (m *mockHandler) ServeDNS(dns.ResponseWriter, *dns.Msg) {}
 func (m *mockHandler) Stop()                                 {}
-func (m *mockHandler) ProbeAvailability(context.Context)     {}
 func (m *mockHandler) ID() types.HandlerID                   { return types.HandlerID(m.Id) }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/server_test.go` around lines 998 - 1005, The mockHandler struct still defines a ProbeAvailability method even though handlerWithStop no longer requires probing; remove the dead method to avoid implying it’s part of the handler contract. Edit the mockHandler implementation in server_test.go and delete the func (m *mockHandler) ProbeAvailability(context.Context) {} declaration, leaving ServeDNS, Stop, and ID intact (mockHandler, ServeDNS, Stop, ID are the symbols to locate); run tests to confirm no reference to ProbeAvailability remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream_test.go`:
- Around line 391-440: The test currently uses timeoutErr returned immediately
from mockUpstreamResolverPerServer, so it doesn't verify true concurrency;
modify the mock response for the broken server (mockUpstreamResponse for
broken.String()) to block until the resolver's context is cancelled or its
per-query timeout elapses (e.g., sleep/select on ctx.Done()) instead of
returning timeoutErr immediately, so upstreamResolverBase.ServeDNS (and addRace
setup) must observe the blocking behavior and the working group's quick response
wins the race; ensure the mock uses the resolver's ctx or a derived context to
unblock on cancellation/timeouts rather than returning an immediate error.
In `@client/internal/dns/upstream.go`:
- Around line 73-79: The comment is wrong: using raceMinPerUpstreamTimeout as a
floor can let N upstreams exceed raceMaxTotalTimeout; update tryRace to enforce
the total cap by wrapping the entire race loop in an outer
context.WithTimeout(ctx, raceMaxTotalTimeout) and compute per-upstream timeout
as per := max(raceMaxTotalTimeout/numUpstreams, raceMinPerUpstreamTimeout) (use
symbols raceMaxTotalTimeout, raceMinPerUpstreamTimeout, and tryRace to locate
the code), and apply that per to each upstream request; alternatively, if you
prefer minimal change, reword the comment near raceMaxTotalTimeout and
raceMinPerUpstreamTimeout to state the floor may allow total >
raceMaxTotalTimeout for many upstreams and note the intended behavior at the
other occurrence of the same logic around tryRace.
- Around line 308-390: raceAll cancels the shared raceCtx when a winner is
found, causing queryUpstream to see context.Canceled and currently record those
as failures via handleUpstreamError/markUpstreamFail; change queryUpstream to
detect when the provided context was canceled (ctx.Err() == context.Canceled or
errors.Is(err, context.Canceled)) and in that case do NOT call markUpstreamFail
or record a failure—return no failure (nil) so losing-but-cancelled upstreams
don't poison the health map; keep all existing behavior for real errors/timeouts
and continue to call handleUpstreamError/markUpstreamFail for those.
---
Outside diff comments:
In `@client/internal/dns/upstream_ios.go`:
- Around line 48-79: The exchange method in upstreamResolverIOS currently passes
nil as the context to ExchangeWithFallback which prevents cancellation of iOS
in-flight upstream queries; update the call in exchange (the final return line)
to pass the incoming ctx instead of nil so ExchangeWithFallback and any
underlying network operations can observe cancellation; ensure you only change
the argument to ExchangeWithFallback (in upstreamResolverIOS.exchange) from nil
to ctx and verify that ExchangeWithFallback's signature accepts a context
(update its signature if needed).
---
Nitpick comments:
In `@client/internal/dns/server_test.go`:
- Around line 998-1005: The mockHandler struct still defines a ProbeAvailability
method even though handlerWithStop no longer requires probing; remove the dead
method to avoid implying it’s part of the handler contract. Edit the mockHandler
implementation in server_test.go and delete the func (m *mockHandler)
ProbeAvailability(context.Context) {} declaration, leaving ServeDNS, Stop, and
ID intact (mockHandler, ServeDNS, Stop, ID are the symbols to locate); run tests
to confirm no reference to ProbeAvailability remains.
In `@client/internal/dns/server.go`:
- Around line 1125-1135: The comment for warningDelay is misleading: the integer
passed (routeCount) represents selected HA groups (HAUniqueID keys), not
individual Route entries; update the function comment above warningDelay (and
any mention of "+1s per 100 routes") to say "per 100 selected HA groups" (or
"per 100 HAUniqueID entries") and clarify that len(snap.selected) counts HA
groups so readers won't confuse it with total Route entries; keep the rest of
the docstring intact and reference the function name warningDelay and the
routeCount parameter in the updated wording.
- Around line 961-984: In addHostRootZone, newUpstreamResolver is called with
nbdns.RootZone as a raw value which is inconsistent with registerFallback's
domain.Domain(nbdns.RootZone) usage; to make the two call sites uniform and
robust to future typing changes, change the argument passed to
newUpstreamResolver in addHostRootZone to domain.Domain(nbdns.RootZone) (keeping
the rest of the call and assignments to handler, handler.selectedRoutes,
handler.addRace, and s.registerHandler unchanged).
In `@client/internal/dns/upstream.go`:
- Around line 166-170: The function name contextWithupstreamProtocolResult uses
incorrect camelCase; rename it to contextWithUpstreamProtocolResult and update
its comment and all call sites to match; ensure the returned types and the use
with upstreamProtocolKey{} and upstreamProtocolResult remain unchanged and that
usages alongside setUpstreamProtocol are updated so compilation and references
stay consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 603e9512-58c0-4225-9ddd-903bebbf4616
📒 Files selected for processing (12)
- client/internal/dns/local/local.go
- client/internal/dns/mock_server.go
- client/internal/dns/server.go
- client/internal/dns/server_test.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_android.go
- client/internal/dns/upstream_general.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
- client/internal/engine.go
- client/internal/routemanager/manager.go
- client/internal/routemanager/mock.go
💤 Files with no reviewable changes (1)
- client/internal/dns/local/local.go
Actionable comments posted: 1
♻️ Duplicate comments (2)
client/internal/dns/upstream.go (2)
73-79: ⚠️ Potential issue | 🟡 Minor

The timeout comment still overstates the total cap.

With `raceMinPerUpstreamTimeout` as a floor, a race with enough upstreams can exceed `raceMaxTotalTimeout` overall. Either enforce an outer race timeout or reword the comment to describe the per-upstream floor behavior.

Also applies to: 337-341
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 73 - 79, The comment is misleading: because raceMinPerUpstreamTimeout is a floor, many upstreams can cause the total race duration to exceed raceMaxTotalTimeout; fix by enforcing an outer race deadline when running the upstream race (create a parent context with timeout raceMaxTotalTimeout and derive per-upstream timeouts from it while still applying raceMinPerUpstreamTimeout as the floor), and update the comment for raceMaxTotalTimeout/raceMinPerUpstreamTimeout to state that raceMaxTotalTimeout is an overall cap that is now enforced by the parent context while raceMinPerUpstreamTimeout is a per-upstream floor.
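As a sketch of the shape this prompt asks for, with assumed constant values and an assumed `query` callback standing in for `queryUpstream`, the outer context makes the total cap authoritative even when the per-upstream floor applies:

```go
import (
	"context"
	"fmt"
	"net/netip"
	"time"

	"github.com/miekg/dns"
)

const (
	raceMaxTotalTimeout       = 5 * time.Second        // assumed value
	raceMinPerUpstreamTimeout = 500 * time.Millisecond // assumed value
)

func tryRaceCapped(ctx context.Context, r *dns.Msg, group []netip.AddrPort,
	query func(context.Context, *dns.Msg, netip.AddrPort, time.Duration) (*dns.Msg, error),
) (*dns.Msg, error) {
	if len(group) == 0 {
		return nil, fmt.Errorf("empty upstream group")
	}
	// The outer deadline turns raceMaxTotalTimeout into an enforced total cap.
	raceCtx, cancel := context.WithTimeout(ctx, raceMaxTotalTimeout)
	defer cancel()

	per := max(raceMaxTotalTimeout/time.Duration(len(group)), raceMinPerUpstreamTimeout)
	for _, upstream := range group {
		if raceCtx.Err() != nil {
			break // the floor may oversize attempts, but the outer cap still wins
		}
		if msg, err := query(raceCtx, r.Copy(), upstream, per); err == nil {
			return msg, nil
		}
	}
	return nil, fmt.Errorf("no upstream answered within %s", raceMaxTotalTimeout)
}
```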
308-390: ⚠️ Potential issue | 🟠 Major

Do not record race-loser cancellation as upstream failure.

When one race wins, `raceCtx` is canceled and slower workers return `context.Canceled`; `queryUpstream` still calls `markUpstreamFail`, so healthy-but-slower groups can be projected as unhealthy. Treat parent/race cancellation as neutral and skip health recording for that path.

🛡️ Proposed fix

```diff
 rm, _, err := u.upstreamClient.exchange(ctx, upstream.String(), r)
 if err != nil {
+    if errors.Is(err, context.Canceled) || errors.Is(parentCtx.Err(), context.Canceled) {
+        return nil, "", nil
+    }
     failure := u.handleUpstreamError(err, upstream, startTime)
     u.markUpstreamFail(upstream, failure.reason)
     return nil, "", failure
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 308 - 390, The race cancellation path is being recorded as failures; update queryUpstream so it treats parent/race cancellation as neutral: when exchange returns an error that is due to the race being canceled (detect via parentCtx.Err() == context.Canceled or errors.Is(err, context.Canceled)), do NOT call markUpstreamFail and return (nil, "", nil) so callers treat it as neither success nor failure; similarly, where rm == nil || !rm.Response, if ctx.Err() == context.Canceled then skip markUpstreamFail and return (nil, "", nil). Apply these checks inside queryUpstream (referencing parentCtx, ctx, queryUpstream, markUpstreamFail) so tryRace and raceAll will not record race-loser cancellations as upstream failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 306-318: raceAll spawns concurrent workers that all receive the
same *dns.Msg, which leads to unsafe concurrent mutations; fix by cloning the
request for each worker (use dns.Msg.Copy or equivalent) before passing it into
tryRace so each goroutine gets an independent message; likewise update tryRace
and queryUpstream call sites (and any place where queryUpstream fans out, e.g.,
the other referenced blocks) to accept/use and forward a cloned *dns.Msg rather
than sharing the original request.
---
Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 73-79: The comment is misleading: because
raceMinPerUpstreamTimeout is a floor, many upstreams can cause the total race
duration to exceed raceMaxTotalTimeout; fix by enforcing an outer race deadline
when running the upstream race (create a parent context with timeout
raceMaxTotalTimeout and derive per-upstream timeouts from it while still
applying raceMinPerUpstreamTimeout as the floor), and update the comment for
raceMaxTotalTimeout/raceMinPerUpstreamTimeout to state that raceMaxTotalTimeout
is an overall cap that is now enforced by the parent context while
raceMinPerUpstreamTimeout is a per-upstream floor.
- Around line 308-390: The race cancellation path is being recorded as failures;
update queryUpstream so it treats parent/race cancellation as neutral: when
exchange returns an error that is due to the race being canceled (detect via
parentCtx.Err() == context.Canceled or errors.Is(err, context.Canceled)), do NOT
call markUpstreamFail and return (nil, "", nil) so callers treat it as neither
success nor failure; similarly, where rm == nil || !rm.Response, if ctx.Err() ==
context.Canceled then skip markUpstreamFail and return (nil, "", nil). Apply
these checks inside queryUpstream (referencing parentCtx, ctx, queryUpstream,
markUpstreamFail) so tryRace and raceAll will not record race-loser
cancellations as upstream failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1014583b-f251-4af9-8150-c2115c9cdd37
📒 Files selected for processing (12)
- client/internal/dns/local/local.go
- client/internal/dns/mock_server.go
- client/internal/dns/server.go
- client/internal/dns/server_test.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_android.go
- client/internal/dns/upstream_general.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
- client/internal/engine.go
- client/internal/routemanager/manager.go
- client/internal/routemanager/mock.go
💤 Files with no reviewable changes (1)
- client/internal/dns/local/local.go
✅ Files skipped from review due to trivial changes (2)
- client/internal/dns/upstream_ios.go
- client/internal/dns/server.go
🚧 Files skipped from review as they are similar to previous changes (3)
- client/internal/dns/upstream_android.go
- client/internal/routemanager/mock.go
- client/internal/dns/upstream_test.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)
367-376: ⚠️ Potential issue | 🟠 Major

Check the parent context before recording a failure.

This still only suppresses errors that wrap `context.Canceled`. If a loser is canceled but the transport returns a read/deadline error, line 375 records a false failure and can poison passive health.

🛡️ Proposed fix

```diff
 if err != nil {
     // A parent cancellation (e.g., another race won and the coordinator
     // cancelled the losers) is not an upstream failure. Don't record it
     // in the health map or it would poison healthy upstreams.
-    if errors.Is(err, context.Canceled) {
+    if errors.Is(err, context.Canceled) || errors.Is(parentCtx.Err(), context.Canceled) {
         return nil, "", &upstreamFailure{upstream: upstream, reason: "canceled"}
     }
     failure := u.handleUpstreamError(err, upstream, startTime)
     u.markUpstreamFail(upstream, failure.reason)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 367 - 376, Before recording a failure, check the parent request context for cancellation and treat any error that occurred because the parent was canceled as a non-upstream failure: if the parent context (e.g., the ctx used to drive this request) is done (ctx.Err() != nil) then return an upstreamFailure with reason "canceled" without calling u.handleUpstreamError or u.markUpstreamFail; otherwise proceed to call u.handleUpstreamError(err, upstream, startTime) and u.markUpstreamFail(upstream, failure.reason). Use the existing symbols err, upstream, startTime, u.handleUpstreamError, and u.markUpstreamFail to locate and implement this check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 345-350: The issue is that ExchangeWithFallback/tryRace mutates
the original dns.Msg (via opt.SetUDPSize) and then reuses it for sequential
upstreams, causing altered EDNS0 UDP size for later attempts; update tryRace so
it clones the original request before each call to queryUpstream (similar to how
raceAll does) — locate tryRace/ExchangeWithFallback and ensure you call the
request cloning function (or use msg.Copy()) to produce a fresh dns.Msg for
every iteration before passing it to queryUpstream, preserving the original
unmutated request for subsequent upstream attempts and keeping truncation/TCP
fallback behavior correct.
---
Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 367-376: Before recording a failure, check the parent request
context for cancellation and treat any error that occurred because the parent
was canceled as a non-upstream failure: if the parent context (e.g., the ctx
used to drive this request) is done (ctx.Err() != nil) then return an
upstreamFailure with reason "canceled" without calling u.handleUpstreamError or
u.markUpstreamFail; otherwise proceed to call u.handleUpstreamError(err,
upstream, startTime) and u.markUpstreamFail(upstream, failure.reason). Use the
existing symbols err, upstream, startTime, u.handleUpstreamError, and
u.markUpstreamFail to locate and implement this check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6b7c8d06-12d0-446b-ad0d-7c15d8c20072
📒 Files selected for processing (3)
- client/internal/dns/server.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/dns/server.go
b961f01 to 152ba3e (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)
345-350: ⚠️ Potential issue | 🟠 Major

Clone the request for each sequential upstream attempt.

`ExchangeWithFallback`/`ExchangeWithNetstack` mutate EDNS0 in-place, so reusing `r` across fallback upstreams can make later attempts inherit the first attempt's capped UDP size. `raceAll` already clones per parallel worker; do the same inside the sequential walk.

🛠️ Proposed fix

```diff
 for _, upstream := range group {
     if ctx.Err() != nil {
         return raceResult{failures: failures}
     }
-    msg, proto, failure := u.queryUpstream(ctx, r, upstream, timeout)
+    msg, proto, failure := u.queryUpstream(ctx, r.Copy(), upstream, timeout)
     if failure != nil {
         failures = append(failures, *failure)
         continue
     }
```

Run this read-only check to confirm the dependency version and existing clone usage:

```bash
#!/bin/bash
set -euo pipefail
rg -n 'github\.com/miekg/dns' go.mod
rg -n --type=go '\.Copy\(\)|func \(u \*upstreamResolverBase\) tryRace|queryUpstream\(ctx, r' -C3
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 345 - 350, The loop over sequential upstreams reuses the same dns.Msg r which ExchangeWithFallback/ExchangeWithNetstack mutate (EDNS0) causing later attempts to inherit modified UDP size; inside the for _, upstream := range group loop create a fresh clone of the original request (use r.Copy() or equivalent) and pass that clone to queryUpstream instead of r so each sequential attempt uses an unmodified message; update the code paths around queryUpstream and any callers that rely on r to use the cloned msg for each attempt.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/server.go`:
- Around line 1052-1054: The grace window is scaled using len(snap.selected)
(the HAMap bucket count) when computing delay :=
s.warningDelay(len(snap.selected)), which undercounts routes; change the call to
pass the actual total route count by summing the lengths of each bucket in
snap.selected (i.e., compute totalRoutes := sum(len(bucket) for each bucket in
snap.selected)) and call s.warningDelay(totalRoutes) instead so s.warningDelay
receives the real route count; update the code around the variables now, delay,
and snap.selected accordingly.
---
Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 345-350: The loop over sequential upstreams reuses the same
dns.Msg r which ExchangeWithFallback/ExchangeWithNetstack mutate (EDNS0) causing
later attempts to inherit modified UDP size; inside the for _, upstream := range
group loop create a fresh clone of the original request (use r.Copy() or
equivalent) and pass that clone to queryUpstream instead of r so each sequential
attempt uses an unmodified message; update the code paths around queryUpstream
and any callers that rely on r to use the cloned msg for each attempt.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5fb54fad-0818-4a00-b330-dfd5318b5e14
📒 Files selected for processing (4)
- client/internal/dns/server.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/dns/upstream_ios.go
Actionable comments posted: 1
♻️ Duplicate comments (1)
client/internal/dns/server.go (1)
1052-1054: ⚠️ Potential issue | 🟠 Major

Scale the grace window by actual selected-route count.

`len(snap.selected)` counts HAMap buckets, not the total routes inside those buckets, so the startup grace can be under-sized when many routes share a bucket.

🛠️ Proposed fix

```diff
-	delay := s.warningDelay(len(snap.selected))
+	delay := s.warningDelay(haMapRouteCount(snap.selected))
```

```diff
+func haMapRouteCount(hm route.HAMap) int {
+	count := 0
+	for _, routes := range hm {
+		count += len(routes)
+	}
+	return count
+}
+
 // warningDelay returns the grace window for the given selected-route
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/server.go` around lines 1052 - 1054, The grace-window calculation uses len(snap.selected) (which counts HAMap buckets) when calling s.warningDelay, under-sizing the delay when buckets contain many routes; change it to compute the actual selected-route count by iterating snap.selected's buckets and summing their route counts (e.g., summing len(...) or a Count field for each bucket) and pass that total to s.warningDelay(selectedCount) instead of len(snap.selected); update the variables near now, delay and the s.warningDelay(...) call so delay = s.warningDelay(selectedCount).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/server.go`:
- Around line 1057-1063: projectNSGroupHealth currently classifies group health
using all configured nameservers via nameServerAddrPorts(group.NameServers) even
though handlers only query the subset returned by filterNameServers; this lets
skipped servers make groupHasImmediateUpstream return true and bypass overlay
grace. Fix by creating two variables: configuredServers :=
nameServerAddrPorts(group.NameServers) and servers :=
filterNameServers(configuredServers) (or call nameServerAddrPorts only for
configuredServers and derive the queried subset via filterNameServers), then use
servers when calling evaluateNSGroupHealth and groupHasImmediateUpstream while
storing configuredServers in NSGroupState.Servers so status still shows the full
configured list. Ensure the same change is applied at the other occurrences
mentioned (around lines 919-933 and 1299-1306).
---
Duplicate comments:
In `@client/internal/dns/server.go`:
- Around line 1052-1054: The grace-window calculation uses len(snap.selected)
(which counts HAMap buckets) when calling s.warningDelay, under-sizing the delay
when buckets contain many routes; change it to compute the actual selected-route
count by iterating snap.selected's buckets and summing their route counts (e.g.,
summing len(...) or a Count field for each bucket) and pass that total to
s.warningDelay(selectedCount) instead of len(snap.selected); update the
variables near now, delay and the s.warningDelay(...) call so delay =
s.warningDelay(selectedCount).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f55ca1f7-c83a-43ba-bf87-cf03de47eba4
📒 Files selected for processing (4)
- client/internal/dns/server.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/dns/upstream.go
85178a3 to 10ecf7c (Compare)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)
199-210: ⚠️ Potential issue | 🟠 Major

Include race topology in the handler ID.

`ID()` flattens and sorts upstreams, but group boundaries and within-group order now change behavior. For example, one race `[a,b]` is sequential fallback, while two races `[a]` and `[b]` run in parallel, yet both hash to the same ID. This can leave stale handler behavior across DNS config updates.

🛠️ Proposed fix

```diff
 func (u *upstreamResolverBase) ID() types.HandlerID {
-	servers := u.flatUpstreams()
-	slices.SortFunc(servers, func(a, b netip.AddrPort) int { return a.Compare(b) })
-
 	hash := sha256.New()
 	hash.Write([]byte(u.domain.PunycodeString() + ":"))
-	for _, s := range servers {
-		hash.Write([]byte(s.String()))
-		hash.Write([]byte("|"))
+
+	groups := make([]string, 0, len(u.upstreamServers))
+	for _, group := range u.upstreamServers {
+		var b strings.Builder
+		for _, s := range group {
+			b.WriteString(s.String())
+			b.WriteByte('|')
+		}
+		groups = append(groups, b.String())
+	}
+	slices.Sort(groups)
+
+	for _, group := range groups {
+		hash.Write([]byte("["))
+		hash.Write([]byte(group))
+		hash.Write([]byte("]"))
 	}
 	return types.HandlerID("upstream-" + hex.EncodeToString(hash.Sum(nil)[:8]))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 199 - 210, The current ID() uses flatUpstreams() which loses group topology so different race/fallback groupings can produce the same hash; change ID() in upstreamResolverBase to hash the grouped topology instead of the flattened list: iterate the original grouped upstream structure (preserving group boundaries and the within-group order used at run-time), write a distinct separator marker into the SHA256 stream between groups and include each AddrPort entry in each group, then produce the HandlerID from the truncated hash as before; reference the upstreamResolverBase.ID method and avoid using flatUpstreams() so group boundaries and ordering are reflected in the ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@client/internal/dns/upstream.go`:
- Around line 199-210: The current ID() uses flatUpstreams() which loses group
topology so different race/fallback groupings can produce the same hash; change
ID() in upstreamResolverBase to hash the grouped topology instead of the
flattened list: iterate the original grouped upstream structure (preserving
group boundaries and the within-group order used at run-time), write a distinct
separator marker into the SHA256 stream between groups and include each AddrPort
entry in each group, then produce the HandlerID from the truncated hash as
before; reference the upstreamResolverBase.ID method and avoid using
flatUpstreams() so group boundaries and ordering are reflected in the ID.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a87a52fc-29c1-45ac-9f66-e1b4a6dd0406
📒 Files selected for processing (5)
- client/internal/dns/server.go
- client/internal/dns/server_test.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/dns/server.go
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/upstream_ios.go (1)
72-80: ⚠️ Potential issue | 🟠 Major

TCP fallback for iOS private DNS client will fail due to mismatched address type.

When an inbound DNS query arrives over TCP and `ExchangeWithFallback` detects it via `dnsProtocolFromContext(ctx)`, it attempts to exchange via TCP using the client's dialer. The private iOS client at lines 92–98 sets `LocalAddr` to `&net.UDPAddr`, which Go's `net.Dialer` rejects for TCP dials with "mismatched local address type" (as confirmed by the Go runtime).

Use either:

- `*net.TCPAddr` for TCP exchanges instead of `*net.UDPAddr`, or
- Remove `LocalAddr` and rely solely on `IP_BOUND_IF` for interface binding (verify this works on iOS with the Control callback).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream_ios.go` around lines 72 - 80, The iOS private client sets Dialer.LocalAddr to a *net.UDPAddr which causes "mismatched local address type" when ExchangeWithFallback (which checks dnsProtocolFromContext(ctx)) falls back to TCP; update GetClientPrivate (or the code path creating the net.Dialer used before calling ExchangeWithFallback) to either set LocalAddr to a *net.TCPAddr when the dial will be TCP (detect via dnsProtocolFromContext(ctx) or pass a protocol flag into GetClientPrivate) or remove LocalAddr entirely and rely only on the Control callback / IP_BOUND_IF for interface binding; ensure the change targets the dialer construction used by ExchangeWithFallback so TCP dials no longer use a UDPAddr.
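A minimal sketch of the first option, picking the `LocalAddr` type to match the dial protocol; the `proto` parameter and function name are hypothetical wiring, not the actual iOS client code:

```go
import "net"

// boundDialer returns a dialer whose LocalAddr type matches the protocol:
// Go's net.Dialer rejects a *net.UDPAddr LocalAddr for TCP dials with
// "mismatched local address type".
func boundDialer(ip net.IP, proto string) *net.Dialer {
	d := &net.Dialer{}
	switch proto {
	case "tcp", "tcp4", "tcp6":
		d.LocalAddr = &net.TCPAddr{IP: ip}
	default:
		d.LocalAddr = &net.UDPAddr{IP: ip}
	}
	return d
}
```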
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/server.go`:
- Around line 1331-1336: generateGroupKey builds the nsGroupID using the slices'
current order which makes the key unstable across order-only config changes;
update generateGroupKey to produce a deterministic key by collecting the
NameServers' AddrPort().String() values, normalizing if needed, sorting that
servers slice, sorting nsGroup.Domains as well, then format the sorted servers
and sorted domains into the nsGroupID; this ensures NameServerGroup keys
(nsGroupID) are order-stable and preserves everHealthy/warningActive state.
- Around line 316-323: SetRouteSources currently only updates
DefaultServer.selectedRoutes/activeRoutes but existing upstream handlers created
earlier still hold copied callbacks and stay nil; update SetRouteSources to
propagate the new callbacks to those handlers by iterating the server's upstream
handler registry and calling a setter on each handler's resolver (add a small
method like SetRouteSources or SetSelectedActive on the concrete resolver type
used in this package), using a type assertion to the concrete resolver
implementation to call that setter; alternatively change upstream handlers to
read server.selectedRoutes/activeRoutes via a shared accessor rather than
copying the function so new values are seen immediately.
- Around line 1094-1105: When verdict == nsVerdictUndecided we must clear any
previously held unhealthy streak so future fresh failures don't inherit a stale
unhealthySince; modify the nsVerdictUndecided branch (in the switch handling
verdict) to reset the group's unhealthy streak by setting p.unhealthySince =
time.Time{} (and any related counters if present, e.g. p.unhealthyCount or
p.consecutiveUnhealthy) in addition to preserving the existing warningActive
logic and groupErr = nil so that projectUnhealthy/projectHealthy behavior
remains correct.
---
Outside diff comments:
In `@client/internal/dns/upstream_ios.go`:
- Around line 72-80: The iOS private client sets Dialer.LocalAddr to a
*net.UDPAddr which causes "mismatched local address type" when
ExchangeWithFallback (which checks dnsProtocolFromContext(ctx)) falls back to
TCP; update GetClientPrivate (or the code path creating the net.Dialer used
before calling ExchangeWithFallback) to either set LocalAddr to a *net.TCPAddr
when the dial will be TCP (detect via dnsProtocolFromContext(ctx) or pass a
protocol flag into GetClientPrivate) or remove LocalAddr entirely and rely only
on the Control callback / IP_BOUND_IF for interface binding; ensure the change
targets the dialer construction used by ExchangeWithFallback so TCP dials no
longer use a UDPAddr.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 882717d9-90bb-4fa0-b560-39e3d2ce6289
📒 Files selected for processing (5)
- client/internal/dns/server.go
- client/internal/dns/server_test.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
♻️ Duplicate comments (3)
client/internal/dns/server.go (3)
316-323: ⚠️ Potential issue | 🟠 Major

Propagate route source changes to existing upstream handlers.

`SetRouteSources` updates only the server fields, while upstream handlers copy `selectedRoutes` when created. Handlers created before route-manager wiring, especially permanent/iOS root handlers, can keep a nil callback and misclassify routed upstreams as public or avoid the iOS private client path. Prefer a shared route-source accessor or propagate the new callback to all existing upstream handlers, including host-root/fallback handlers.

Also applies to: 792-794, 894-895, 1006-1008
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/server.go` around lines 316 - 323, SetRouteSources currently only updates DefaultServer.selectedRoutes/activeRoutes but does not update already-created upstream handlers that captured the previous callbacks; modify SetRouteSources to either (a) switch handlers to use a shared accessor function stored on DefaultServer (e.g., have handlers call s.selectedRoutes()/s.activeRoutes() at runtime) or (b) iterate all existing upstream handler instances (including host-root and fallback handlers) and set their route source callback fields to the new selected/active functions so they stop using a nil or stale callback; update SetRouteSources to perform this propagation and ensure any types named upstream handler, host-root handler, or fallback handler are updated accordingly.
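A sketch of option (a), the shared accessor; the stand-in `server` type mirrors only the fields this pattern needs and is an assumption, not the real `DefaultServer`:

```go
package dnssketch

import (
	"sync"

	"github.com/netbirdio/netbird/route"
)

// server is a stand-in for DefaultServer with only the relevant fields.
type server struct {
	mux            sync.RWMutex
	selectedRoutes func() route.HAMap
}

// SetRouteSources swaps the callback under the lock. Existing handlers
// need no rewiring because they read it through routeSource per query.
func (s *server) SetRouteSources(selected func() route.HAMap) {
	s.mux.Lock()
	defer s.mux.Unlock()
	s.selectedRoutes = selected
}

// routeSource is the shared accessor handlers call at query time, so a
// handler created before the route manager is wired still sees updates.
func (s *server) routeSource() route.HAMap {
	s.mux.RLock()
	src := s.selectedRoutes
	s.mux.RUnlock()
	if src == nil {
		return nil
	}
	return src()
}
```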
1334-1339: ⚠️ Potential issue | 🟡 Minor

Make the NS group key order-stable.

`generateGroupKey` still depends on the configured slice order. An order-only refresh for the same group creates a new projection ID, losing `everHealthy`/`warningActive` and potentially re-emitting warnings.

Proposed fix

```diff
+	"slices"
 	"strings"
 	"sync"
 	"time"
```

```diff
 func generateGroupKey(nsGroup *nbdns.NameServerGroup) nsGroupID {
 	var servers []string
 	for _, ns := range nsGroup.NameServers {
 		servers = append(servers, ns.AddrPort().String())
 	}
-	return nsGroupID(fmt.Sprintf("%v_%v", servers, nsGroup.Domains))
+	slices.Sort(servers)
+	domains := slices.Clone(nsGroup.Domains)
+	slices.Sort(domains)
+	return nsGroupID(fmt.Sprintf("%v_%v", servers, domains))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/server.go` around lines 1334 - 1339, generateGroupKey builds an unstable key because it depends on the iteration order of nsGroup.NameServers (and potentially nsGroup.Domains); to fix, produce a deterministic key in generateGroupKey by collecting the NameServers addresses into a slice, sorting that slice, also sort nsGroup.Domains (if it's a slice), then format the sorted slices into the nsGroupID so the key is stable across reorderings and preserves everHealthy/warningActive state.
1100-1105: ⚠️ Potential issue | 🟠 Major

Reset held unhealthy streaks when the verdict becomes undecided.

A routed failure suppressed during the grace window sets `unhealthySince`; if that evidence ages out before a warning fires, the stale timestamp is retained and a later fresh failure can emit immediately.

Proposed fix

```diff
 case nsVerdictUndecided:
 	// Stay Available until evidence says otherwise, unless a
 	// warning is already active for this group.
 	enabled = !p.warningActive
 	groupErr = nil
+	if !p.warningActive {
+		p.unhealthySince = time.Time{}
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/server.go` around lines 1100 - 1105, When handling nsVerdictUndecided in the switch, reset the held unhealthy streak timestamp so stale `unhealthySince` doesn't cause an immediate warning later; in the nsVerdictUndecided branch (where you set `enabled = !p.warningActive` and `groupErr = nil`), also clear `p.unhealthySince` (e.g., set it to the zero value for its type, such as `time.Time{}` or `nil` if it's a pointer) so any aged suppression is forgotten when the verdict becomes undecided.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@client/internal/dns/server.go`:
- Around line 316-323: SetRouteSources currently only updates
DefaultServer.selectedRoutes/activeRoutes but does not update already-created
upstream handlers that captured the previous callbacks; modify SetRouteSources
to either (a) switch handlers to use a shared accessor function stored on
DefaultServer (e.g., have handlers call s.selectedRoutes()/s.activeRoutes() at
runtime) or (b) iterate all existing upstream handler instances (including
host-root and fallback handlers) and set their route source callback fields to
the new selected/active functions so they stop using a nil or stale callback;
update SetRouteSources to perform this propagation and ensure any types named
upstream handler, host-root handler, or fallback handler are updated
accordingly.
- Around line 1334-1339: generateGroupKey builds an unstable key because it
depends on the iteration order of nsGroup.NameServers (and potentially
nsGroup.Domains); to fix, produce a deterministic key in generateGroupKey by
collecting the NameServers addresses into a slice, sorting that slice, also sort
nsGroup.Domains (if it's a slice), then format the sorted slices into the
nsGroupID so the key is stable across reorderings and preserves
everHealthy/warningActive state.
- Around line 1100-1105: When handling nsVerdictUndecided in the switch, reset
the held unhealthy streak timestamp so stale `unhealthySince` doesn't cause an
immediate warning later; in the nsVerdictUndecided branch (where you set
`enabled = !p.warningActive` and `groupErr = nil`), also clear
`p.unhealthySince` (e.g., set it to the zero value for its type, such as
`time.Time{}` or `nil` if it's a pointer) so any aged suppression is forgotten
when the verdict becomes undecided.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c31a09ca-0bb6-4fc7-ad93-eb026f0fb034
📒 Files selected for processing (5)
- client/internal/dns/server.go
- client/internal/dns/server_test.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/internal/dns/upstream_test.go
d01b451 to d7a3195 (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)
199-210: ⚠️ Potential issue | 🟠 Major

Include race boundaries and fallback order in the handler ID.

Line 201 flattens and sorts upstreams before hashing, but the new model makes `upstreamRace` boundaries and in-race order behaviorally significant. Configs like `[[a,b]]` vs `[[a],[b]]`, or `[a,b]` vs `[b,a]`, now produce the same handler ID while resolving differently. Hash the races with delimiters and preserve fallback order.

🛠️ Proposed fix

```diff
 func (u *upstreamResolverBase) ID() types.HandlerID {
-	servers := u.flatUpstreams()
-	slices.SortFunc(servers, func(a, b netip.AddrPort) int { return a.Compare(b) })
-
 	hash := sha256.New()
 	hash.Write([]byte(u.domain.PunycodeString() + ":"))
-	for _, s := range servers {
-		hash.Write([]byte(s.String()))
-		hash.Write([]byte("|"))
+	for _, race := range u.upstreamServers {
+		hash.Write([]byte("["))
+		for _, s := range race {
+			hash.Write([]byte(s.String()))
+			hash.Write([]byte(","))
+		}
+		hash.Write([]byte("]"))
 	}
 	return types.HandlerID("upstream-" + hex.EncodeToString(hash.Sum(nil)[:8]))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 199 - 210, The current upstreamResolverBase.ID() flattens and sorts servers via flatUpstreams(), losing upstreamRace boundaries and fallback order; change the ID generation to iterate the resolver's race groups in order (preserve each race's internal order or sort within a race if intended), append a clear delimiter between races (e.g., "[" and "]" or another separator) when writing to the hash, and include the fallback sequence as part of the hashed input so that different race groupings or orderings (e.g., [[a,b]] vs [[a],[b]] or [a,b] vs [b,a]) produce distinct HandlerIDs; update references in upstreamResolverBase.ID(), remove reliance on flatUpstreams() for the hash input, and ensure the final hex.EncodeToString(hash.Sum(nil)[:8]) still returns the ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 744-751: haMapContains currently treats any matching entry in
route.HAMap as "routed" even when the match comes from a dynamic (domain-based)
route placeholder Network, causing dynamic-only upstreams to bypass the startup
grace window; update haMapContains to detect dynamic routes (e.g., check route
entries for the dynamic flag or placeholder Network on the
route.Route/route.Entry objects) and return a tri-state or
"unknown/dynamic-selected" result instead of plain true, so callers can apply
the startup grace window for dynamic-only upstreams (or alternatively expose the
resolved dynamic route prefixes to this check and only consider concrete prefix
matches as routed).
---
Outside diff comments:
In `@client/internal/dns/upstream.go`:
- Around line 199-210: The current upstreamResolverBase.ID() flattens and sorts
servers via flatUpstreams(), losing upstreamRace boundaries and fallback order;
change the ID generation to iterate the resolver's race groups in order
(preserve each race's internal order or sort within a race if intended), append
a clear delimiter between races (e.g., "[" and "]" or another separator) when
writing to the hash, and include the fallback sequence as part of the hashed
input so that different race groupings or orderings (e.g., [[a,b]] vs [[a],[b]]
or [a,b] vs [b,a]) produce distinct HandlerIDs; update references in
upstreamResolverBase.ID(), remove reliance on flatUpstreams() for the hash
input, and ensure the final hex.EncodeToString(hash.Sum(nil)[:8]) still returns
the ID.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6ea0f54b-ca5d-44f8-a7f1-ad2ed0f024e1
📒 Files selected for processing (5)
- client/internal/dns/server.go
- client/internal/dns/server_test.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- client/internal/dns/upstream_ios.go
- client/internal/dns/server.go
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)
199-210: ⚠️ Potential issue | 🟠 Major

Include race grouping and fallback order in the handler ID.

`ID()` sorts the flattened upstream set, but the new model makes both group boundaries and in-race order semantically meaningful. Configs like `[A,B]` vs `[B,A]`, or one race `[A,B]` vs two races `[A] [B]`, currently hash to the same handler ID and can prevent a changed resolver topology from being treated as a distinct handler.

Suggested fix

```diff
 func (u *upstreamResolverBase) ID() types.HandlerID {
-	servers := u.flatUpstreams()
-	slices.SortFunc(servers, func(a, b netip.AddrPort) int { return a.Compare(b) })
-
 	hash := sha256.New()
 	hash.Write([]byte(u.domain.PunycodeString() + ":"))
-	for _, s := range servers {
-		hash.Write([]byte(s.String()))
-		hash.Write([]byte("|"))
+	for _, race := range u.upstreamServers {
+		hash.Write([]byte("["))
+		for _, s := range race {
+			hash.Write([]byte(s.String()))
+			hash.Write([]byte(","))
+		}
+		hash.Write([]byte("]"))
 	}
 	return types.HandlerID("upstream-" + hex.EncodeToString(hash.Sum(nil)[:8]))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 199 - 210, The ID() currently hashes the flattened server list from u.flatUpstreams(), losing race-group boundaries and fallback ordering; update upstreamResolverBase.ID() to iterate the actual upstream grouping structure (preserve group boundaries and the order of servers within each race and the order of races) instead of using flatUpstreams(), and include explicit separators/tokens when writing to the hash (e.g., start/end-of-group markers and an inter-server separator) so different configurations like [A,B] vs [B,A] or [A][B] produce distinct HandlerIDs while still incorporating u.domain.PunycodeString().
♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)
758-765: ⚠️ Potential issue | 🟠 Major

Dynamic-route upstreams still bypass the startup grace path.
The documented gap means DNS upstreams reachable only through selected dynamic routes are classified as “not routed,” so failures emit immediately instead of using the startup grace window. That preserves a false-positive warning path this PR is trying to remove; consider a tri-state route classification or feeding resolved dynamic prefixes into this check.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 758 - 765, haMapContains currently treats any route with a placeholder Network (domain-based/dynamic routes) as "not routed" which causes dynamic-only upstreams to skip the startup grace window; update haMapContains (and related use of route.HAMap) to distinguish three states (routed, not routed, unknown/dynamic) by detecting dynamic/domain routes (e.g. via route.Route flags or route.IsDynamic/Domain fields) and either (a) consult resolved dynamic prefixes (pass in or query a resolved-prefixes map from the dynamic resolver) and treat matching resolved prefixes as routed, or (b) return an "unknown" result that callers can treat as routed-for-grace; adjust callers of haMapContains accordingly so dynamic-only upstreams are not immediately marked failing.
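A sketch of the tri-state classification both variants of this comment suggest; `routeClass` is hypothetical, and the use of `IsDynamic`/`Network` on `route.Route` is an assumption about the route model rather than verified branch code:

```go
package dnssketch

import (
	"net/netip"

	"github.com/netbirdio/netbird/route"
)

type routeClass int

const (
	routeNone    routeClass = iota // no selected route covers the upstream IP
	routeDynamic                   // only a domain-based (dynamic) route could cover it
	routeStatic                    // a concrete prefix covers it
)

// classifyUpstream distinguishes "not routed" from "dynamic-only", so
// callers can apply the startup grace window to dynamic-only upstreams
// instead of emitting a warning immediately.
func classifyUpstream(selected route.HAMap, ip netip.Addr) routeClass {
	class := routeNone
	for _, routes := range selected {
		for _, r := range routes {
			if r.IsDynamic() {
				// Placeholder Network: cannot prefix-match until resolved.
				class = routeDynamic
				continue
			}
			if r.Network.Contains(ip) {
				return routeStatic
			}
		}
	}
	return class
}
```

Callers would treat `routeDynamic` like a routed upstream for grace purposes, or upgrade it to `routeStatic` once resolved dynamic prefixes are fed into the check.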
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@client/internal/dns/upstream.go`:
- Around line 199-210: The ID() currently hashes the flattened server list from
u.flatUpstreams(), losing race-group boundaries and fallback ordering; update
upstreamResolverBase.ID() to iterate the actual upstream grouping structure
(preserve group boundaries and the order of servers within each race and the
order of races) instead of using flatUpstreams(), and include explicit
separators/tokens when writing to the hash (e.g., start/end-of-group markers and
an inter-server separator) so different configurations like [A,B] vs [B,A] or
[A][B] produce distinct HandlerIDs while still incorporating
u.domain.PunycodeString().
---
Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 758-765: haMapContains currently treats any route with a
placeholder Network (domain-based/dynamic routes) as "not routed" which causes
dynamic-only upstreams to skip the startup grace window; update haMapContains
(and related use of route.HAMap) to distinguish three states (routed, not
routed, unknown/dynamic) by detecting dynamic/domain routes (e.g. via
route.Route flags or route.IsDynamic/Domain fields) and either (a) consult
resolved dynamic prefixes (pass in or query a resolved-prefixes map from the
dynamic resolver) and treat matching resolved prefixes as routed, or (b) return
an "unknown" result that callers can treat as routed-for-grace; adjust callers
of haMapContains accordingly so dynamic-only upstreams are not immediately
marked failing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9bfbddb5-aeee-435a-855f-e6b2f36398c7
📒 Files selected for processing (5)
- client/internal/dns/server.go
- client/internal/dns/server_test.go
- client/internal/dns/upstream.go
- client/internal/dns/upstream_ios.go
- client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- client/internal/dns/upstream_ios.go
- client/internal/dns/server.go
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)
558-570: ⚠️ Potential issue | 🟠 Major

Copy the DNS client before setting `UDPSize`.

`ExchangeWithFallback` mutates the caller-provided `*dns.Client` at line 565. Under a parallel race model, a shared upstream client can race across concurrent exchanges or leak this per-request UDP size into subsequent requests. Mirror `toTCPClient` by copying the client for the UDP exchange path.

🛠️ Proposed fix

```diff
 clientMaxSize := clientUDPMaxSize(r)
+udpClient := *client
 // Cap EDNS0 to our tunnel MTU so the upstream doesn't send a
 // response larger than our read buffer.
 // Note: the query could be sent out on an interface that is not ours,
 // but higher MTU settings could break truncation handling.
 maxUDPPayload := uint16(currentMTU - ipUDPHeaderSize)
-client.UDPSize = maxUDPPayload
+udpClient.UDPSize = maxUDPPayload
 if opt := r.IsEdns0(); opt != nil && opt.UDPSize() > maxUDPPayload {
 	opt.SetUDPSize(maxUDPPayload)
 }

-rm, t, err := client.ExchangeContext(ctx, r, upstream)
+rm, t, err := udpClient.ExchangeContext(ctx, r, upstream)
 if err != nil {
 	return nil, t, fmt.Errorf("with udp: %w", err)
 }

-rm, t, err = toTCPClient(client).ExchangeContext(ctx, r, upstream)
+rm, t, err = toTCPClient(&udpClient).ExchangeContext(ctx, r, upstream)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 558 - 570, The upstream DNS client is being mutated (client.UDPSize and EDNS0 UDP size) before calling ExchangeContext/ExchangeWithFallback, which can race when the same *dns.Client is shared; fix by making a copy of the client before mutating and use that copy for the UDP exchange path (mirror the toTCPClient approach): create a local copy of the dns.Client value, set copy.UDPSize and adjust EDNS0 on the copy, then call ExchangeContext/ExchangeWithFallback with the copy instead of the original client to avoid leaking per-request UDPSize across concurrent requests.
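For illustration, the copy-before-mutate pattern as a self-contained sketch against github.com/miekg/dns (exchangeUDP is a hypothetical helper, not the PR's code):

```go
package sketch

import (
	"context"
	"time"

	"github.com/miekg/dns"
)

// exchangeUDP copies the shared client by value before touching UDPSize,
// so the per-request payload cap never leaks into concurrent exchanges.
func exchangeUDP(ctx context.Context, shared *dns.Client, r *dns.Msg, upstream string, maxUDPPayload uint16) (*dns.Msg, time.Duration, error) {
	udpClient := *shared // shallow copy; mutations below stay request-local
	udpClient.UDPSize = maxUDPPayload
	if opt := r.IsEdns0(); opt != nil && opt.UDPSize() > maxUDPPayload {
		opt.SetUDPSize(maxUDPPayload) // also cap the advertised EDNS0 size on the request
	}
	return udpClient.ExchangeContext(ctx, r, upstream)
}
```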
♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)
769-776: ⚠️ Potential issue | 🟠 Major
Dynamic-route upstreams still bypass the startup grace path.
The documented gap means DNS upstreams reachable only via selected dynamic routes are classified as "not routed," so failures warn immediately instead of using the overlay startup grace window. This leaves one false-positive path open, counter to the PR's main objective.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/dns/upstream.go` around lines 769 - 776, haMapContains currently treats upstreams reachable only via dynamic (domain-based) routes as "not routed" because it only matches concrete Networks, causing those upstreams to bypass the startup grace window; update haMapContains to also consider dynamic routes as a match — e.g., iterate the route.HAMap entries and treat entries whose route.Network is the dynamic placeholder (or which are marked dynamic) as a valid match for the given ip so they are classified as routed for the DNS health startup grace path; reference haMapContains, route.HAMap, route.Network (placeholder/dynamic marker) and the ip parameter when locating and changing the matching logic.
Actionable comments posted: 1
```go
// Working group answers in a single RTT; the broken group's
// timeout (100ms) must not block the response.
assert.Less(t, elapsed, 100*time.Millisecond, "race must not wait for broken group's timeout")
```
Comment doesn't match the configured values.
The inline comment says the broken group's timeout (100ms), but the broken group's actual per-upstream timeout is upstreamTimeout = 250ms (line 417) and its mock delay is 500ms (line 405). The 100 * time.Millisecond here is just the assertion threshold used to prove parallelism, not a configured timeout.
✏️ Suggested wording
```diff
-	// Working group answers in a single RTT; the broken group's
-	// timeout (100ms) must not block the response.
+	// Working group answers in a single RTT. Assert well below the broken
+	// group's 250ms per-upstream timeout (and its 500ms mock delay) to
+	// prove races run in parallel, not serially.
 	assert.Less(t, elapsed, 100*time.Millisecond, "race must not wait for broken group's timeout")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@client/internal/dns/upstream_test.go` around lines 439 - 441, The comment
incorrectly states "the broken group's timeout (100ms)"—update the test comment
near the assertion that checks elapsed < 100*time.Millisecond to accurately
reflect that 100ms is the assertion threshold proving parallelism, not the
configured broken-group timeout; mention the actual per-upstream timeout
variable upstreamTimeout (250ms) and the mock delay (500ms) or reword to: "the
assertion threshold (100ms) proves parallelism; the broken group's configured
timeout is upstreamTimeout (250ms) and its mock delay is 500ms" so readers won't
confuse the numbers.



Describe your changes
Replaces the synthetic DNS probe with a passive health model driven by real query outcomes, and merges overlapping nameserver groups into a single handler that races them in parallel.
- Drops the probe machinery (`ProbeAvailability`, `NB_SKIP_DNS_PROBE`, exponential-backoff reactivation). No more synthetic SOA traffic and no more deactivate/reactivate dance on upstream failure.
- Records per-upstream health snapshots (`UpstreamHealth`) on every query and projects them into `NSGroupState` every 10s: a group is Healthy if any seen upstream has a recent success, Unhealthy if any has a recent failure with no fresher success, Undecided otherwise (sketched below).
- Adds `GetActiveClientRoutes` on route-manager so the DNS layer can tell "routed but no live peer" apart from "routed and reachable".
- Resets health state on `Stop` so a subsequent Start doesn't inherit sticky flags (`everHealthy`, `warningActive`) from the prior session.

Addresses the long-standing "Unable to reach one or more DNS servers" false-positive reported in #5727 and the community forum. Supersedes #5625 (which took the narrower approach of skipping the probe for overlay IPs).
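A rough sketch of that projection rule, with hypothetical type shapes (the PR's actual `UpstreamHealth` and `NSGroupState` definitions may differ):

```go
package sketch

import "time"

// UpstreamHealth and NSGroupState here are illustrative shapes only.
type UpstreamHealth struct {
	LastSuccess time.Time
	LastFailure time.Time
}

type NSGroupState int

const (
	Undecided NSGroupState = iota
	Healthy
	Unhealthy
)

// projectGroup applies the stated rule: Healthy if any seen upstream has a
// recent success; Unhealthy if any has a recent failure with no fresher
// success; Undecided otherwise.
func projectGroup(snapshots []UpstreamHealth, now time.Time, recent time.Duration) NSGroupState {
	anyFreshFailure := false
	for _, s := range snapshots {
		if now.Sub(s.LastSuccess) <= recent {
			return Healthy
		}
		if now.Sub(s.LastFailure) <= recent && s.LastFailure.After(s.LastSuccess) {
			anyFreshFailure = true
		}
	}
	if anyFreshFailure {
		return Unhealthy
	}
	return Undecided
}
```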
Handler layout
Two nameserver groups targeting `example.com` collapse into one handler with two inner races, A and B.
Per query: A and B run concurrently. A tries 10.0.0.1, falls back to 10.0.0.2 on failure. First valid answer from either race wins; the loser is cancelled.
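Illustratively, the race could look like this sketch (not the PR's implementation; queryOne is a hypothetical single-exchange helper):

```go
package sketch

import (
	"context"
	"errors"

	"github.com/miekg/dns"
)

// raceGroups runs one goroutine per group; each group tries its servers in
// order (fallback), the first valid answer wins, and cancel() interrupts
// the losing race.
func raceGroups(ctx context.Context, groups [][]string, r *dns.Msg,
	queryOne func(ctx context.Context, server string, r *dns.Msg) (*dns.Msg, error),
) (*dns.Msg, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops the losing race once we return a winner

	results := make(chan *dns.Msg, len(groups))
	for _, group := range groups {
		go func(servers []string) {
			for _, server := range servers { // in-order fallback within one race
				if resp, err := queryOne(ctx, server, r); err == nil && resp != nil {
					results <- resp
					return
				}
			}
			results <- nil // the whole group failed
		}(group)
	}

	for range groups {
		select {
		case resp := <-results:
			if resp != nil {
				return resp, nil // first valid answer wins
			}
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, errors.New("all nameserver groups failed")
}
```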
Health event transitions
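A rough sketch of how projected states could drive event emission, reusing `NSGroupState` from the earlier sketch; the `everHealthy`/`warningActive` flags come from the description above, while the grace handling and event names are assumptions:

```go
// healthProjector carries the sticky flags that Stop must reset.
type healthProjector struct {
	everHealthy   bool // set once the group has ever been Healthy
	warningActive bool // a WARNING was emitted and not yet recovered
}

func (p *healthProjector) onProjection(state NSGroupState, emit func(event string)) {
	switch state {
	case Healthy:
		p.everHealthy = true
		if p.warningActive {
			p.warningActive = false
			emit("RECOVERY") // the group is reachable again
		}
	case Unhealthy:
		// Warn only after the group has been healthy once (startup grace)
		// and never repeat an already-active warning.
		if p.everHealthy && !p.warningActive {
			p.warningActive = true
			emit("WARNING") // "Unable to reach one or more DNS servers"
		}
	case Undecided:
		// No traffic or mixed signals: leave the flags unchanged.
	}
}
```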
Issue ticket number and link
Closes #5727
Supersedes #5625
Stack
Checklist
Documentation
Select exactly one:
Internal refactor of client-side DNS health handling. No user-facing configuration or API change.
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Improvements
Tests