
[client] Drop DNS probes for passive health projection#5971

Open
lixmal wants to merge 2 commits into main from drop-dns-probes

Conversation

Collaborator

@lixmal lixmal commented Apr 23, 2026

Describe your changes

Replaces the synthetic DNS probe with a passive health model driven by real query outcomes, and merges overlapping nameserver groups into a single handler that races them in parallel.

  • Drop active probes (ProbeAvailability, NB_SKIP_DNS_PROBE, exponential-backoff reactivation). No more synthetic SOA traffic and no more deactivate/reactivate dance on upstream failure.
  • Merge every nameserver group targeting the same domain into one handler with multiple inner "races". Within a race upstreams are tried in order; when more than one race exists they fan out in parallel and the first valid answer wins.
  • Record per-upstream outcomes (UpstreamHealth) on every query and project them into NSGroupState every 10s: a group is Healthy if any seen upstream has a recent success, Unhealthy if any has a recent failure with no fresher success, Undecided otherwise.
  • Suppress the "Nameserver group unreachable" event during peer startup: if every upstream is reachable only through the overlay (routed prefix or the WG subnet itself) and no peer has yet reported Connected for it, hold the warning for a grace window that scales with the number of selected routes. Public DNS, connected-overlay DNS, and previously-healthy groups fire immediately.
  • Add GetActiveClientRoutes on route-manager so the DNS layer can tell "routed but no live peer" apart from "routed and reachable".
  • Clear health projection state on Stop so a subsequent Start doesn't inherit sticky flags (everHealthy, warningActive) from the prior session.

Addresses the long-standing "Unable to reach one or more DNS servers" false-positive reported in #5727 and the community forum. Supersedes #5625 (which took the narrower approach of skipping the probe for overlay IPs).

Handler layout

Two nameserver groups targeting example.com:

| Group | Upstreams (in order) |
| ----- | -------------------- |
| A     | 10.0.0.1, 10.0.0.2   |
| B     | 10.0.1.1             |

collapse into one handler with two inner races:

upstreamResolverBase(example.com)
  ├── race A: [10.0.0.1, 10.0.0.2]   // sequential fallback
  └── race B: [10.0.1.1]             // parallel with A

Per query: A and B run concurrently. A tries 10.0.0.1, falls back to 10.0.0.2 on failure. First valid answer from either race wins; the loser is cancelled.

Health event transitions

| Current verdict | Next verdict | Warning event | Recovery event |
| --- | --- | --- | --- |
| Healthy / Undecided | Unhealthy (public or connected-overlay upstream) | emit | -- |
| Healthy / Undecided | Unhealthy (overlay only, no connected peer, never healthy) | held for grace window | -- |
| Healthy | Unhealthy (overlay only, no connected peer, was healthy before) | emit | -- |
| Unhealthy (warning emitted) | Unhealthy | silent | -- |
| Unhealthy (warning emitted) | Healthy | -- | emit |
| Unhealthy (warning held) | Healthy | -- | silent |

Issue ticket number and link

Closes #5727
Supersedes #5625

Stack

Checklist

  • It is a bug fix
  • It is a typo/documentation fix
  • It is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Internal refactor of client-side DNS health handling. No user-facing configuration or API change.

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features

    • DNS queries now race across nameserver groups and return the first successful response.
    • Routing now uses both "selected" and a new "active" route source.
  • Improvements

    • Upstream health is recorded from real query outcomes and projected into per-group states with timed warnings and recovery events.
    • Nameserver groups for the same domain are merged; projection state is cleared on stop.
    • Eager probing removed in favor of continuous health projection.
  • Tests

    • Expanded tests for race behavior, health projection, event timing, and recovery.


coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Removed probe-driven upstream availability APIs and eager probing; introduced per-domain concurrent NS-group "race" resolution with per-upstream health snapshots, a periodic NS-group health projection (grace/streak rules and event emission), route-source wiring (selected+active), and corresponding test and mock updates.

Changes

  • DNS server & local resolver (client/internal/dns/server.go, client/internal/dns/local/local.go, client/internal/dns/mock_server.go): Removed per-upstream ProbeAvailability; removed eager probe orchestration; replaced SetRouteChecker with SetRouteSources(selected, active func() route.HAMap) and updated mock implementations.
  • Upstream core & adapters (client/internal/dns/upstream.go, client/internal/dns/upstream_*.go, client/internal/dns/upstream_android.go, client/internal/dns/upstream_general.go, client/internal/dns/upstream_ios.go): Reworked to per-domain ordered "races" (concurrent groups, serial attempts within a race), cloned per-attempt requests, protocol-aware exchange results, removed probe/disable/reactivation/backoff logic, added per-upstream health recording and an accessor, and changed resolver constructors to accept domain.Domain.
  • NS-group health projection (client/internal/dns/server.go, health sections): Added a periodic health refresh loop, a health lookback window, streak/everHealthy tracking, immediate-vs-grace classification using the selected/active route maps, and deterministic WARNING/RECOVERY SystemEvent emission; clears projection state on Stop().
  • Tests & mocks (client/internal/dns/server_test.go, client/internal/dns/upstream_test.go, client/internal/dns/mock_server.go, client/internal/routemanager/mock.go): Refactored tests for race semantics and projection behavior; added cancellable per-upstream mocks, a projection test harness and assertions, updated mocks for SetRouteSources and GetActiveClientRoutesFunc, and made UDP-size capture concurrency-safe.
  • Engine & routing API (client/internal/engine.go, client/internal/routemanager/manager.go): Engine now wires route sources via SetRouteSources(selected, active) and no longer triggers immediate DNS probes after network-map updates; added GetActiveClientRoutes() to Manager and DefaultManager.
  • Minor/adapter updates (client/internal/dns/upstream_android.go, client/internal/dns/upstream_general.go, client/internal/dns/upstream_ios.go): Changed newUpstreamResolver signatures to accept typed domain.Domain; the iOS upstream routing decision now consults selectedRoutes() and passes ctx into the fallback exchange.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant DNSServer
    participant RouteSources as RouteSources(selected, active)
    participant NSGroupA as NSGroupRaceA
    participant NSGroupB as NSGroupRaceB
    participant Upstream as UpstreamResolver

    Client->>DNSServer: DNS Query
    DNSServer->>RouteSources: selected(), active()
    DNSServer->>NSGroupA: start race (uses selected/active)
    DNSServer->>NSGroupB: start race (uses selected/active)

    alt NSGroupA wins
        NSGroupA->>Upstream: query upstream(s)
        Upstream-->>NSGroupA: Response + protocol (record UpstreamHealth)
        NSGroupA-->>DNSServer: Response
        DNSServer->>NSGroupB: cancel
        DNSServer-->>Client: Answer
    else NSGroupB wins
        NSGroupB->>Upstream: query upstream(s)
        Upstream-->>NSGroupB: Response + protocol (record UpstreamHealth)
        NSGroupB-->>DNSServer: Response
        DNSServer->>NSGroupA: cancel
        DNSServer-->>Client: Answer
    else All fail
        DNSServer-->>Client: SERVFAIL
    end

    DNSServer->>DNSServer: periodic refreshHealth tick
    DNSServer->>DNSServer: collect UpstreamHealth snapshots
    DNSServer->>DNSServer: projectNSGroupHealth -> emit WARNING/RECOVERY events

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • pappz

Poem

🐇 I hop where races race and packets roam,
I tally health and hush a false alarm,
I wait for routes, then let the swift reply,
I cancel slowfoot friends, let answers fly.
🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.26%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main architectural change: removing active DNS probes and replacing them with a passive health model based on real query outcomes. |
| Description check | ✅ Passed | The description comprehensively covers the changes, objectives, issue closure, and documentation status according to the template structure. |
| Linked Issues check | ✅ Passed | The code changes directly address #5727 by eliminating active probes, implementing passive health tracking, adding grace windows for startup warnings, and preventing sticky state across restarts. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to DNS health projection, upstream merging, route-source wiring, and health state management, aligned with the PR's core objectives. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream_ios.go (1)

48-79: ⚠️ Potential issue | 🟠 Major

Propagate ctx so iOS race losers can be cancelled.

Line 79 still passes nil into ExchangeWithFallback, so the new race cancellation cannot interrupt in-flight iOS upstream exchanges after another race wins.

🐛 Proposed fix
-	return ExchangeWithFallback(nil, client, r, upstream)
+	return ExchangeWithFallback(ctx, client, r, upstream)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream_ios.go` around lines 48 - 79, The exchange
method in upstreamResolverIOS currently passes nil as the context to
ExchangeWithFallback which prevents cancellation of iOS in-flight upstream
queries; update the call in exchange (the final return line) to pass the
incoming ctx instead of nil so ExchangeWithFallback and any underlying network
operations can observe cancellation; ensure you only change the argument to
ExchangeWithFallback (in upstreamResolverIOS.exchange) from nil to ctx and
verify that ExchangeWithFallback's signature accepts a context (update its
signature if needed).
🧹 Nitpick comments (4)
client/internal/dns/server.go (2)

1125-1135: "routes" in the doc actually counts HA groups, not routes.

len(snap.selected) is the number of HAUniqueID keys in the route.HAMap, not the number of Route entries under them. In practice each HA group corresponds to one logical route so the behaviour is fine, but the comment's "+1s per 100 routes" reads as counting individual routes. Consider tightening the comment to "per 100 selected HA groups" to prevent confusion if someone later revisits scaling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server.go` around lines 1125 - 1135, The comment for
warningDelay is misleading: the integer passed (routeCount) represents selected
HA groups (HAUniqueID keys), not individual Route entries; update the function
comment above warningDelay (and any mention of "+1s per 100 routes") to say "per
100 selected HA groups" (or "per 100 HAUniqueID entries") and clarify that
len(snap.selected) counts HA groups so readers won't confuse it with total Route
entries; keep the rest of the docstring intact and reference the function name
warningDelay and the routeCount parameter in the updated wording.

961-984: Minor inconsistency with registerFallback for the domain arg.

registerFallback (line 781) passes domain.Domain(nbdns.RootZone) explicitly, but here nbdns.RootZone is passed bare to newUpstreamResolver. It compiles today because the const is untyped, but the two call sites should match for grep-ability and to stay robust if the const ever gets a named type.

-		nbdns.RootZone,
+		domain.Domain(nbdns.RootZone),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server.go` around lines 961 - 984, In addHostRootZone,
newUpstreamResolver is called with nbdns.RootZone as a raw value which is
inconsistent with registerFallback's domain.Domain(nbdns.RootZone) usage; to
make the two call sites uniform and robust to future typing changes, change the
argument passed to newUpstreamResolver in addHostRootZone to
domain.Domain(nbdns.RootZone) (keeping the rest of the call and assignments to
handler, handler.selectedRoutes, handler.addRace, and s.registerHandler
unchanged).
client/internal/dns/upstream.go (1)

166-170: Exported-style camelCase typo in identifier name.

contextWithupstreamProtocolResult should be contextWithUpstreamProtocolResult (capital U after "With") per Go naming conventions — matches the style already used by setUpstreamProtocol just below.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 166 - 170, The function name
contextWithupstreamProtocolResult uses incorrect camelCase; rename it to
contextWithUpstreamProtocolResult and update its comment and all call sites to
match; ensure the returned types and the use with upstreamProtocolKey{} and
upstreamProtocolResult remain unchanged and that usages alongside
setUpstreamProtocol are updated so compilation and references stay consistent.
client/internal/dns/server_test.go (1)

998-1005: ProbeAvailability method is now dead code.

handlerWithStop no longer requires ProbeAvailability after this PR (it's removed from server.go). The method on mockHandler can be dropped to avoid giving the false impression that probing is still part of the handler contract.

 func (m *mockHandler) ServeDNS(dns.ResponseWriter, *dns.Msg) {}
 func (m *mockHandler) Stop()                                 {}
-func (m *mockHandler) ProbeAvailability(context.Context)     {}
 func (m *mockHandler) ID() types.HandlerID                   { return types.HandlerID(m.Id) }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server_test.go` around lines 998 - 1005, The mockHandler
struct still defines a ProbeAvailability method even though handlerWithStop no
longer requires probing; remove the dead method to avoid implying it’s part of
the handler contract. Edit the mockHandler implementation in server_test.go and
delete the func (m *mockHandler) ProbeAvailability(context.Context) {}
declaration, leaving ServeDNS, Stop, and ID intact (mockHandler, ServeDNS, Stop,
ID are the symbols to locate); run tests to confirm no reference to
ProbeAvailability remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream_test.go`:
- Around line 391-440: The test currently uses timeoutErr returned immediately
from mockUpstreamResolverPerServer, so it doesn't verify true concurrency;
modify the mock response for the broken server (mockUpstreamResponse for
broken.String()) to block until the resolver's context is cancelled or its
per-query timeout elapses (e.g., sleep/select on ctx.Done()) instead of
returning timeoutErr immediately, so upstreamResolverBase.ServeDNS (and addRace
setup) must observe the blocking behavior and the working group's quick response
wins the race; ensure the mock uses the resolver's ctx or a derived context to
unblock on cancellation/timeouts rather than returning an immediate error.

In `@client/internal/dns/upstream.go`:
- Around line 73-79: The comment is wrong: using raceMinPerUpstreamTimeout as a
floor can let N upstreams exceed raceMaxTotalTimeout; update tryRace to enforce
the total cap by wrapping the entire race loop in an outer
context.WithTimeout(ctx, raceMaxTotalTimeout) and compute per-upstream timeout
as per := max(raceMaxTotalTimeout/numUpstreams, raceMinPerUpstreamTimeout) (use
symbols raceMaxTotalTimeout, raceMinPerUpstreamTimeout, and tryRace to locate
the code), and apply that per to each upstream request; alternatively, if you
prefer minimal change, reword the comment near raceMaxTotalTimeout and
raceMinPerUpstreamTimeout to state the floor may allow total >
raceMaxTotalTimeout for many upstreams and note the intended behavior at the
other occurrence of the same logic around tryRace.
- Around line 308-390: raceAll cancels the shared raceCtx when a winner is
found, causing queryUpstream to see context.Canceled and currently record those
as failures via handleUpstreamError/markUpstreamFail; change queryUpstream to
detect when the provided context was canceled (ctx.Err() == context.Canceled or
errors.Is(err, context.Canceled)) and in that case do NOT call markUpstreamFail
or record a failure—return no failure (nil) so losing-but-cancelled upstreams
don't poison the health map; keep all existing behavior for real errors/timeouts
and continue to call handleUpstreamError/markUpstreamFail for those.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 603e9512-58c0-4225-9ddd-903bebbf4616

📥 Commits

Reviewing files that changed from the base of the PR and between 5da05ec and 4abea7c.

📒 Files selected for processing (12)
  • client/internal/dns/local/local.go
  • client/internal/dns/mock_server.go
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_android.go
  • client/internal/dns/upstream_general.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
  • client/internal/engine.go
  • client/internal/routemanager/manager.go
  • client/internal/routemanager/mock.go
💤 Files with no reviewable changes (1)
  • client/internal/dns/local/local.go


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
client/internal/dns/upstream.go (2)

73-79: ⚠️ Potential issue | 🟡 Minor

The timeout comment still overstates the total cap.

With raceMinPerUpstreamTimeout as a floor, a race with enough upstreams can exceed raceMaxTotalTimeout overall. Either enforce an outer race timeout or reword the comment to describe the per-upstream floor behavior.

Also applies to: 337-341

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 73 - 79, The comment is
misleading: because raceMinPerUpstreamTimeout is a floor, many upstreams can
cause the total race duration to exceed raceMaxTotalTimeout; fix by enforcing an
outer race deadline when running the upstream race (create a parent context with
timeout raceMaxTotalTimeout and derive per-upstream timeouts from it while still
applying raceMinPerUpstreamTimeout as the floor), and update the comment for
raceMaxTotalTimeout/raceMinPerUpstreamTimeout to state that raceMaxTotalTimeout
is an overall cap that is now enforced by the parent context while
raceMinPerUpstreamTimeout is a per-upstream floor.

308-390: ⚠️ Potential issue | 🟠 Major

Do not record race-loser cancellation as upstream failure.

When one race wins, raceCtx is canceled and slower workers return context.Canceled; queryUpstream still calls markUpstreamFail, so healthy-but-slower groups can be projected as unhealthy. Treat parent/race cancellation as neutral and skip health recording for that path.

🛡️ Proposed fix
 	rm, _, err := u.upstreamClient.exchange(ctx, upstream.String(), r)
 
 	if err != nil {
+		if errors.Is(err, context.Canceled) || errors.Is(parentCtx.Err(), context.Canceled) {
+			return nil, "", nil
+		}
 		failure := u.handleUpstreamError(err, upstream, startTime)
 		u.markUpstreamFail(upstream, failure.reason)
 		return nil, "", failure
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 308 - 390, The race
cancellation path is being recorded as failures; update queryUpstream so it
treats parent/race cancellation as neutral: when exchange returns an error that
is due to the race being canceled (detect via parentCtx.Err() ==
context.Canceled or errors.Is(err, context.Canceled)), do NOT call
markUpstreamFail and return (nil, "", nil) so callers treat it as neither
success nor failure; similarly, where rm == nil || !rm.Response, if ctx.Err() ==
context.Canceled then skip markUpstreamFail and return (nil, "", nil). Apply
these checks inside queryUpstream (referencing parentCtx, ctx, queryUpstream,
markUpstreamFail) so tryRace and raceAll will not record race-loser
cancellations as upstream failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 306-318: raceAll spawns concurrent workers that all receive the
same *dns.Msg, which leads to unsafe concurrent mutations; fix by cloning the
request for each worker (use dns.Msg.Copy or equivalent) before passing it into
tryRace so each goroutine gets an independent message; likewise update tryRace
and queryUpstream call sites (and any place where queryUpstream fans out, e.g.,
the other referenced blocks) to accept/use and forward a cloned *dns.Msg rather
than sharing the original request.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1014583b-f251-4af9-8150-c2115c9cdd37

📥 Commits

Reviewing files that changed from the base of the PR and between 4abea7c and c102592.

📒 Files selected for processing (12)
  • client/internal/dns/local/local.go
  • client/internal/dns/mock_server.go
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_android.go
  • client/internal/dns/upstream_general.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
  • client/internal/engine.go
  • client/internal/routemanager/manager.go
  • client/internal/routemanager/mock.go
💤 Files with no reviewable changes (1)
  • client/internal/dns/local/local.go
✅ Files skipped from review due to trivial changes (2)
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/server.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • client/internal/dns/upstream_android.go
  • client/internal/routemanager/mock.go
  • client/internal/dns/upstream_test.go


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)

367-376: ⚠️ Potential issue | 🟠 Major

Check the parent context before recording a failure.

This still only suppresses errors that wrap context.Canceled. If a loser is canceled but the transport returns a read/deadline error, Line 375 records a false failure and can poison passive health.

🛡️ Proposed fix
 	if err != nil {
 		// A parent cancellation (e.g., another race won and the coordinator
 		// cancelled the losers) is not an upstream failure. Don't record it
 		// in the health map or it would poison healthy upstreams.
-		if errors.Is(err, context.Canceled) {
+		if errors.Is(err, context.Canceled) || errors.Is(parentCtx.Err(), context.Canceled) {
 			return nil, "", &upstreamFailure{upstream: upstream, reason: "canceled"}
 		}
 		failure := u.handleUpstreamError(err, upstream, startTime)
 		u.markUpstreamFail(upstream, failure.reason)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 367 - 376, Before recording a
failure, check the parent request context for cancellation and treat any error
that occurred because the parent was canceled as a non-upstream failure: if the
parent context (e.g., the ctx used to drive this request) is done (ctx.Err() !=
nil) then return an upstreamFailure with reason "canceled" without calling
u.handleUpstreamError or u.markUpstreamFail; otherwise proceed to call
u.handleUpstreamError(err, upstream, startTime) and u.markUpstreamFail(upstream,
failure.reason). Use the existing symbols err, upstream, startTime,
u.handleUpstreamError, and u.markUpstreamFail to locate and implement this
check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6b7c8d06-12d0-446b-ad0d-7c15d8c20072

📥 Commits

Reviewing files that changed from the base of the PR and between c102592 and 21616bc.

📒 Files selected for processing (3)
  • client/internal/dns/server.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/server.go

Comment thread client/internal/dns/upstream.go
@lixmal lixmal force-pushed the drop-dns-probes branch 2 times, most recently from b961f01 to 152ba3e Compare April 23, 2026 13:58
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)

345-350: ⚠️ Potential issue | 🟠 Major

Clone the request for each sequential upstream attempt.

ExchangeWithFallback/ExchangeWithNetstack mutate EDNS0 in-place, so reusing r across fallback upstreams can make later attempts inherit the first attempt’s capped UDP size. raceAll already clones per parallel worker; do the same inside the sequential walk.

🛠️ Proposed fix
 	for _, upstream := range group {
 		if ctx.Err() != nil {
 			return raceResult{failures: failures}
 		}
-		msg, proto, failure := u.queryUpstream(ctx, r, upstream, timeout)
+		msg, proto, failure := u.queryUpstream(ctx, r.Copy(), upstream, timeout)
 		if failure != nil {
 			failures = append(failures, *failure)
 			continue
 		}

Run this read-only check to confirm the dependency version and existing clone usage:

#!/bin/bash
set -euo pipefail

rg -n 'github\.com/miekg/dns' go.mod
rg -n --type=go '\.Copy\(\)|func \(u \*upstreamResolverBase\) tryRace|queryUpstream\(ctx, r' -C3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 345 - 350, The loop over
sequential upstreams reuses the same dns.Msg r which
ExchangeWithFallback/ExchangeWithNetstack mutate (EDNS0) causing later attempts
to inherit modified UDP size; inside the for _, upstream := range group loop
create a fresh clone of the original request (use r.Copy() or equivalent) and
pass that clone to queryUpstream instead of r so each sequential attempt uses an
unmodified message; update the code paths around queryUpstream and any callers
that rely on r to use the cloned msg for each attempt.

ℹ️ Review info

Run ID: 5fb54fad-0818-4a00-b330-dfd5318b5e14

📥 Commits

Reviewing files that changed from the base of the PR and between 21616bc and b961f01.

📒 Files selected for processing (4)
  • client/internal/dns/server.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/upstream_ios.go

Comment thread client/internal/dns/server.go
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
client/internal/dns/server.go (1)

1052-1054: ⚠️ Potential issue | 🟠 Major

Scale the grace window by actual selected-route count.

len(snap.selected) counts HAMap buckets, not the total routes inside those buckets, so the startup grace can be under-sized when many routes share a bucket.

🛠️ Proposed fix
-	delay := s.warningDelay(len(snap.selected))
+	delay := s.warningDelay(haMapRouteCount(snap.selected))
+func haMapRouteCount(hm route.HAMap) int {
+	count := 0
+	for _, routes := range hm {
+		count += len(routes)
+	}
+	return count
+}
+
 // warningDelay returns the grace window for the given selected-route
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server.go` around lines 1052 - 1054, The grace-window
calculation uses len(snap.selected) (which counts HAMap buckets) when calling
s.warningDelay, under-sizing the delay when buckets contain many routes; change
it to compute the actual selected-route count by iterating snap.selected's
buckets and summing their route counts (e.g., summing len(...) or a Count field
for each bucket) and pass that total to s.warningDelay(selectedCount) instead of
len(snap.selected); update the variables near now, delay and the
s.warningDelay(...) call so delay = s.warningDelay(selectedCount).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/server.go`:
- Around line 1057-1063: projectNSGroupHealth currently classifies group health
using all configured nameservers via nameServerAddrPorts(group.NameServers) even
though handlers only query the subset returned by filterNameServers; this lets
skipped servers make groupHasImmediateUpstream return true and bypass overlay
grace. Fix by creating two variables: configuredServers :=
nameServerAddrPorts(group.NameServers) and servers :=
filterNameServers(configuredServers) (or call nameServerAddrPorts only for
configuredServers and derive the queried subset via filterNameServers), then use
servers when calling evaluateNSGroupHealth and groupHasImmediateUpstream while
storing configuredServers in NSGroupState.Servers so status still shows the full
configured list. Ensure the same change is applied at the other occurrences
mentioned (around lines 919-933 and 1299-1306).


ℹ️ Review info

Run ID: f55ca1f7-c83a-43ba-bf87-cf03de47eba4

📥 Commits

Reviewing files that changed from the base of the PR and between b961f01 and 152ba3e.

📒 Files selected for processing (4)
  • client/internal/dns/server.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/upstream.go

Comment thread client/internal/dns/server.go Outdated
@lixmal lixmal force-pushed the drop-dns-probes branch 3 times, most recently from 85178a3 to 10ecf7c Compare April 23, 2026 15:40
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)

199-210: ⚠️ Potential issue | 🟠 Major

Include race topology in the handler ID.

ID() flattens and sorts upstreams, but group boundaries and within-group order now change behavior. For example, one race [a,b] is sequential fallback, while two races [a] and [b] run in parallel, yet both hash to the same ID. This can leave stale handler behavior across DNS config updates.

🛠️ Proposed fix
 func (u *upstreamResolverBase) ID() types.HandlerID {
-	servers := u.flatUpstreams()
-	slices.SortFunc(servers, func(a, b netip.AddrPort) int { return a.Compare(b) })
-
 	hash := sha256.New()
 	hash.Write([]byte(u.domain.PunycodeString() + ":"))
-	for _, s := range servers {
-		hash.Write([]byte(s.String()))
-		hash.Write([]byte("|"))
+
+	groups := make([]string, 0, len(u.upstreamServers))
+	for _, group := range u.upstreamServers {
+		var b strings.Builder
+		for _, s := range group {
+			b.WriteString(s.String())
+			b.WriteByte('|')
+		}
+		groups = append(groups, b.String())
+	}
+	slices.Sort(groups)
+
+	for _, group := range groups {
+		hash.Write([]byte("["))
+		hash.Write([]byte(group))
+		hash.Write([]byte("]"))
 	}
 	return types.HandlerID("upstream-" + hex.EncodeToString(hash.Sum(nil)[:8]))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 199 - 210, The current ID()
uses flatUpstreams() which loses group topology so different race/fallback
groupings can produce the same hash; change ID() in upstreamResolverBase to hash
the grouped topology instead of the flattened list: iterate the original grouped
upstream structure (preserving group boundaries and the within-group order used
at run-time), write a distinct separator marker into the SHA256 stream between
groups and include each AddrPort entry in each group, then produce the HandlerID
from the truncated hash as before; reference the upstreamResolverBase.ID method
and avoid using flatUpstreams() so group boundaries and ordering are reflected
in the ID.

ℹ️ Review info

Run ID: a87a52fc-29c1-45ac-9f66-e1b4a6dd0406

📥 Commits

Reviewing files that changed from the base of the PR and between b961f01 and 10ecf7c.

📒 Files selected for processing (5)
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/server.go

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream_ios.go (1)

72-80: ⚠️ Potential issue | 🟠 Major

TCP fallback for iOS private DNS client will fail due to mismatched address type.

When an inbound DNS query arrives over TCP and ExchangeWithFallback detects it via dnsProtocolFromContext(ctx), it attempts to exchange via TCP using the client's dialer. The private iOS client at lines 92–98 sets LocalAddr to a *net.UDPAddr, which Go's net.Dialer rejects for TCP dials with a "mismatched local address type" error.

Use either:

  • *net.TCPAddr for TCP exchanges instead of *net.UDPAddr, or
  • Remove LocalAddr and rely solely on IP_BOUND_IF for interface binding (verify this works on iOS with the Control callback).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream_ios.go` around lines 72 - 80, The iOS private
client sets Dialer.LocalAddr to a *net.UDPAddr which causes "mismatched local
address type" when ExchangeWithFallback (which checks
dnsProtocolFromContext(ctx)) falls back to TCP; update GetClientPrivate (or the
code path creating the net.Dialer used before calling ExchangeWithFallback) to
either set LocalAddr to a *net.TCPAddr when the dial will be TCP (detect via
dnsProtocolFromContext(ctx) or pass a protocol flag into GetClientPrivate) or
remove LocalAddr entirely and rely only on the Control callback / IP_BOUND_IF
for interface binding; ensure the change targets the dialer construction used by
ExchangeWithFallback so TCP dials no longer use a UDPAddr.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/server.go`:
- Around line 1331-1336: generateGroupKey builds the nsGroupID using the slices'
current order which makes the key unstable across order-only config changes;
update generateGroupKey to produce a deterministic key by collecting the
NameServers' AddrPort().String() values, normalizing if needed, sorting that
servers slice, sorting nsGroup.Domains as well, then format the sorted servers
and sorted domains into the nsGroupID; this ensures NameServerGroup keys
(nsGroupID) are order-stable and preserves everHealthy/warningActive state.
- Around line 316-323: SetRouteSources currently only updates
DefaultServer.selectedRoutes/activeRoutes but existing upstream handlers created
earlier still hold copied callbacks and stay nil; update SetRouteSources to
propagate the new callbacks to those handlers by iterating the server's upstream
handler registry and calling a setter on each handler's resolver (add a small
method like SetRouteSources or SetSelectedActive on the concrete resolver type
used in this package), using a type assertion to the concrete resolver
implementation to call that setter; alternatively change upstream handlers to
read server.selectedRoutes/activeRoutes via a shared accessor rather than
copying the function so new values are seen immediately.
- Around line 1094-1105: When verdict == nsVerdictUndecided we must clear any
previously held unhealthy streak so future fresh failures don't inherit a stale
unhealthySince; modify the nsVerdictUndecided branch (in the switch handling
verdict) to reset the group's unhealthy streak by setting p.unhealthySince =
time.Time{} (and any related counters if present, e.g. p.unhealthyCount or
p.consecutiveUnhealthy) in addition to preserving the existing warningActive
logic and groupErr = nil so that projectUnhealthy/projectHealthy behavior
remains correct.


ℹ️ Review info

Run ID: 882717d9-90bb-4fa0-b560-39e3d2ce6289

📥 Commits

Reviewing files that changed from the base of the PR and between b961f01 and 10ecf7c.

📒 Files selected for processing (5)
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go

Comment thread client/internal/dns/server.go
Comment thread client/internal/dns/server.go
Comment thread client/internal/dns/server.go Outdated
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (3)
client/internal/dns/server.go (3)

316-323: ⚠️ Potential issue | 🟠 Major

Propagate route source changes to existing upstream handlers.

SetRouteSources updates only the server fields, while upstream handlers copy selectedRoutes when created. Handlers created before route-manager wiring, especially permanent/iOS root handlers, can keep a nil callback and misclassify routed upstreams as public or avoid the iOS private client path. Prefer a shared route-source accessor or propagate the new callback to all existing upstream handlers, including host-root/fallback handlers.

Also applies to: 792-794, 894-895, 1006-1008
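A minimal sketch of the shared-accessor approach: handlers keep a reference to the server and read the callback at query time, so a later SetRouteSources is visible to handlers created earlier. server, handler, and classify are illustrative stand-ins, not the PR's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// server holds the route-source callback behind a mutex; handlers read it
// through an accessor at query time instead of copying it at construction.
type server struct {
	mu             sync.RWMutex
	selectedRoutes func() []string
}

func (s *server) SetRouteSources(selected func() []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.selectedRoutes = selected
}

// routes is the shared accessor: nil-safe before route-manager wiring.
func (s *server) routes() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.selectedRoutes == nil {
		return nil
	}
	return s.selectedRoutes()
}

// handler keeps the server, not a copied callback, so it never goes stale.
type handler struct{ srv *server }

func (h *handler) classify() string {
	if len(h.srv.routes()) == 0 {
		return "public"
	}
	return "routed"
}

func main() {
	s := &server{}
	h := &handler{srv: s} // created before route-manager wiring
	fmt.Println(h.classify())
	s.SetRouteSources(func() []string { return []string{"100.64.0.0/10"} })
	fmt.Println(h.classify()) // handler sees the update without re-creation
}
```

The alternative (option b in the prompt) is to iterate all existing handlers in SetRouteSources and set their callback fields; the accessor avoids having to enumerate host-root and fallback handlers.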

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server.go` around lines 316 - 323, SetRouteSources
currently only updates DefaultServer.selectedRoutes/activeRoutes but does not
update already-created upstream handlers that captured the previous callbacks;
modify SetRouteSources to either (a) switch handlers to use a shared accessor
function stored on DefaultServer (e.g., have handlers call
s.selectedRoutes()/s.activeRoutes() at runtime) or (b) iterate all existing
upstream handler instances (including host-root and fallback handlers) and set
their route source callback fields to the new selected/active functions so they
stop using a nil or stale callback; update SetRouteSources to perform this
propagation and ensure any types named upstream handler, host-root handler, or
fallback handler are updated accordingly.

1334-1339: ⚠️ Potential issue | 🟡 Minor

Make the NS group key order-stable.

generateGroupKey still depends on the configured slice order. An order-only refresh for the same group creates a new projection ID, losing everHealthy/warningActive and potentially re-emitting warnings.

Proposed fix
+	"slices"
 	"strings"
 	"sync"
 	"time"
 func generateGroupKey(nsGroup *nbdns.NameServerGroup) nsGroupID {
 	var servers []string
 	for _, ns := range nsGroup.NameServers {
 		servers = append(servers, ns.AddrPort().String())
 	}
-	return nsGroupID(fmt.Sprintf("%v_%v", servers, nsGroup.Domains))
+	slices.Sort(servers)
+	domains := slices.Clone(nsGroup.Domains)
+	slices.Sort(domains)
+	return nsGroupID(fmt.Sprintf("%v_%v", servers, domains))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server.go` around lines 1334 - 1339, generateGroupKey
builds an unstable key because it depends on the iteration order of
nsGroup.NameServers (and potentially nsGroup.Domains); to fix, produce a
deterministic key in generateGroupKey by collecting the NameServers addresses
into a slice, sorting that slice, also sort nsGroup.Domains (if it's a slice),
then format the sorted slices into the nsGroupID so the key is stable across
reorderings and preserves everHealthy/warningActive state.

1100-1105: ⚠️ Potential issue | 🟠 Major

Reset held unhealthy streaks when the verdict becomes undecided.

A routed failure suppressed during the grace window sets unhealthySince; if that evidence ages out before a warning fires, the stale timestamp is retained and a later fresh failure can emit immediately.

Proposed fix
 		case nsVerdictUndecided:
 			// Stay Available until evidence says otherwise, unless a
 			// warning is already active for this group.
 			enabled = !p.warningActive
 			groupErr = nil
+			if !p.warningActive {
+				p.unhealthySince = time.Time{}
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/server.go` around lines 1100 - 1105, When handling
nsVerdictUndecided in the switch, reset the held unhealthy streak timestamp so
stale `unhealthySince` doesn't cause an immediate warning later; in the
nsVerdictUndecided branch (where you set `enabled = !p.warningActive` and
`groupErr = nil`), also clear `p.unhealthySince` (e.g., set it to the zero value
for its type, such as `time.Time{}` or `nil` if it's a pointer) so any aged
suppression is forgotten when the verdict becomes undecided.

ℹ️ Review info

Run ID: c31a09ca-0bb6-4fc7-ad93-eb026f0fb034

📥 Commits

Reviewing files that changed from the base of the PR and between 10ecf7c and f235e76.

📒 Files selected for processing (5)
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/upstream_test.go

@lixmal lixmal force-pushed the drop-dns-probes branch 2 times, most recently from d01b451 to d7a3195 Compare April 23, 2026 16:05
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)

199-210: ⚠️ Potential issue | 🟠 Major

Include race boundaries and fallback order in the handler ID.

Line 201 flattens and sorts upstreams before hashing, but the new model makes upstreamRace boundaries and in-race order behaviorally significant. Configs like [[a,b]] vs [[a],[b]], or [a,b] vs [b,a], now produce the same handler ID while resolving differently. Hash the races with delimiters and preserve fallback order.

🛠️ Proposed fix
 func (u *upstreamResolverBase) ID() types.HandlerID {
-	servers := u.flatUpstreams()
-	slices.SortFunc(servers, func(a, b netip.AddrPort) int { return a.Compare(b) })
-
 	hash := sha256.New()
 	hash.Write([]byte(u.domain.PunycodeString() + ":"))
-	for _, s := range servers {
-		hash.Write([]byte(s.String()))
-		hash.Write([]byte("|"))
+	for _, race := range u.upstreamServers {
+		hash.Write([]byte("["))
+		for _, s := range race {
+			hash.Write([]byte(s.String()))
+			hash.Write([]byte(","))
+		}
+		hash.Write([]byte("]"))
 	}
 	return types.HandlerID("upstream-" + hex.EncodeToString(hash.Sum(nil)[:8]))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 199 - 210, The current
upstreamResolverBase.ID() flattens and sorts servers via flatUpstreams(), losing
upstreamRace boundaries and fallback order; change the ID generation to iterate
the resolver's race groups in order (preserve each race's internal order or sort
within a race if intended), append a clear delimiter between races (e.g., "["
and "]" or another separator) when writing to the hash, and include the fallback
sequence as part of the hashed input so that different race groupings or
orderings (e.g., [[a,b]] vs [[a],[b]] or [a,b] vs [b,a]) produce distinct
HandlerIDs; update references in upstreamResolverBase.ID(), remove reliance on
flatUpstreams() for the hash input, and ensure the final
hex.EncodeToString(hash.Sum(nil)[:8]) still returns the ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream.go`:
- Around line 744-751: haMapContains currently treats any matching entry in
route.HAMap as "routed" even when the match comes from a dynamic (domain-based)
route placeholder Network, causing dynamic-only upstreams to bypass the startup
grace window; update haMapContains to detect dynamic routes (e.g., check route
entries for the dynamic flag or placeholder Network on the
route.Route/route.Entry objects) and return a tri-state or
"unknown/dynamic-selected" result instead of plain true, so callers can apply
the startup grace window for dynamic-only upstreams (or alternatively expose the
resolved dynamic route prefixes to this check and only consider concrete prefix
matches as routed).


ℹ️ Review info

Run ID: 6ea0f54b-ca5d-44f8-a7f1-ad2ed0f024e1

📥 Commits

Reviewing files that changed from the base of the PR and between f235e76 and d01b451.

📒 Files selected for processing (5)
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/server.go

Comment thread client/internal/dns/upstream.go Outdated
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)

199-210: ⚠️ Potential issue | 🟠 Major

Include race grouping and fallback order in the handler ID.

ID() sorts the flattened upstream set, but the new model makes both group boundaries and in-race order semantically meaningful. Configs like [A,B] vs [B,A], or one race [A,B] vs two races [A] [B], currently hash to the same handler ID and can prevent a changed resolver topology from being treated as a distinct handler.

Suggested fix
 func (u *upstreamResolverBase) ID() types.HandlerID {
-	servers := u.flatUpstreams()
-	slices.SortFunc(servers, func(a, b netip.AddrPort) int { return a.Compare(b) })
-
 	hash := sha256.New()
 	hash.Write([]byte(u.domain.PunycodeString() + ":"))
-	for _, s := range servers {
-		hash.Write([]byte(s.String()))
-		hash.Write([]byte("|"))
+	for _, race := range u.upstreamServers {
+		hash.Write([]byte("["))
+		for _, s := range race {
+			hash.Write([]byte(s.String()))
+			hash.Write([]byte(","))
+		}
+		hash.Write([]byte("]"))
 	}
 	return types.HandlerID("upstream-" + hex.EncodeToString(hash.Sum(nil)[:8]))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 199 - 210, The ID() currently
hashes the flattened server list from u.flatUpstreams(), losing race-group
boundaries and fallback ordering; update upstreamResolverBase.ID() to iterate
the actual upstream grouping structure (preserve group boundaries and the order
of servers within each race and the order of races) instead of using
flatUpstreams(), and include explicit separators/tokens when writing to the hash
(e.g., start/end-of-group markers and an inter-server separator) so different
configurations like [A,B] vs [B,A] or [A][B] produce distinct HandlerIDs while
still incorporating u.domain.PunycodeString().
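The collision described above is easy to reproduce in isolation. The sketch below is a standalone simplification, not the PR's actual code: plain strings stand in for netip.AddrPort, and the hypothetical flatID/groupedID helpers stand in for upstreamResolverBase.ID(). It shows that a flattened-and-sorted hash maps one race of two servers and two single-server races to the same ID, while boundary-aware hashing keeps them distinct.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// flatID mimics the current behavior: hash the sorted, flattened server set,
// discarding race boundaries and fallback order.
func flatID(races [][]string) string {
	var all []string
	for _, race := range races {
		all = append(all, race...)
	}
	sort.Strings(all)
	h := sha256.New()
	for _, s := range all {
		h.Write([]byte(s + "|"))
	}
	return hex.EncodeToString(h.Sum(nil)[:8])
}

// groupedID preserves race boundaries and in-race order with explicit
// start/end-of-group markers, as the suggested fix does.
func groupedID(races [][]string) string {
	h := sha256.New()
	for _, race := range races {
		h.Write([]byte("["))
		for _, s := range race {
			h.Write([]byte(s + ","))
		}
		h.Write([]byte("]"))
	}
	return hex.EncodeToString(h.Sum(nil)[:8])
}

func main() {
	oneRace := [][]string{{"1.1.1.1:53", "8.8.8.8:53"}} // [[a,b]]
	twoRaces := [][]string{{"1.1.1.1:53"}, {"8.8.8.8:53"}} // [[a],[b]]

	fmt.Println(flatID(oneRace) == flatID(twoRaces))       // collision
	fmt.Println(groupedID(oneRace) == groupedID(twoRaces)) // distinct
}
```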
♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)

758-765: ⚠️ Potential issue | 🟠 Major

Dynamic-route upstreams still bypass the startup grace path.

The documented gap means DNS upstreams reachable only through selected dynamic routes are classified as “not routed,” so failures emit immediately instead of using the startup grace window. That preserves a false-positive warning path this PR is trying to remove; consider a tri-state route classification or feeding resolved dynamic prefixes into this check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 758 - 765, haMapContains
currently treats any route with a placeholder Network (domain-based/dynamic
routes) as "not routed" which causes dynamic-only upstreams to skip the startup
grace window; update haMapContains (and related use of route.HAMap) to
distinguish three states (routed, not routed, unknown/dynamic) by detecting
dynamic/domain routes (e.g. via route.Route flags or route.IsDynamic/Domain
fields) and either (a) consult resolved dynamic prefixes (pass in or query a
resolved-prefixes map from the dynamic resolver) and treat matching resolved
prefixes as routed, or (b) return an "unknown" result that callers can treat as
routed-for-grace; adjust callers of haMapContains accordingly so dynamic-only
upstreams are not immediately marked failing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@client/internal/dns/upstream.go`:
- Around line 199-210: The ID() currently hashes the flattened server list from
u.flatUpstreams(), losing race-group boundaries and fallback ordering; update
upstreamResolverBase.ID() to iterate the actual upstream grouping structure
(preserve group boundaries and the order of servers within each race and the
order of races) instead of using flatUpstreams(), and include explicit
separators/tokens when writing to the hash (e.g., start/end-of-group markers and
an inter-server separator) so different configurations like [A,B] vs [B,A] or
[A][B] produce distinct HandlerIDs while still incorporating
u.domain.PunycodeString().

---

Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 758-765: haMapContains currently treats any route with a
placeholder Network (domain-based/dynamic routes) as "not routed" which causes
dynamic-only upstreams to skip the startup grace window; update haMapContains
(and related use of route.HAMap) to distinguish three states (routed, not
routed, unknown/dynamic) by detecting dynamic/domain routes (e.g. via
route.Route flags or route.IsDynamic/Domain fields) and either (a) consult
resolved dynamic prefixes (pass in or query a resolved-prefixes map from the
dynamic resolver) and treat matching resolved prefixes as routed, or (b) return
an "unknown" result that callers can treat as routed-for-grace; adjust callers
of haMapContains accordingly so dynamic-only upstreams are not immediately
marked failing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9bfbddb5-aeee-435a-855f-e6b2f36398c7

📥 Commits

Reviewing files that changed from the base of the PR and between d01b451 and d7a3195.

📒 Files selected for processing (5)
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/server.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/dns/upstream.go (1)

558-570: ⚠️ Potential issue | 🟠 Major

Copy the DNS client before setting UDPSize.

ExchangeWithFallback mutates the caller-provided *dns.Client at line 565. Under a parallel race model, a shared upstream client can race across concurrent exchanges or leak this per-request UDP size into subsequent requests. Mirror toTCPClient by copying the client for the UDP exchange path.

🛠️ Proposed fix
 	clientMaxSize := clientUDPMaxSize(r)
+	udpClient := *client
 
 	// Cap EDNS0 to our tunnel MTU so the upstream doesn't send a
 	// response larger than our read buffer.
 	// Note: the query could be sent out on an interface that is not ours,
 	// but higher MTU settings could break truncation handling.
 	maxUDPPayload := uint16(currentMTU - ipUDPHeaderSize)
-	client.UDPSize = maxUDPPayload
+	udpClient.UDPSize = maxUDPPayload
 	if opt := r.IsEdns0(); opt != nil && opt.UDPSize() > maxUDPPayload {
 		opt.SetUDPSize(maxUDPPayload)
 	}
 
-	rm, t, err := client.ExchangeContext(ctx, r, upstream)
+	rm, t, err := udpClient.ExchangeContext(ctx, r, upstream)
 	if err != nil {
 		return nil, t, fmt.Errorf("with udp: %w", err)
 	}
 
-	rm, t, err = toTCPClient(client).ExchangeContext(ctx, r, upstream)
+	rm, t, err = toTCPClient(&udpClient).ExchangeContext(ctx, r, upstream)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 558 - 570, The upstream DNS
client is being mutated (client.UDPSize and EDNS0 UDP size) before calling
ExchangeContext/ExchangeWithFallback, which can race when the same *dns.Client
is shared; fix by making a copy of the client before mutating and use that copy
for the UDP exchange path (mirror the toTCPClient approach): create a local copy
of the dns.Client value, set copy.UDPSize and adjust EDNS0 on the copy, then
call ExchangeContext/ExchangeWithFallback with &copy instead of the original
client to avoid leaking per-request UDPSize across concurrent requests.
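The copy-before-mutate pattern the fix relies on can be shown with a stripped-down stand-in. The client struct below is hypothetical, modeling only the scalar UDPSize field of miekg/dns's *dns.Client: mutating a value copy leaves the shared client's state untouched even under concurrent exchanges, whereas writing through the shared pointer both races and leaks the per-request size.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for a shared *dns.Client whose
// UDPSize field gets mutated per request.
type client struct {
	UDPSize uint16
}

// exchangeShared mutates the shared client: concurrent callers race on
// the write, and the per-request size leaks into later requests.
func exchangeShared(c *client, size uint16) uint16 {
	c.UDPSize = size
	return c.UDPSize
}

// exchangeCopied mirrors the proposed fix: take a value copy, mutate the
// copy, and use it for this exchange only. The shallow copy is safe here
// because the mutated field is a scalar.
func exchangeCopied(c *client, size uint16) uint16 {
	local := *c
	local.UDPSize = size
	return local.UDPSize
}

func main() {
	shared := &client{UDPSize: 4096}

	var wg sync.WaitGroup
	for _, s := range []uint16{512, 1232, 1400} {
		wg.Add(1)
		go func(sz uint16) {
			defer wg.Done()
			exchangeCopied(shared, sz) // never writes to the shared struct
		}(s)
	}
	wg.Wait()

	fmt.Println(shared.UDPSize) // 4096: untouched by concurrent exchanges
}
```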
♻️ Duplicate comments (1)
client/internal/dns/upstream.go (1)

769-776: ⚠️ Potential issue | 🟠 Major

Dynamic-route upstreams still bypass the startup grace path.

The documented gap means DNS upstreams reachable only via selected dynamic routes are classified as “not routed,” so failures warn immediately instead of using the overlay startup grace window. This keeps one false-positive path open for the PR’s main objective.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream.go` around lines 769 - 776, haMapContains
currently treats upstreams reachable only via dynamic (domain-based) routes as
"not routed" because it only matches concrete Networks, causing those upstreams
to bypass the startup grace window; update haMapContains to also consider
dynamic routes as a match — e.g., iterate the route.HAMap entries and treat
entries whose route.Network is the dynamic placeholder (or which are marked
dynamic) as a valid match for the given ip so they are classified as routed for
the DNS health startup grace path; reference haMapContains, route.HAMap,
route.Network (placeholder/dynamic marker) and the ip parameter when locating
and changing the matching logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@client/internal/dns/upstream.go`:
- Around line 558-570: The upstream DNS client is being mutated (client.UDPSize
and EDNS0 UDP size) before calling ExchangeContext/ExchangeWithFallback, which
can race when the same *dns.Client is shared; fix by making a copy of the client
before mutating and use that copy for the UDP exchange path (mirror the
toTCPClient approach): create a local copy of the dns.Client value, set
copy.UDPSize and adjust EDNS0 on the copy, then call
ExchangeContext/ExchangeWithFallback with &copy instead of the original client
to avoid leaking per-request UDPSize across concurrent requests.

---

Duplicate comments:
In `@client/internal/dns/upstream.go`:
- Around line 769-776: haMapContains currently treats upstreams reachable only
via dynamic (domain-based) routes as "not routed" because it only matches
concrete Networks, causing those upstreams to bypass the startup grace window;
update haMapContains to also consider dynamic routes as a match — e.g., iterate
the route.HAMap entries and treat entries whose route.Network is the dynamic
placeholder (or which are marked dynamic) as a valid match for the given ip so
they are classified as routed for the DNS health startup grace path; reference
haMapContains, route.HAMap, route.Network (placeholder/dynamic marker) and the
ip parameter when locating and changing the matching logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c0a71c83-1174-4b5a-b49f-9153607f45bb

📥 Commits

Reviewing files that changed from the base of the PR and between d7a3195 and b692d65.

📒 Files selected for processing (5)
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/upstream_ios.go

@sonarqubecloud
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/dns/upstream_test.go`:
- Around line 439-441: The comment incorrectly states "the broken group's
timeout (100ms)"—update the test comment near the assertion that checks elapsed
< 100*time.Millisecond to accurately reflect that 100ms is the assertion
threshold proving parallelism, not the configured broken-group timeout; mention
the actual per-upstream timeout variable upstreamTimeout (250ms) and the mock
delay (500ms) or reword to: "the assertion threshold (100ms) proves parallelism;
the broken group's configured timeout is upstreamTimeout (250ms) and its mock
delay is 500ms" so readers won't confuse the numbers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 927c4a2d-61d8-427d-9483-59940fd674b9

📥 Commits

Reviewing files that changed from the base of the PR and between b692d65 and d0f9d80.

📒 Files selected for processing (5)
  • client/internal/dns/server.go
  • client/internal/dns/server_test.go
  • client/internal/dns/upstream.go
  • client/internal/dns/upstream_ios.go
  • client/internal/dns/upstream_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/dns/server.go

Comment on lines +439 to +441
// Working group answers in a single RTT; the broken group's
// timeout (100ms) must not block the response.
assert.Less(t, elapsed, 100*time.Millisecond, "race must not wait for broken group's timeout")

⚠️ Potential issue | 🟡 Minor

Comment doesn't match the configured values.

The inline comment says the broken group's timeout (100ms), but the broken group's actual per-upstream timeout is upstreamTimeout = 250ms (line 417) and its mock delay is 500ms (line 405). The 100 * time.Millisecond here is just the assertion threshold used to prove parallelism, not a configured timeout.

✏️ Suggested wording
-	// Working group answers in a single RTT; the broken group's
-	// timeout (100ms) must not block the response.
+	// Working group answers in a single RTT. Assert well below the broken
+	// group's 250ms per-upstream timeout (and its 500ms mock delay) to
+	// prove races run in parallel, not serially.
 	assert.Less(t, elapsed, 100*time.Millisecond, "race must not wait for broken group's timeout")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Working group answers in a single RTT; the broken group's
// timeout (100ms) must not block the response.
assert.Less(t, elapsed, 100*time.Millisecond, "race must not wait for broken group's timeout")
// Working group answers in a single RTT. Assert well below the broken
// group's 250ms per-upstream timeout (and its 500ms mock delay) to
// prove races run in parallel, not serially.
assert.Less(t, elapsed, 100*time.Millisecond, "race must not wait for broken group's timeout")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/internal/dns/upstream_test.go` around lines 439 - 441, The comment
incorrectly states "the broken group's timeout (100ms)"—update the test comment
near the assertion that checks elapsed < 100*time.Millisecond to accurately
reflect that 100ms is the assertion threshold proving parallelism, not the
configured broken-group timeout; mention the actual per-upstream timeout
variable upstreamTimeout (250ms) and the mock delay (500ms) or reword to: "the
assertion threshold (100ms) proves parallelism; the broken group's configured
timeout is upstreamTimeout (250ms) and its mock delay is 500ms" so readers won't
confuse the numbers.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

❌ Failed to clone repository into sandbox. Please try again.



Development

Successfully merging this pull request may close these issues.

[Bug] macOS: DNS probe fires before WireGuard routes are established — false "Unable to reach DNS servers" warning on every reconnection

1 participant