[client] Fix WGIface.Close deadlock when DNS filter hook re-enters GetDevice #5916
WGIface.Close() took w.mu and held it across w.tun.Close(). The
underlying wireguard-go device waits for its send/receive goroutines to
drain before Close() returns, and some of those goroutines re-enter
WGIface during shutdown. In particular, the userspace packet filter DNS
hook in client/internal/dns.ServiceViaMemory.filterDNSTraffic calls
s.wgInterface.GetDevice() on every packet, which also needs w.mu. With
the Close-side holding the mutex, the read goroutine blocks in
GetDevice and Close waits forever for that goroutine to exit:

    goroutine N (TestDNSPermanent_updateUpstream):
      WGIface.Close -> holds w.mu -> tun.Close -> sync.WaitGroup.Wait
    goroutine M (wireguard read routine):
      FilteredDevice.Read -> filterOutbound -> udpHooksDrop ->
      filterDNSTraffic.func1 -> WGIface.GetDevice -> sync.Mutex.Lock

This surfaces as a 5 minute test timeout on the macOS Client/Unit
CI job (panic: test timed out after 5m0s, running tests:
TestDNSPermanent_updateUpstream).
Release w.mu before calling w.tun.Close(). The other Close steps
(wgProxyFactory.Free, waitUntilRemoved, Destroy) do not mutate any
fields guarded by w.mu beyond what Free() already does, so the lock
is not needed once the tun has started shutting down. A new unit test
in iface_close_test.go uses a fake WGTunDevice to reproduce the
deadlock deterministically without requiring CAP_NET_ADMIN.
Reviewed lines in client/iface/iface_close_test.go (around lines 107-112):

    close(tun.unblockClose)
    select {
    case <-closeDone:
    case <-time.After(2 * time.Second):
        t.Fatal("WGIface.Close() never returned after the tun was unblocked")
    }

Assert the Close() result instead of discarding it.
If shutdown returns an unexpected error, this regression test currently still passes. Checking the buffered error keeps the test focused on both “no deadlock” and “clean close”.
Proposed test tightening:

     close(tun.unblockClose)
     select {
    -case <-closeDone:
    +case err := <-closeDone:
    +    if err != nil {
    +        t.Fatalf("unexpected close error: %v", err)
    +    }
     case <-time.After(2 * time.Second):
         t.Fatal("WGIface.Close() never returned after the tun was unblocked")
     }



Describe your changes
`WGIface.Close()` takes `w.mu` and holds it across `w.tun.Close()`. The underlying wireguard-go device waits for its send/receive goroutines to drain before returning, and some of those goroutines re-enter `WGIface` during shutdown.

Specifically, the userspace packet filter DNS hook in `client/internal/dns.ServiceViaMemory.filterDNSTraffic` calls `s.wgInterface.GetDevice()` on every packet, which also needs `w.mu`. With `Close` holding the mutex, the read goroutine blocks in `GetDevice`, and `Close` waits forever for that goroutine to exit.

Fix: release `w.mu` before calling `w.tun.Close()`. The remaining steps in `Close` (`waitUntilRemoved`, `Destroy`) only call `w.Name()`, which reads `w.tun.DeviceName()` lock-free, so they don't need the mutex either.

Stack trace from a failing CI run (condensed):

    goroutine N (TestDNSPermanent_updateUpstream):
      WGIface.Close -> holds w.mu -> tun.Close -> sync.WaitGroup.Wait
    goroutine M (wireguard read routine):
      FilteredDevice.Read -> filterOutbound -> udpHooksDrop ->
      filterDNSTraffic.func1 -> WGIface.GetDevice -> sync.Mutex.Lock

This surfaces as a 5-minute timeout on the macOS Client / Unit CI job:

    panic: test timed out after 5m0s, running tests: TestDNSPermanent_updateUpstream

Seen e.g. in #5807 and earlier PRs that trigger the macOS runner. The failure is unrelated to those PRs' code paths.
Issue ticket number and link
No tracking issue; discovered while triaging the flaky macOS CI job referenced above.
Stack
Checklist
Documentation
Select exactly one:
Internal concurrency fix in `client/iface`. The public `WGIface` API surface and its behavior are unchanged; callers see the same methods with the same contract. No user-visible configuration or workflow changes.

Docs PR URL (required if "docs added" is checked)

N/A
Test
New `client/iface/iface_close_test.go` reproduces the deadlock with a fake `WGTunDevice` whose `Close()` blocks on a channel, simulating the wireguard-go goroutine drain. A parallel `GetDevice()` call would block forever under the old code; with the fix it returns immediately.

- Failure reported when the deadlock is present: `FAIL: TestWGIface_CloseReleasesMutexBeforeTunClose`, with the message "GetDevice() deadlocked while WGIface.Close was closing the tun".
- `go test -race -count=20` is clean.

The test does not need `CAP_NET_ADMIN` or `/dev/net/tun`, so it runs on all CI environments, including sandboxed containers.