Skip to content

[client] Fix WGIface.Close deadlock when DNS filter hook re-enters GetDevice#5916

Merged
pappz merged 1 commit intonetbirdio:mainfrom
MichaelUray:fix/wgiface-close-deadlock
Apr 20, 2026
Merged

[client] Fix WGIface.Close deadlock when DNS filter hook re-enters GetDevice#5916
pappz merged 1 commit intonetbirdio:mainfrom
MichaelUray:fix/wgiface-close-deadlock

Conversation

@MichaelUray
Copy link
Copy Markdown
Contributor

@MichaelUray MichaelUray commented Apr 18, 2026

Describe your changes

WGIface.Close() takes w.mu and holds it across w.tun.Close(). The underlying wireguard-go device waits for its send/receive goroutines to drain before returning, and some of those goroutines re-enter WGIface during shutdown.

Specifically, the userspace packet filter DNS hook in client/internal/dns.ServiceViaMemory.filterDNSTraffic calls s.wgInterface.GetDevice() on every packet — which also needs w.mu. With Close holding the mutex, the read goroutine blocks in GetDevice, and Close waits forever for that goroutine to exit.

Fix: release w.mu before calling w.tun.Close(). The remaining steps in Close (waitUntilRemoved, Destroy) only call w.Name(), which reads w.tun.DeviceName() lock-free, so they don't need the mutex either.

Stack trace from a failing CI run:

goroutine 1352 [sync.WaitGroup.Wait, 4 minutes]:
  WGIface.Close  -> holds w.mu
  TunDevice.Close
  wireguard.Device.Close  -> WaitGroup.Wait (blocked)
  TestDNSPermanent_updateUpstream

goroutine 1367 [sync.Mutex.Lock, 4 minutes]:
  WGIface.GetDevice  -> blocked on w.mu
  ServiceViaMemory.filterDNSTraffic.func1
  udpHooksDrop
  filterOutbound
  FilteredDevice.Read
  wireguard.Device.RoutineReadFromTUN

This surfaces as a 5-minute timeout on the macOS Client / Unit CI job:

panic: test timed out after 5m0s
  running tests:
    TestDNSPermanent_updateUpstream (4m54s)

Seen e.g. in #5807 and earlier PRs that trigger the macOS runner — the failure is unrelated to those PRs' code paths.

Issue ticket number and link

No tracking issue; discovered while triaging the flaky macOS CI job referenced above.

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Internal concurrency fix in client/iface. The public WGIface API surface and its behavior are unchanged; callers see the same methods with the same contract. No user-visible configuration or workflow changes.

Docs PR URL (required if "docs added" is checked)

N/A

Test

New client/iface/iface_close_test.go reproduces the deadlock with a fake WGTunDevice whose Close() blocks on a channel — simulating the wireguard-go goroutine drain. A parallel GetDevice() call would block forever under the old code; with the fix it returns immediately.

  • Without the fix: FAIL: TestWGIface_CloseReleasesMutexBeforeTunClose — GetDevice() deadlocked while WGIface.Close was closing the tun
  • With the fix: 20× go test -race -count=20 clean.

The test does not need CAP_NET_ADMIN / /dev/net/tun, so it runs on all CI environments including sandboxed containers.

…tDevice

WGIface.Close() took w.mu and held it across w.tun.Close(). The
underlying wireguard-go device waits for its send/receive goroutines to
drain before Close() returns, and some of those goroutines re-enter
WGIface during shutdown. In particular, the userspace packet filter DNS
hook in client/internal/dns.ServiceViaMemory.filterDNSTraffic calls
s.wgInterface.GetDevice() on every packet, which also needs w.mu. With
the Close-side holding the mutex, the read goroutine blocks in
GetDevice and Close waits forever for that goroutine to exit:

  goroutine N (TestDNSPermanent_updateUpstream):
    WGIface.Close -> holds w.mu -> tun.Close -> sync.WaitGroup.Wait
  goroutine M (wireguard read routine):
    FilteredDevice.Read -> filterOutbound -> udpHooksDrop ->
    filterDNSTraffic.func1 -> WGIface.GetDevice -> sync.Mutex.Lock

This surfaces as a 5 minute test timeout on the macOS Client/Unit
CI job (panic: test timed out after 5m0s, running tests:
TestDNSPermanent_updateUpstream).

Release w.mu before calling w.tun.Close(). The other Close steps
(wgProxyFactory.Free, waitUntilRemoved, Destroy) do not mutate any
fields guarded by w.mu beyond what Free() already does, so the lock
is not needed once the tun has started shutting down. A new unit test
in iface_close_test.go uses a fake WGTunDevice to reproduce the
deadlock deterministically without requiring CAP_NET_ADMIN.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

Walkthrough

This PR fixes a deadlock issue in the WGIface shutdown sequence. The Close() method was modified to release the mutex before calling tun.Close(), preventing callbacks triggered during tunnel shutdown from deadlocking when they attempt to re-acquire the mutex through WGIface methods.

Changes

Cohort / File(s) Summary
Mutex Scope Refactoring
client/iface/iface.go
Restructured (*WGIface).Close() to store the tun device reference, unlock the mutex, and then call tun.Close() outside the locked section. Prevents deadlock when shutdown callbacks invoke WGIface methods requiring the mutex.
Deadlock Prevention Test
client/iface/iface_close_test.go
Added comprehensive test TestWGIface_CloseReleasesMutexBeforeTunClose using channel-based synchronization to validate that concurrent WGIface method calls do not deadlock during Close() execution.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • pappz

Poem

🐰 A mutex held too long, a lock before the way,
Release it first, let callbacks play!
No deadlock here, just smooth control flow,
Close, unlock, and let the tunnel go! 🌊

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing a WGIface.Close deadlock caused by re-entrant GetDevice calls, which is precisely what the PR addresses.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering the bug, root cause, fix details, stack traces, test validation, and documentation rationale.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/iface/iface_close_test.go`:
- Around line 107-112: The test currently only waits for the Close() goroutine
to finish after closing tun.unblockClose, but it discards any returned error;
adjust the assertion to receive the Close result from closeDone (e.g., err :=
<-closeDone within the select) and fail the test if err is non-nil so the test
verifies both that WGIface.Close() returns and that it returns a nil/clean
error; reference the goroutine result channel closeDone and the unblock trigger
tun.unblockClose and assert the value coming back from the Close call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 728a0651-9e49-4375-9d26-756ad6aefce8

📥 Commits

Reviewing files that changed from the base of the PR and between 8ae8f20 and 764bee5.

📒 Files selected for processing (2)
  • client/iface/iface.go
  • client/iface/iface_close_test.go

Comment on lines +107 to +112
close(tun.unblockClose)
select {
case <-closeDone:
case <-time.After(2 * time.Second):
t.Fatal("WGIface.Close() never returned after the tun was unblocked")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Assert the Close() result instead of discarding it.

If shutdown returns an unexpected error, this regression test currently still passes. Checking the buffered error keeps the test focused on both “no deadlock” and “clean close”.

🧪 Proposed test tightening
 close(tun.unblockClose)
 select {
-case <-closeDone:
+case err := <-closeDone:
+	if err != nil {
+		t.Fatalf("unexpected close error: %v", err)
+	}
 case <-time.After(2 * time.Second):
 	t.Fatal("WGIface.Close() never returned after the tun was unblocked")
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
close(tun.unblockClose)
select {
case <-closeDone:
case <-time.After(2 * time.Second):
t.Fatal("WGIface.Close() never returned after the tun was unblocked")
}
close(tun.unblockClose)
select {
case err := <-closeDone:
if err != nil {
t.Fatalf("unexpected close error: %v", err)
}
case <-time.After(2 * time.Second):
t.Fatal("WGIface.Close() never returned after the tun was unblocked")
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/iface/iface_close_test.go` around lines 107 - 112, The test currently
only waits for the Close() goroutine to finish after closing tun.unblockClose,
but it discards any returned error; adjust the assertion to receive the Close
result from closeDone (e.g., err := <-closeDone within the select) and fail the
test if err is non-nil so the test verifies both that WGIface.Close() returns
and that it returns a nil/clean error; reference the goroutine result channel
closeDone and the unblock trigger tun.unblockClose and assert the value coming
back from the Close call.

@pappz pappz merged commit e361126 into netbirdio:main Apr 20, 2026
45 of 49 checks passed
@MichaelUray MichaelUray deleted the fix/wgiface-close-deadlock branch April 25, 2026 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants