[client] Fix ICE reconnection loop and buffer candidates before agent initialization #5805
MichaelUray wants to merge 3 commits into netbirdio:main from
Conversation
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the corresponding setting. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough
Added an agent loop-protection window and an …
Changes
Sequence Diagram
sequenceDiagram
participant Client
participant WorkerICE
participant Agent
Client->>WorkerICE: OnNewOffer(remoteSessionID)
WorkerICE->>WorkerICE: if w.agent != nil && agent.State == Connecting
alt agent.SessionID != remoteSessionID and (now - w.agentStartTime) < agentLoopProtectionWindow
WorkerICE-->>Client: return (skip replacing recent connecting agent)
else
WorkerICE->>Agent: Close old agent (if exists)
WorkerICE->>Agent: Create new agent
WorkerICE->>WorkerICE: w.agentStartTime = time.Now()
Agent-->>WorkerICE: Agent enters Connecting
end
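To make the diagram concrete, here is a minimal, self-contained sketch of the time-windowed guard it describes. The struct, field and method names below are illustrative stand-ins (the real logic lives in `WorkerICE.OnNewOffer` in `client/internal/peer/worker_ice.go`), and the 3-second constant mirrors the `agentLoopProtectionWindow` mentioned in the walkthrough; this is a sketch, not the actual implementation.

```go
package peer

import (
	"sync"
	"time"
)

// agentLoopProtectionWindow mirrors the 3s window described in the PR.
const agentLoopProtectionWindow = 3 * time.Second

// offer and worker are minimal stand-ins so the sketch compiles on its own;
// the real WorkerICE carries the ICE agent, logging, config, and more.
type offer struct{ SessionID string }

type worker struct {
	mu              sync.Mutex
	agentActive     bool
	remoteSessionID string
	agentStartTime  time.Time
}

// onNewOffer accepts or skips an incoming offer. Offers with a different
// session ID are skipped only while a freshly created agent is still inside
// the protection window; once the window has passed, a new session ID (for
// example, crash recovery on the remote side) replaces the agent as before.
func (w *worker) onNewOffer(o offer) (accepted bool) {
	w.mu.Lock()
	defer w.mu.Unlock()

	if w.agentActive &&
		o.SessionID != w.remoteSessionID &&
		time.Since(w.agentStartTime) < agentLoopProtectionWindow {
		return false // let the in-flight ICE attempt finish
	}

	// Replace (or create) the agent and restart the protection window.
	w.agentActive = true
	w.remoteSessionID = o.SessionID
	w.agentStartTime = time.Now()
	return true
}
```

Offers carrying the same session ID as the current attempt are still treated as duplicates exactly as before; the window only affects offers that would otherwise tear down a just-created agent.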
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)
184-189: Consider adding a buffer size limit as a defensive measure.
The buffer could theoretically grow unbounded if candidates keep arriving without agent creation (e.g., due to signaling issues). While unlikely in practice, a simple cap would prevent memory accumulation in pathological cases.
♻️ Optional: Add buffer cap
```diff
+const maxPendingCandidates = 50
+
 func (w *WorkerICE) OnRemoteCandidate(candidate ice.Candidate, haRoutes route.HAMap) {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
 	w.log.Debugf("OnRemoteCandidate from peer %s -> %s", w.config.Key, candidate.String())
 	if w.agent == nil {
+		if len(w.pendingCandidates) >= maxPendingCandidates {
+			w.log.Warnf("pending candidate buffer full, dropping candidate: %s", candidate.Type())
+			return
+		}
 		w.log.Infof("ICE Agent not ready, buffering remote candidate: %s", candidate.Type())
 		w.pendingCandidates = append(w.pendingCandidates, candidate)
 		return
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/peer/worker_ice.go` around lines 184 - 189, The pendingCandidates slice can grow unbounded when w.agent == nil; add a fixed cap (e.g., MAX_PENDING_CANDIDATES constant) and enforce it in the buffering branch inside the OnNewCandidate / handler where w.pendingCandidates is appended: if len(w.pendingCandidates) >= MAX_PENDING_CANDIDATES, either drop the oldest (shift) or drop the new candidate and log a warning via w.log.Infof/warnf, otherwise append; ensure the constant and behavior are documented in comments and referenced where pendingCandidates is created/cleared (use symbols w.pendingCandidates, w.agent, candidate).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 184-189: The pendingCandidates slice can grow unbounded when
w.agent == nil; add a fixed cap (e.g., MAX_PENDING_CANDIDATES constant) and
enforce it in the buffering branch inside the OnNewCandidate / handler where
w.pendingCandidates is appended: if len(w.pendingCandidates) >=
MAX_PENDING_CANDIDATES, either drop the oldest (shift) or drop the new candidate
and log a warning via w.log.Infof/warnf, otherwise append; ensure the constant
and behavior are documented in comments and referenced where pendingCandidates
is created/cleared (use symbols w.pendingCandidates, w.agent, candidate).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aabe6227-0bce-4e6d-abac-3b3c6358dde1
📒 Files selected for processing (1)
client/internal/peer/worker_ice.go
Force-pushed from 28e3a06 to 64e91ae (Compare)
Addresses coderabbitai review on PR netbirdio#5805: the pendingCandidates slice could theoretically grow unbounded if remote candidates keep arriving before the ICE agent is created (e.g. due to signaling issues). Add a fixed cap of 50 candidates and drop further candidates with a warning log to protect against memory accumulation in pathological scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 68-69: Buffered remote candidates stored in pendingCandidates
bypass the normal pipeline (candidateViaRoutes and shouldAddExtraCandidate),
causing different behavior depending on arrival timing; fix by applying the same
routing/filtering before buffering and by replaying buffered entries through the
same helper used by the live path: call candidateViaRoutes() and run
shouldAddExtraCandidate() (or the unified helper that encapsulates both) for
each candidate before pushing to pendingCandidates, and when draining
pendingCandidates later invoke the same AddRemoteCandidate() path used for live
arrivals so routed/filtered/synthetic-related-port candidates are handled
identically (update functions: pendingCandidates, candidateViaRoutes,
shouldAddExtraCandidate, AddRemoteCandidate).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 73288d66-2438-465a-b0b8-92d5cf6eed4e
📒 Files selected for processing (1)
client/internal/peer/worker_ice.go
@MichaelUray Thank you for the PR. The race you're protecting against — OnRemoteCandidate arriving before the agent is created — requires very specific timing: the remote peer would need to create its agent, start gathering, discover a candidate, and signal it back, all before our handshaker goroutine reads the offer from its channel and calls OnNewOffer. That's microseconds (channel read) vs tens-to-hundreds of milliseconds (STUN/TURN network I/O for candidate gathering), so in practice this almost never happens. If we ever wanted to truly fix this race, the correct approach would be to remove the `go` from engine.go:1612. But of course it is a risky change.
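To make the ordering point concrete, here is a toy, self-contained sketch of a signaling receive loop. It is not the actual engine.go code; the message kinds, types and function names are invented for illustration. The only change pappz is describing is dropping the `go` so candidates are handled in the same goroutine, and therefore in the same order, as offers.

```go
package main

import "fmt"

type message struct {
	kind    string // "offer" or "candidate"
	payload string
}

type peerConn struct{}

func (peerConn) onNewOffer(m message)        { fmt.Println("offer handled:", m.payload) }
func (peerConn) onRemoteCandidate(m message) { fmt.Println("candidate handled:", m.payload) }

// receiveLoop processes signaling messages in arrival order. Handling the
// candidate synchronously (instead of `go c.onRemoteCandidate(m)`) means a
// candidate can never be processed concurrently with, or reordered against,
// the offer it belongs to within this loop.
func receiveLoop(c peerConn, msgs <-chan message) {
	for m := range msgs {
		switch m.kind {
		case "offer":
			c.onNewOffer(m)
		case "candidate":
			// before: go c.onRemoteCandidate(m)
			c.onRemoteCandidate(m) // after: synchronous
		}
	}
}

func main() {
	msgs := make(chan message, 2)
	msgs <- message{kind: "offer", payload: "session-A"}
	msgs <- message{kind: "candidate", payload: "host 192.0.2.10"}
	close(msgs)
	receiveLoop(peerConn{}, msgs)
}
```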
The session ID mechanism exists for two specific reasons:
By skipping all offers with a different session ID while …
…andidates Address coderabbitai review on PR netbirdio#5805: previously buffered candidates bypassed candidateViaRoutes() (route-loop filter) and shouldAddExtraCandidate() (extra server-reflexive helper), so a candidate arriving before the agent existed could be added later in a state inconsistent with a candidate arriving after the agent existed. Worse, the buffered path missed the extra-srflx-with-related-port candidate that the live path adds. Refactor OnRemoteCandidate to compute the full list of candidates that should reach the agent (the routed-out filter and the optional extra srflx) before deciding whether to buffer or add immediately. The same filter and extension now apply unconditionally; the only difference between the two paths is whether we hand the result to the agent right now or stash it for the next flush. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)
238-247: Minor: Log message may undercount dropped candidates.
When the buffer is full, the `return` exits the function immediately. If `toAdd` contains multiple candidates (original + extra srflx), all remaining candidates in the slice are dropped but only one is logged. This is a minor logging clarity issue and doesn't affect correctness.
💡 Optional improvement for logging accuracy
```diff
 for _, c := range toAdd {
 	if len(w.pendingCandidates) >= maxPendingCandidates {
-		w.log.Warnf("pending candidate buffer full (%d), dropping candidate: %s",
-			maxPendingCandidates, c.Type())
+		w.log.Warnf("pending candidate buffer full (%d), dropping %d candidate(s)",
+			maxPendingCandidates, len(toAdd)-len(w.pendingCandidates)+maxPendingCandidates)
 		return
 	}
```
Or simply break out of the loop and log once after counting drops.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/internal/peer/worker_ice.go` around lines 238 - 247, The loop over toAdd currently returns on first overflow, underreporting dropped candidates; change the logic in the worker (the loop that appends to w.pendingCandidates) to stop iterating when len(w.pendingCandidates) >= maxPendingCandidates (use break instead of return), compute how many candidates from toAdd were not buffered (e.g., dropped := len(toAdd) - i or maintain a dropped counter), and emit a single w.log.Warnf that reports the total number of dropped candidates along with maxPendingCandidates and candidate types if desired; ensure you still append buffered candidates to w.pendingCandidates and retain the info log w.log.Infof for buffered entries.
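One possible shape for the "count drops, warn once" variant suggested above, written as a standalone sketch: the struct, slice type and log call are illustrative stand-ins, while the cap constant and the warning-once idea mirror the `maxPendingCandidates` / `w.log.Warnf` usage shown in the diffs in this PR.

```go
package peer

import "log"

const maxPendingCandidates = 50

type bufferingWorker struct {
	pendingCandidates []string
}

// bufferCandidates appends as many candidates as fit under the cap and emits a
// single warning with the total number dropped, instead of returning on the
// first overflow (which under-reports drops when toAdd holds more than one
// entry, e.g. the original candidate plus the extra srflx one).
func (w *bufferingWorker) bufferCandidates(toAdd []string) {
	dropped := 0
	for _, c := range toAdd {
		if len(w.pendingCandidates) >= maxPendingCandidates {
			dropped++
			continue
		}
		w.pendingCandidates = append(w.pendingCandidates, c)
	}
	if dropped > 0 {
		log.Printf("pending candidate buffer full (%d), dropped %d candidate(s)",
			maxPendingCandidates, dropped)
	}
}
```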
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 157-165: The current guard w.agentConnecting unconditionally skips
new offers while an ICE attempt is in progress, which avoids reconnection loops
but can delay recovery when a remote peer crashes and reconnects with a new
session; change this by making the blocking behavior conditional or timeboxed:
add a configurable short timeout or an early-abort check that compares the
incoming remoteOfferAnswer.SessionIDString() to the session involved in the
current ICE attempt and, if they differ, either abort the ongoing attempt
(invoke the existing ICE cancel/cleanup routine) or allow the new offer to
preempt after a configurable threshold (e.g., w.iceConnectingTimeout or a method
like abortConnectingIfNewSession(remoteSessionID)); update logs via
w.log.Infof/w.log.Warnf to record preemption/timeout decisions.
- Around line 180-184: The current call to NewICESessionID() logs an error but
leaves w.sessionID unchanged on failure; update the error handling so you do not
keep a stale ID: only assign w.sessionID = sessionID when err == nil, and on err
either set w.sessionID = "" (or a known-invalid value) or return an error from
the enclosing function; modify the block around NewICESessionID() so it checks
err, clears w.sessionID on failure (or returns the error) and logs the failure
with context before exiting.
---
Nitpick comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 238-247: The loop over toAdd currently returns on first overflow,
underreporting dropped candidates; change the logic in the worker (the loop that
appends to w.pendingCandidates) to stop iterating when len(w.pendingCandidates)
>= maxPendingCandidates (use break instead of return), compute how many
candidates from toAdd were not buffered (e.g., dropped := len(toAdd) - i or
maintain a dropped counter), and emit a single w.log.Warnf that reports the
total number of dropped candidates along with maxPendingCandidates and candidate
types if desired; ensure you still append buffered candidates to
w.pendingCandidates and retain the info log w.log.Infof for buffered entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8d2c91f1-409c-45ea-93ba-b1dd7570a51f
📒 Files selected for processing (1)
client/internal/peer/worker_ice.go
Force-pushed from e31c5de to 377093f (Compare)
Thanks @pappz for the thorough review! I've reworked the PR based on your feedback:

Candidate buffering — removed. You're right that the race is practically impossible given the timing constraints (microsecond channel read vs hundreds of ms STUN/TURN I/O). All three buffering commits have been dropped.

Session-ID skip during agentConnecting — replaced with a time-windowed guard. Instead of unconditionally skipping offers while connecting, the new approach only skips offers with a different session ID during the first 3 seconds after agent creation. This addresses both concerns:

The 3s window aligns with the guard's interval. One factual note: the ICE FailedTimeout in this codebase is 6 seconds. Tested on a real Android device (Galaxy S26) with multiple peers in the same LAN (same public IP) — direct P2P connections establish reliably without the reconnection loop.
Force-pushed from 377093f to 7720145 (Compare)
Addressed the coderabbitai review on the candidate buffer path in commit 7720145 — buffered candidates now flow through the same pipeline as live candidates.
@pappz friendly ping — all review feedback has been addressed. Ready for another look whenever you have time. Thanks!
…OnRemoteCandidate synchronously Per pappz's review on PR netbirdio#5805: 1. Remove the candidate buffer entirely. The race between OnRemoteCandidate and OnNewOffer requires the remote peer to complete STUN gathering and signal a candidate back in microseconds (channel read time), which is far faster than the tens-to-hundreds of milliseconds of network I/O needed for candidate discovery. In practice this almost never happens. 2. Remove the 'go' prefix from conn.OnRemoteCandidate() in engine.go so candidates are processed synchronously in the signaling receive loop. This is the correct fix for the theoretical race: candidates now arrive in the same goroutine as offers, eliminating the ordering issue without needing a buffer. The agentLoopProtectionWindow guard (3s) is kept as-is since it correctly handles both the reconnection loop scenario (same-NAT peers) and crash recovery (new session ID accepted after the window expires).
@pappz Following up with detailed test results on both the candidate buffer and the synchronous OnRemoteCandidate change.

Candidate buffer: confirmed unnecessary
I deployed the current PR branch (without buffer, with synchronous OnRemoteCandidate) to a Cudy TR3000 router (OpenWrt, ARM64) and tested against multiple peers:

The "ICE Agent is not initialized yet" warning still appears (even with synchronous OnRemoteCandidate — see below), but P2P works regardless. The lost candidates are re-signaled on subsequent guard rounds. You were right that the buffer has no practical impact.

Synchronous OnRemoteCandidate: doesn't fix the race, but doesn't hurt
The race condition isn't at the goroutine level — it's at the signaling message level. The remote peer sends host candidates (instant, no STUN needed) before sending its answer. Host candidates don't need STUN/TURN — they're discovered instantly when the agent is created. So the remote peer signals them before signaling the answer. That said, making it synchronous is still a reasonable cleanup (removes unnecessary goroutine spawn), so I've kept that change.

The 1040 lost candidates from uray-mic-d4
These are from the guard loop on the stock v0.67.1 client — it sends new offers every ~12-30s, each with a new session ID. Each time, the remote peer's host candidates arrive before the answer. The time-windowed guard (3s) on our side prevents the reconnection loop stomping, but the old client keeps re-triggering. The lost candidates accumulate but don't prevent P2P once both sides have the fix. uray-mic-d4 stays Relayed not because of lost candidates, but because it's behind a different NAT and its guard loop prevents ICE from completing (same root cause this PR fixes).

Current PR state

Ready for another look whenever you have time.
@MichaelUray Could you send a debug bundle from both peers where you highlight the exact sequence that demonstrates the problem you want to fix? Ideally I'd like to see the OnNewOffer lines with session IDs across 3-4 guard cycles on both sides, so we can trace where the session-ID dedup fails.
Hi @pappz, here is the debug evidence you requested. The analysis below was captured on stock …

Reproduction environment
Two peers, same LAN, same public IP. IPv6 + IPv4 peer-to-peer UDP blocked at netfilter/WFP to force ICE FailedTimeout (relay/signal/management paths untouched). Without that block ICE succeeds fast enough that …

PR #5805 – Session-ID Dedup Failure Evidence

Setup
The "force ICE failure" step is important: on a happy-path same-LAN pair …

Log fields

Full chronological trace
Captured on stock …

Session-ID timeline (7 cycles in 109s)
Every ~12s (= …)

Where session-ID dedup fails
Every guard cycle rotates the session ID on both sides, i.e. the dedup check

```go
if w.remoteSessionID == *remoteOfferAnswer.SessionID {
	// skip
}
```

can never match. That is the core of the loop:

Because the rotation on both sides is triggered by the same global signal …

Does the session-ID mechanism fulfil its two documented purposes?
Yes, both still work inside a single agent lifetime. The merged trace at …

Current PR #5805 behaviour
With …

Files in this bundle
The instrumentation patch (against …)
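To illustrate the failure mode described above with something runnable, here is a toy model. The IDs and cycle counts are entirely made up (this is not instrumented output): when both peers mint a fresh session ID on every guard cycle, a dedup check that only compares the incoming ID against the last stored one never fires, so every cycle looks like a brand-new offer and the agent is torn down again.

```go
package main

import "fmt"

// Toy model of the lockstep loop: both sides rotate their session ID each
// guard cycle, so the equality check below can never match.
func main() {
	stored := "session-0" // last remote session ID we saw
	for cycle := 1; cycle <= 3; cycle++ {
		incoming := fmt.Sprintf("session-%d", cycle) // remote rotated its ID too
		if stored == incoming {
			fmt.Println("duplicate offer skipped")
		} else {
			fmt.Println("offer accepted, ICE agent replaced") // happens every cycle
			stored = incoming
		}
	}
}
```

The time-windowed guard breaks this pattern by refusing to replace a just-created agent even though the session ID differs.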
Thank you for your analysis. I tried to reproduce the issue to confirm the original report. In a lab environment, I introduced 1, 1.5, 3, etc. second delays in the connection. The P2P connection always succeeded and never got stuck in a loop. Could you please provide instructions on how I can repeat your tests?
…public IP

When two peers are behind the same NAT (same public IP), ICE P2P connections could never be established because of a reconnection loop:

1. Guard on peer A sees no P2P connection and sends a new offer
2. Peer B receives the offer with a new session ID
3. Peer B cancels its active ICE agent and creates a new one
4. Before the new ICE agent can complete connectivity checks, peer B's guard sends a new offer to peer A
5. Peer A cancels its ICE agent → cycle repeats indefinitely

The fix adds a short protection window (3 seconds) after agent creation during which new offers with different session IDs are skipped. This is long enough to prevent the guard-vs-guard race (guards fire every 3s) but short enough to not delay crash recovery — the ICE FailedTimeout is only 6 seconds, so at worst recovery is delayed by 3s.

This replaces the earlier unconditional skip approach based on reviewer feedback (pappz): unconditional blocking during agentConnecting would delay crash-recovery by the full ICE timeout (6s+). The time-windowed approach preserves the session-ID mechanism for crash detection while eliminating the reconnection loop.

Candidate buffering (previously in this PR) has been removed per reviewer feedback — the race condition it protected against is practically impossible given the timing constraints.

Fixes: netbirdio#3669
Related: netbirdio#2703
Related: netbirdio#3339
…ssion ID handling

- Add pendingCandidates buffer for remote candidates arriving before ICE agent initialization (bounded to 50 entries)
- Route candidateViaRoutes check before buffering so routed candidates are filtered identically regardless of arrival timing
- Extract addRemoteCandidate helper used by both live and buffered paths ensuring shouldAddExtraCandidate logic runs for all candidates
- Clear stale session ID (set to "") when NewICESessionID() fails in handleExistingAgent and closeAgent instead of keeping previous value
- Clear pendingCandidates on agent close to prevent stale data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
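For the "clear stale session ID on failure" bullet above, a small hedged sketch of the intent: the ID generator and struct here are stand-ins for `NewICESessionID()` and the worker, and only the assign-on-success / clear-on-failure pattern is the point.

```go
package peer

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newSessionID is a stand-in for NewICESessionID(); the exact ID format does
// not matter for this sketch.
func newSessionID() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

type sessionHolder struct {
	sessionID string
}

// rotateSessionID only stores a new ID on success. On failure it clears the
// previous value so nothing downstream keeps comparing against a stale ID
// that no longer corresponds to a live agent.
func (s *sessionHolder) rotateSessionID() error {
	id, err := newSessionID()
	if err != nil {
		s.sessionID = "" // do not keep the stale ID
		return fmt.Errorf("generate ICE session ID: %w", err)
	}
	s.sessionID = id
	return nil
}
```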
…OnRemoteCandidate synchronously

Per pappz's review on PR netbirdio#5805:

1. Remove the candidate buffer entirely. The race between OnRemoteCandidate and OnNewOffer requires the remote peer to complete STUN gathering and signal a candidate back in microseconds (channel read time), which is far faster than the tens-to-hundreds of milliseconds of network I/O needed for candidate discovery. In practice this almost never happens.
2. Remove the 'go' prefix from conn.OnRemoteCandidate() in engine.go so candidates are processed synchronously in the signaling receive loop. This is the correct fix for the theoretical race: candidates now arrive in the same goroutine as offers, eliminating the ordering issue without needing a buffer.

The agentLoopProtectionWindow guard (3s) is kept as-is since it correctly handles both the reconnection loop scenario (same-NAT peers) and crash recovery (new session ID accepted after the window expires).
Force-pushed from d4d191b to 90b59d9 (Compare)
@pappz Thank you for the thorough review and the back-and-forth on this PR. Your feedback shaped the implementation considerably — dropping the candidate buffer, replacing the unconditional session-ID skip with a time-windowed guard, and pushing for proper repro evidence all made the change much more defensible than what I originally opened. After today's comprehensive re-testing on real hardware, I'm closing this PR. Here's the analysis.

Test setup (2026-04-25)
Six Cudy TR3000 routers (OpenWrt 25.12.2-FWR, aarch64, kmod-wireguard / userspace WG) joined into one NetBird group with an …

Three rounds, identical setup, only the binary swapped.

Results
ICE Connected was 0 in all rounds (UDP fully blocked, all paths relayed — that's by design; the test isolates the reconnect-loop dynamics, not the relay/p2p selection).

What this means

Closing rationale
Given that:
it doesn't make sense to keep this open as a candidate for merge. If the original lockstep pattern resurfaces on a future build, the right starting point will be a fresh investigation against then-current …

For the lab repro you asked about
The reason artificial latency injection didn't reproduce the loop in your lab is that the trigger isn't latency — it's complete UDP loss between the two peer LAN endpoints (so all ICE candidates fail FailedTimeout) combined with both peers being on the same public NAT (so neither side ever falls back into a "different routes" branch that would break symmetry). Reproduction recipe for posterity:

```sh
# On both peers in the same LAN, behind the same NAT:
nft add table inet ice_test
nft 'add chain inet ice_test input { type filter hook input priority -100; }'
nft 'add chain inet ice_test output { type filter hook output priority -100; }'
nft "add rule inet ice_test input ip saddr <other_peer_lan_ip> meta l4proto udp drop"
nft "add rule inet ice_test output ip daddr <other_peer_lan_ip> meta l4proto udp drop"
# Disable IPv6 discovery so it doesn't paper over the v4 failure:
sed -i 's|"DisableIPv6Discovery": false|"DisableIPv6Discovery": true|' /etc/netbird/config.json
```

On stock 0.67.4 you'll see lockstep … Thanks again for sticking with this — both for the technical pushback that made me prove the claims properly, and for getting #5820 across the line.

— Michael



Summary
- Skip new offers while `agentConnecting == true` to let the current ICE attempt complete.
- Add a `pendingCandidates` buffer that collects early candidates and flushes them once the agent is ready.

Both fixes are tightly coupled — without the buffer, the guard loop fix alone would still miss early candidates.
Checklist
By submitting this pull request, I confirm that I have read and agree to the terms of the Contributor License Agreement.
Related Issues
Fixes #3669 — Unable to connect to netbird peer on different LAN with the same public IP
Fixes #5672 — ICE Agent is not initialized yet on android app
Related #2703 — Multiple peers behind NAT getting relayed
Related #3339 — Same LAN cannot successfully P2P
Related #4031 — No P2P/direct connection, only relayed
Related #4225 — Netbird Self-Hosted P2P not working
Related #5670 — macOS ICE agent never initializes (arm64)
Summary by CodeRabbit