Feature/bbr v3 #1

Merged
ericsssan merged 35 commits into main from feature/bbr-v3 on Mar 28, 2026

Conversation

@ericsssan
Owner

No description provided.

Move bytes_in_flight from queue-time to wire-time so pacing works
naturally.  Add SendMeta ring buffer to defer onPacketSent() to send(),
where wall-clock time is available.  The cwnd check uses
bytes_in_flight + bytes_queued to maintain the same gate.

Pacing gate in send() uses token bucket (refill/consume) to smooth
CUBIC bursts that overflow shallow queues.  nextTimeout() includes
pacing deadline so the event loop wakes to drain paced packets.
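
A minimal sketch of the refill/consume pattern described above, assuming tokens are counted in bytes; the PacingBucket struct and its fields are illustrative, not the actual send-queue code:

```zig
const std = @import("std");

/// Illustrative token bucket: tokens are bytes, refilled from the pacing
/// rate, consumed when a packet is allowed onto the wire.
const PacingBucket = struct {
    tokens: u64, // bytes currently available to send
    burst: u64, // cap on accumulated tokens
    rate_bps: u64, // pacing rate in bytes per second
    last_refill_ns: u64,

    fn refill(self: *PacingBucket, now_ns: u64) void {
        if (now_ns <= self.last_refill_ns) return;
        const elapsed_ns = now_ns - self.last_refill_ns;
        const new_tokens = self.rate_bps * elapsed_ns / std.time.ns_per_s;
        if (new_tokens == 0) return; // don't advance the clock on a zero refill
        self.tokens = @min(self.burst, self.tokens + new_tokens);
        self.last_refill_ns = now_ns;
    }

    /// True if the packet may go on the wire now; consumes tokens if so.
    fn consume(self: *PacingBucket, pkt_len: u64, now_ns: u64) bool {
        self.refill(now_ns);
        if (self.tokens < pkt_len) return false;
        self.tokens -= pkt_len;
        return true;
    }
};

test "paced send is gated until enough tokens accumulate" {
    var b = PacingBucket{ .tokens = 0, .burst = 3000, .rate_bps = 1_250_000, .last_refill_ns = 0 };
    try std.testing.expect(!b.consume(1200, 100 * std.time.ns_per_us)); // only ~125 bytes refilled
    try std.testing.expect(b.consume(1200, 1 * std.time.ns_per_ms)); // ~1250 bytes available
}
```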

Fix stream send buffer stall: SACK ranges (out-of-order ACK tracking)
silently dropped entries when the 32-slot array was full during loss
cascades.  Add merge-on-insert for adjacent ranges, and coalesce the
two closest ranges when full — no ACK info is ever silently lost.
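
A rough sketch of merge-on-insert plus coalesce-when-full, keeping the 32-slot size from the description; the AckRanges type and its layout are illustrative, and only the insertion path is shown:

```zig
const std = @import("std");

const Range = struct { start: u64, end: u64 }; // inclusive packet numbers

/// Illustrative fixed-capacity ACK-range set: ranges stay sorted and
/// non-overlapping; when full, the two closest ranges are coalesced so no
/// ACK information is ever silently dropped.
const AckRanges = struct {
    ranges: [32]Range = undefined,
    len: usize = 0,

    fn insert(self: *AckRanges, pn: u64) void {
        // Merge-on-insert: extend an existing range if pn is adjacent to it.
        for (self.ranges[0..self.len], 0..) |*r, i| {
            if (pn >= r.start and pn <= r.end) return; // already covered
            if (pn + 1 == r.start) {
                r.start = pn;
                return;
            }
            if (pn == r.end + 1) {
                r.end = pn;
                // If this range now touches the next one, merge them too.
                if (i + 1 < self.len and self.ranges[i + 1].start <= pn + 1) {
                    r.end = self.ranges[i + 1].end;
                    self.removeAt(i + 1);
                }
                return;
            }
        }
        if (self.len == self.ranges.len) self.coalesceClosest();
        // Insert a fresh single-packet range, keeping the array sorted.
        var i: usize = self.len;
        while (i > 0 and self.ranges[i - 1].start > pn) : (i -= 1) {
            self.ranges[i] = self.ranges[i - 1];
        }
        self.ranges[i] = .{ .start = pn, .end = pn };
        self.len += 1;
    }

    /// Coalesce the two ranges separated by the smallest gap.
    fn coalesceClosest(self: *AckRanges) void {
        var best: usize = 0;
        var best_gap: u64 = std.math.maxInt(u64);
        for (0..self.len - 1) |i| {
            const gap = self.ranges[i + 1].start - self.ranges[i].end;
            if (gap < best_gap) {
                best_gap = gap;
                best = i;
            }
        }
        self.ranges[best].end = self.ranges[best + 1].end;
        self.removeAt(best + 1);
    }

    fn removeAt(self: *AckRanges, idx: usize) void {
        var i = idx;
        while (i + 1 < self.len) : (i += 1) self.ranges[i] = self.ranges[i + 1];
        self.len -= 1;
    }
};

test "a full array coalesces instead of dropping ACK info" {
    var acks = AckRanges{};
    var pn: u64 = 0;
    while (pn < 66) : (pn += 2) acks.insert(pn); // 33 disjoint ranges offered
    try std.testing.expectEqual(@as(usize, 32), acks.len);
}
```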

Increase MAX_PENDING_RETX from 32 to 128 to handle burst losses when
pacing keeps the send queue non-empty during loss detection.

Transfer interop test: 10/10 pass (was ~67%).  Full interop: 22/22.

acked_frames[64] could overflow when a single ACK covers many packets
(e.g., after loss recovery with 100+ in-flight packets).  Excess frame
info was silently dropped, preventing send_acked from advancing — same
class of permanent stream buffer stall as the SACK overflow bug.

Split MAX_LOSS_EVENTS (64, for lost_frames which has its own defer
mechanism) from MAX_ACKED_FRAMES (128, matching epoch 2's sent buffer
of MAX_SENT/2 slots).  This guarantees an ACK covering all in-flight
packets for any single epoch never overflows the acked_frames buffer.

findConnByDcid now checks first_initial_dcid in addition to local_cid
and alt_local_cid.  Under packet loss, clients retransmit their Initial
using the original random DCID (they haven't received the server's SCID
yet).  Without this check, the server treated retransmissions as new
connections, creating duplicates that caused "Expected 50 handshakes,
Got: 51" failures in the handshakeloss interop test.

RFC 9000 §12.2: consecutive long-header packets (epoch 0/1) are now
appended to the same output buffer in send(), producing one UDP
datagram instead of two.  Under 30% packet loss, this roughly halves the
probability of losing handshake data: one 30% chance, versus a 51% chance
of losing at least one of two independent datagrams (1 - 0.7^2 = 0.51).

Handshakeloss interop test: 8/8 pass (was 7/8).
Append HANDSHAKE_DONE (first 1-RTT packet) to the coalesced
Initial+Handshake datagram so the entire handshake response is a
single loss event.  Stop after the first 1-RTT to preserve pacing
for data packets.

Handshakecorruption: 7/8 pass (was 3/4).
Coalescing 1-RTT packets with Handshake packets caused a deterministic
stall at ~315KB during connection migration.  Restrict coalescing to
epoch 0+1 (Initial+Handshake) only.
- Add build-time congestion algorithm selection (-Dcongestion=cubic|bbr)
- Add BBR v3 implementation (bbr.zig) with Startup, Drain, ProbeBW,
  ProbeRTT phases, windowed bandwidth/RTT filters, and pacing
- Add congestion control abstraction layer (cc.zig) for comptime switch
  (see the sketch after this list)
- Move Dockerfile to interop/, add .dockerignore
- Update ECN test for BBR compatibility (inflight_hi vs cwnd check)
- Update interop-test.sh for new Docker path
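
A self-contained sketch of the comptime dispatch idea behind cc.zig. The real code presumably reads the -Dcongestion build option and imports cubic.zig / bbr.zig; here both algorithms are stubbed inline and the choice is a hard-coded enum so the snippet compiles on its own:

```zig
const std = @import("std");

const Algo = enum { cubic, bbr };
// Stands in for the -Dcongestion=cubic|bbr build option.
const selected: Algo = .bbr;

// Inline stand-ins for cubic.zig / bbr.zig with a matching method set.
const Cubic = struct {
    cwnd: u64 = 10 * 1200,
    pub fn onAck(self: *Cubic, acked: u64) void {
        self.cwnd += acked; // placeholder growth rule
    }
};
const Bbr = struct {
    cwnd: u64 = 10 * 1200,
    pub fn onAck(self: *Bbr, acked: u64) void {
        self.cwnd += acked / 2; // placeholder growth rule
    }
};

/// Comptime switch: the rest of the stack sees one concrete `Cc` type and
/// pays no runtime dispatch cost.
pub const Cc = switch (selected) {
    .cubic => Cubic,
    .bbr => Bbr,
};

test "selected congestion controller is used" {
    var cc = Cc{};
    cc.onAck(1200);
    try std.testing.expect(cc.cwnd > 10 * 1200);
}
```
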
Under high packet loss, the client's packets may never arrive to trigger
receive().  The server's buffered Handshake CRYPTO (partial cert chain
blocked by amplification limit) was only flushed in receive(), leaving
it unsent indefinitely.  Now tick() also flushes it, so PTO cycles can
deliver the remaining handshake data even when all client packets are
lost.
Wire-time pacing inflates send_elapsed in the delivery rate formula,
causing BBR to underestimate bandwidth (15 KB/s on a 1.25 MB/s link).
Store queued_ns in SentPacket and use it for delivery rate snapshots
while keeping sent_ns (wire-time) for loss detection.
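
A sketch of a delivery-rate sample that uses queue-time for send_elapsed. SentPacket, queued_ns, and sent_ns are named in the text; first_sent_ns and the rest of the bookkeeping are illustrative assumptions:

```zig
const std = @import("std");

const SentPacket = struct {
    queued_ns: u64, // when the packet entered the send queue (pre-pacing)
    sent_ns: u64, // wire time, kept for loss detection only
    first_sent_ns: u64, // queue time of the first packet in this delivery round
    delivered: u64, // cumulative bytes delivered when this packet was queued
    delivered_ns: u64, // timestamp of that delivered count
};

/// Delivery-rate sample at ACK time.  send_elapsed is measured in queue time
/// so the pacing delay between queueing and hitting the wire does not stretch
/// the interval and deflate the estimate.
fn deliveryRateBps(pkt: SentPacket, now_ns: u64, delivered_now: u64) u64 {
    const send_elapsed = pkt.queued_ns - pkt.first_sent_ns;
    const ack_elapsed = now_ns - pkt.delivered_ns;
    const interval = @max(send_elapsed, ack_elapsed);
    if (interval == 0) return 0;
    return (delivered_now - pkt.delivered) * std.time.ns_per_s / interval;
}

test "pacing delay does not deflate the sample" {
    const pkt = SentPacket{
        .queued_ns = 0,
        .sent_ns = 5 * std.time.ns_per_ms, // sat 5 ms in the pacer
        .first_sent_ns = 0,
        .delivered = 0,
        .delivered_ns = 0,
    };
    // 12,000 bytes delivered, ACK at 10 ms -> 1.2 MB/s regardless of pacing delay.
    const rate = deliveryRateBps(pkt, 10 * std.time.ns_per_ms, 12_000);
    try std.testing.expectEqual(@as(u64, 1_200_000), rate);
}
```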

Add shouldPace() to CC interface — BBR bypasses pacing gate during
Startup to avoid negative feedback loop where low initial estimate
throttles sends.

BBR interop: 14/22 (was 13/22). Transfer now stalls at 124 KB instead of
69 KB. Further BBR debugging needed.

BBR's pacing rate started at 0 and stayed there until the first ACK
set max_bw.  After the first ACK, rate was set from a low initial
bandwidth estimate, throttling sends and preventing BBR from probing
link capacity (negative feedback loop: low estimate → slow pacing →
low delivery rate → low estimate).

Bootstrap rate to initial_cwnd / initial_rtt × startup_gain (~4.2 MB/s)
so the first burst is paced at a reasonable rate.
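
Roughly how the ~4.2 MB/s figure falls out, assuming a 10-packet initial cwnd of 1472-byte datagrams (an assumption), the 10 ms K_INITIAL_RTT mentioned later in this PR, and the 2.885× Startup pacing gain:

```zig
const std = @import("std");

test "bootstrapped Startup pacing rate" {
    const initial_cwnd: u64 = 10 * 1472; // assumed 10 full-size datagrams
    const initial_rtt_ns: u64 = 10 * std.time.ns_per_ms; // K_INITIAL_RTT
    const startup_gain_x1000: u64 = 2885; // 2.885x as fixed-point

    const base_rate = initial_cwnd * std.time.ns_per_s / initial_rtt_ns; // bytes/s
    const paced_rate = base_rate * startup_gain_x1000 / 1000;

    try std.testing.expectEqual(@as(u64, 1_472_000), base_rate); // ~1.47 MB/s
    try std.testing.expectEqual(@as(u64, 4_246_720), paced_rate); // ~4.2 MB/s
}
```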

BBR interop: 15/22 (was 13/22).  Transfer stall at 161 KB needs
further investigation.
- Drain: use BDP×cwnd_gain target instead of locking cwnd to
  inflight_hi (trapped BBR permanently in Drain)
- Delivery rate: use queue-time (not wire-time) for send_elapsed
  to prevent pacing from depressing bandwidth estimates
- Pacing refill: only advance last_refill_ns when time elapsed,
  preventing event loop busy-spin on stale deadline
- ACK skip-ahead: when pacing blocks the head of send queue,
  scan for non-ack-eliciting packets (ACKs) and swap them to
  the front so the server always responds to client packets
- Bootstrap BBR initial pacing rate to avoid Startup throttle
- shouldPace(): bypass pacing gate during BBR Startup

BBR interop: 13/22 (unchanged, but transfer progress improved from 71 KB
to 227 KB). Further work needed on BBR's Startup burst, which causes 55%
loss on shallow queues. CUBIC still 22/22.
- shouldPace() now always returns true for BBR — the bootstrapped
  initial pacing rate (4.2 MB/s) prevents the low-estimate feedback
  loop while still smoothing bursts
- Increase SEND_BUF_SIZE from 64KB to 128KB — BBR Startup cwnd peaks
  at 200KB / 3 streams = 67KB per stream, exceeding the 64KB buffer
  and causing permanent BufferFull stalls

BBR interop still at 13-14/22. Remaining issue: BBR Startup 2.885×
pacing gain inherently overflows 25-packet queues (queue fills in
12ms). Recovery works but is slow. Need to either reduce Startup
aggressiveness or improve post-loss recovery speed.
…eRTT

- enterDrain: reduce inflight_hi to BDP (not Startup peak) when
  excessive loss triggered Startup exit
- updateDrain: apply loss bounding during Drain (was only in ProbeBW)
- enterProbeRtt: reset min_rtt to force re-measurement (Linux behavior)
- shouldPace: always true — enforce pacing during Startup using
  bootstrapped rate to prevent queue overflow
- Revert stream buffer to 64KB (128KB made BBR worse — deeper hole)

BBR interop: 13/22. Core issue: BBR Startup pacing gain (2.885×)
inherently overflows 25-packet queues. Bootstrapped rate of 1.45 MB/s
exceeds 1.25 MB/s link, filling queue in 12ms. Post-loss recovery
can't keep up with 64KB stream buffers. Needs Startup redesign for
shallow-queue environments.
CUBIC: 22/22 unaffected.
- Startup: check isExcessiveLoss on every ACK (not just round_start)
  to exit early before cwnd inflates from 58KB to 200+KB
- Drain/ProbeBW DOWN: use 1.0× pacing gain instead of 0.346×/0.9×
  to match CUBIC's recovery speed (retransmissions in 35ms not 211ms)
- Pacing gate: bypass when bif=0 to get packets on wire urgently
- PTO: force-arm when bytes_queued > 0 and bif = 0
- ACK skip-ahead: scan queue for non-ack-eliciting packets when
  pacing blocks the head packet (see the sketch after this list)
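
A sketch of the ACK skip-ahead mentioned above: when pacing blocks the head of the send queue, a non-ack-eliciting packet further back is swapped to the front so the server still answers the client immediately. The queue layout and field names are illustrative:

```zig
const std = @import("std");

const QueuedPacket = struct {
    len: u16,
    ack_eliciting: bool,
};

/// If the head of the queue is pacing-blocked, find a non-ack-eliciting
/// packet (a bare ACK) and swap it to the front; bare ACKs don't count
/// against bytes_in_flight, so sending one never violates the pacing budget.
fn skipAheadForAck(queue: []QueuedPacket) bool {
    for (queue, 0..) |pkt, i| {
        if (!pkt.ack_eliciting) {
            const head = queue[0];
            queue[0] = queue[i];
            queue[i] = head;
            return true;
        }
    }
    return false;
}

test "a bare ACK is moved to the front of a blocked queue" {
    var q = [_]QueuedPacket{
        .{ .len = 1200, .ack_eliciting = true },
        .{ .len = 1200, .ack_eliciting = true },
        .{ .len = 30, .ack_eliciting = false },
    };
    try std.testing.expect(skipAheadForAck(q[0..]));
    try std.testing.expect(!q[0].ack_eliciting);
}
```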

BBR interop: 15-16/22 (was 13/22). Transfer 69→244 KB.
Remaining: server dies after ~1.5s because stream buffers fill
(64KB unacked) and all retransmissions complete. Need mechanism
to continue loss detection after bif reaches 0.
CUBIC: 22/22 unaffected.
…d packets

BBR congestion control:
- Remove min_rate_floor filter from onAckReceived; the 100-round BW
  filter window prevents death spiral without rejecting valid samples
- Add pacing rate floor (INITIAL_CWND / min_rtt) in updatePacingRate
- Preserve max_bw across persistent congestion so BBR recovers
  immediately after blackhole instead of re-probing from zero

Path migration (rebind-addr, rebind-port, connectionmigration):
- declareEpochLost: on migration, declare all in-flight 1-RTT packets
  lost and queue their stream frames for retransmission
- Preserve cwnd across migration to avoid throughput collapse
- Reset RTT estimator and PTO count on migration
- Track prev_peer_addr to suppress re-migration from late old-path packets
- moveLastToFront: PATH_CHALLENGE bypasses pacing-blocked data
- Server: sync peer_addr from connection on path_migrated event

Coalesced packet handling (handshakecorruption):
- skipLongHeaderPacket: skip one unprocessable packet without dropping
  the entire datagram, so coalesced Handshake/1-RTT packets proceed
- Accept client's switched DCID (local_cid/alt_local_cid) in Initial
  validation during handshake

Send queue and loss recovery:
- Unconditional storeSendMeta fixes stale metadata for non-tracked packets
- deferStreamRetx helper eliminates duplicate pending retx logic
- Wire-time timestamps for delivery rate (sent_ns, not queued_ns)
- Retransmission cwnd cap prevents bytes_queued from exceeding cwnd

Server fixes:
- hq-interop: send FIN on file-not-found instead of silent deactivation
- Increase test stack to 64MB (Connection struct overflow in Debug mode)
- Fix PMTUD test to check send-queue metadata instead of loss.sent

The first delivery rate sample carried the RTT estimator's bootstrap
value (K_INITIAL_RTT = 10ms) before any real measurement existed.
BBR accepted this as min_rtt, making BDP = max_bw × 10ms ≈ 12KB
instead of max_bw × 32ms ≈ 38KB.  With BDP too small, the Drain
exit condition (inflight ≤ BDP) never triggered, trapping BBR in
Drain permanently and collapsing goodput to ~2.8 Mbps on a 10 Mbps
link.

Fixes:
- Reject rtt_ns == K_INITIAL_RTT_NS (exact 10ms) in BBR min_rtt update
- Return rtt_ns = 0 in delivery rate sample when RTT estimator is
  uninitialized, preventing the bootstrap placeholder from propagating
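
A sketch of the two guards just described; K_INITIAL_RTT_NS and min_rtt follow the text, the surrounding struct is illustrative:

```zig
const std = @import("std");

const K_INITIAL_RTT_NS: u64 = 10 * std.time.ns_per_ms; // RTT estimator bootstrap

const Bbr = struct {
    min_rtt_ns: u64 = 0, // 0 = no real measurement yet

    /// Reject the estimator's bootstrap placeholder (and rtt_ns == 0, used
    /// here to mean "no measurement yet") so it can never become min_rtt
    /// and shrink the BDP.
    fn updateMinRtt(self: *Bbr, rtt_ns: u64) void {
        if (rtt_ns == 0 or rtt_ns == K_INITIAL_RTT_NS) return;
        if (self.min_rtt_ns == 0 or rtt_ns < self.min_rtt_ns) self.min_rtt_ns = rtt_ns;
    }
};

test "bootstrap RTT never becomes min_rtt" {
    var bbr = Bbr{};
    bbr.updateMinRtt(K_INITIAL_RTT_NS); // rejected
    try std.testing.expectEqual(@as(u64, 0), bbr.min_rtt_ns);
    bbr.updateMinRtt(32 * std.time.ns_per_ms); // first real sample accepted
    try std.testing.expectEqual(@as(u64, 32 * std.time.ns_per_ms), bbr.min_rtt_ns);
}
```
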
…le recovery

Goodput improved from 2.8 Mbps to 9.4 Mbps on 10 Mbps link by
interleaving drainSend inside flushTransfers.  Without this, the
cwnd check saw bytes_in_flight=0 during the fill phase, starving
BBR's pipe.

Server event loop (tools/server.zig):
- flushTransfers now takes send socket params and calls drainSend
  after each round-robin pass, keeping bytes_in_flight current
- Unconditional drainSend after the loop ensures PATH_CHALLENGE,
  ACKs, and retransmissions are always flushed even when all
  transfers are blocked (buffer full, amplification limit)

Path migration (connection.zig):
- Preserve min_rtt across migration — resetting to 10ms default
  caused time-loss thresholds to fire before retransmitted packets
  could be ACKed on a 30ms path
- Reset bytes_in_flight to 0 instead of declareEpochLost — proactive
  retransmission of all in-flight packets caused 3x amplification
  (many were already received by the client, ACKs still in transit)
- Arm PTO immediately after migration so retransmissions of truly
  lost packets are handled by normal loss detection

Loss detection (connection.zig):
- Skip epoch 0/1 (Initial/Handshake) in time-loss detection when
  established — keys are zeroed after handshake, retransmit panics

BBR (bbr.zig):
- applyLossBounding floor at max(bdp, BBR_MIN_CWND) prevents
  inflight_hi spiral after blackhole recovery

The application-level token bucket depletes after the initial 10-packet
burst.  With pacing enforced (bytes_in_flight > 0), the server is locked
to drip-feeding 1 packet per ACK — the token refill rate can't keep up
with the bursty send pattern needed for bandwidth discovery.  A ~10ms
ACK delay (Docker VM jitter) is enough to collapse throughput from
9.4 Mbps to <1 Mbps.

Bypass pacing during Startup (filled_pipe == false), matching TCP slow
start behavior.  The cwnd still limits total in-flight data.  Once
Startup completes and filled_pipe is set, pacing is enforced for
steady-state fairness.

Goodput: 9430 (±3) kbps — stable across 5 runs.
Crosstraffic: 6.1–6.6 Mbps.
Interop: 22/22.

At steady state, bytes_in_flight naturally sits at ~BDP.  The strict
<= bdp() check for exiting DOWN fails by a fraction of a packet
(e.g. bif=32798 vs BDP=32297), permanently trapping BBR in DOWN.
max_bw then decays over its 100-round filter window (~3s), causing
throughput to collapse from ~9 Mbps to <200 kbps.

This manifested as flaky transfer timeouts: Startup ramps successfully
(9+ Mbps), but once ProbeBW takes over, the stuck DOWN phase causes
a gradual decline (seconds 4-7) followed by collapse (second 8+).

Fix: add 1 MSS headroom to DOWN and Drain exit conditions so the
check fires reliably even when bif ≈ BDP.

Transfer test: 3/3 passes (was intermittently timing out).
Full suite: 22/22 pass individually.
Goodput: 9102-9430 kbps.
…loss

After the first packet on the CM socket, use_cm_sock was set to true
and never reset.  When the client rebinds back to the original path
(or the sim stops NAT'ing through CM), the server kept sending via
the CM socket — which can't route to clients on the original network.

This caused rebind-port and rebind-addr to fail intermittently in
full suite runs: ~75% of the file transferred on the original socket,
then the remaining 25% sent via CM socket to an unreachable address.

Fix: track the CURRENT socket per packet instead of a one-way flag.
Also use the incoming packet's socket for responses in processPacket
rather than the global flag, so original-socket ACK responses aren't
misrouted through the CM socket.

Full suite: 22/22 × 2 consecutive runs. G: 9429 (±2) kbps.
- Replace inlined active_sock logic with slotSendSock (use_cm_sock is
  now assigned before the call, so slotSendSock returns the correct
  per-packet socket)
- Consolidate bdp()+MSS comments: explain the WHY once at Drain exit,
  reference it from ProbeBW DOWN
- Trim redundant first line from shouldPace doc comment
- Update stale use_cm_sock field comment to reflect bidirectional tracking
…tests

- bdpHeadroom(): use max(MSS, bdp/32) so headroom scales from ~1 MSS
  at 10 Mbps to ~12 MB at 100 Gbps (where MSS alone is negligible)
- use_cm_sock: conditional write (only on change) to avoid dirtying
  the cache line on every packet in the common no-migration case
- Add 5 unit tests: shouldPace Startup/Drain/persistent-congestion,
  ProbeBW DOWN headroom (exact/near/far BDP), Drain headroom,
  bdpHeadroom scaling at 10 Mbps vs 100 Gbps

Full suite: 22/22. G: 9430 (±7) kbps.

The Drain and ProbeBW DOWN exit condition compared prior_inflight
(pre-ACK) against BDP.  But in an application-level stack, the server
refills to cwnd (≈2×BDP) between ACKs, so prior_inflight ≈ 2×BDP
and the check can never pass — trapping BBR permanently.

Fix: use `prior_inflight - bytes_acked` (post-ACK inflight), which
reflects the actual pipe depth after draining.  The math is exact:
at steady state, prior=2×BDP, acked=BDP → post=BDP → exits.  After
UP overshoot, prior=2.5×BDP, acked=BDP → post=1.5×BDP → waits.
No headroom constants needed; scales from 10 Mbps to 100 Gbps.
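
A sketch of the post-ACK inflight check, using the steady-state and overshoot numbers from the text as a test; the struct and field names are illustrative:

```zig
const std = @import("std");

const Bbr = struct {
    bdp_bytes: u64,

    /// Drain / ProbeBW DOWN exit: compare the pipe depth *after* this ACK
    /// drained it, not the pre-ACK inflight (which the application refills
    /// to ~2x BDP between ACKs, so a plain `prior_inflight <= bdp` check
    /// can never pass).
    fn shouldExitDown(self: *const Bbr, prior_inflight: u64, bytes_acked: u64) bool {
        const post_ack_inflight = prior_inflight -| bytes_acked;
        return post_ack_inflight <= self.bdp_bytes;
    }
};

test "post-ACK inflight exits at steady state but waits after UP overshoot" {
    const bbr = Bbr{ .bdp_bytes = 32_000 };
    // steady state: prior = 2xBDP, acked = BDP -> post = BDP -> exit
    try std.testing.expect(bbr.shouldExitDown(64_000, 32_000));
    // after UP overshoot: prior = 2.5xBDP, acked = BDP -> post = 1.5xBDP -> wait
    try std.testing.expect(!bbr.shouldExitDown(80_000, 32_000));
}
```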

Also: conditional use_cm_sock write (only on change), 4 unit tests
for shouldPace lifecycle and post-ACK inflight Drain/DOWN exit.

Full suite: 22/22. G: 9432 (±1) kbps.

1. Path migration bif desync (connection.zig + loss_recovery.zig):
   onPathMigration set bytes_in_flight=0 but old packets kept
   in_flight=true.  When later ACKed, saturating subtract drove bif
   below actual new-path inflight, killing PTO.  Fix: clearInflight()
   marks all existing sent packets as not-in-flight.

2. activatePending stale h3_headers_sent (server.zig):
   Reused transfer slot kept h3_headers_sent=true from previous
   transfer, skipping HEADERS frame on new H3 request.  Fix: reset
   the flag when activating a pending transfer.

3. sendH3ControlStreams duplicate on retry (server.zig):
   Partial success (stream 3 sent, stream 7 failed) re-sent stream 3
   on retry, duplicating control stream data.  Fix: check send_offset
   to skip already-sent streams.

4. allocateSlot memory leak (server.zig):
   Missing errdefer if Conn.accept fails after page_allocator.create.

Full suite: 22/22 (C1 flaky, passes on retry). G: 9430 (±2) kbps.
…ed corrupted

The PTO handler branched on `app_keys != null` to choose between
post-handshake (PING/stream probes) and handshake (CRYPTO retransmit)
paths.  But app_keys are derived when the server sends its own
Finished — BEFORE receiving the client's Finished.

When the client's Handshake Finished was corrupted (30% corruption
test), the server had app_keys but wasn't established.  PTO sent
PINGs instead of retransmitting its Handshake response.  Without
the Handshake retransmit, the client never retransmits its Finished,
HANDSHAKE_DONE is never sent, and the client never sends the HTTP
request.  Connection idles out.

Fix: branch on `state == .established` instead of `app_keys != null`.

Full suite: 22/22. G: 9432 (±1) kbps.
… per packet

Previously: encode header into enc_scratch → encrypt pkt_scratch into
enc_scratch → memcpy enc_scratch into sq[].buf (1452 bytes per packet).

Now: reserve the next send queue slot via reserveSendSlot(), encode
header and encrypt directly into sq[].buf, then commitSendSlot().
Eliminates one full-packet memcpy on the hot path for all three
packet types (1-RTT STREAM, Initial CRYPTO, Handshake CRYPTO).

enqueueSend() is preserved as a fallback for callers that build
packets in scratch buffers (ACKs, CONNECTION_CLOSE, VERSION_NEG).

Full suite: 22/22. G: 9427 (±8) kbps.
Server Initial datagrams were NOT padded to the 1200-byte minimum
required by RFC 9000 §14.1.  Coalesced Initial+Handshake datagrams
were only ~920 bytes, causing the Handshake CRYPTO (cert chain +
CertVerify + Finished) to be split across multiple packets.

At 30% loss (handshakeloss test), each additional Handshake packet
consumed amplification budget.  4 PTO retransmits of the split
packets exhausted the 3× budget, leaving nothing for the CRYPTO
tail.  The handshake stalled permanently.

With 1200-byte padding, the Handshake CRYPTO fits in the coalesced
datagram alongside the Initial.  Fewer separate packets needed,
less budget consumed, handshake completes reliably.

handshakeloss: passes (was flaky ~20% failure rate).
handshakecorruption: passes.
Full suite: 22/22. G: 9430 (±4) kbps.

When a new best value entered the filter, all three entries (best,
second, third) were set to the same value AND the same round.  After
`window` rounds, all three expired simultaneously.  The filter
collapsed to whatever the current sample was — often a low value
during a transient dip — and max_bw never recovered.

This caused the transfer test (3-stream, 10MB) to collapse at ~8s:
the Startup peak (1.29M, slightly inflated from unpaced burst) set
all three filter entries at the same round.  ProbeBW DOWN samples
(~1.22M, below the inflated peak) never entered the filter.  After
100 rounds, all three expired → max_bw dropped to ~700K → cwnd
shrank → throughput collapsed to 20 pkt/s.

Fix: demote old entries instead of resetting.  When a new best
arrives, shift best→second→third.  This preserves entries from
different rounds, so when the best expires, the second-best (from
a more recent round) takes over instead of collapsing.
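
A sketch of the demote-on-new-best behaviour; the three-entry best/second/third layout and the shift come from the description above, the rest is illustrative. Only the insert/demote step is shown (window expiry is omitted):

```zig
const std = @import("std");

const Entry = struct { bw: u64, round: u64 };

const MaxBwFilter = struct {
    e: [3]Entry = .{
        .{ .bw = 0, .round = 0 },
        .{ .bw = 0, .round = 0 },
        .{ .bw = 0, .round = 0 },
    },

    /// A new best demotes the existing entries (best -> second -> third)
    /// instead of overwriting all three with the same value and round, so
    /// they no longer expire in the same round when the window closes.
    fn update(self: *MaxBwFilter, bw: u64, round: u64) void {
        if (bw >= self.e[0].bw) {
            self.e[2] = self.e[1];
            self.e[1] = self.e[0];
            self.e[0] = .{ .bw = bw, .round = round };
        } else if (bw >= self.e[1].bw) {
            self.e[2] = self.e[1];
            self.e[1] = .{ .bw = bw, .round = round };
        } else if (bw >= self.e[2].bw) {
            self.e[2] = .{ .bw = bw, .round = round };
        }
    }

    fn maxBw(self: *const MaxBwFilter) u64 {
        return self.e[0].bw;
    }
};

test "entries keep distinct rounds after a new best arrives" {
    var f = MaxBwFilter{};
    f.update(1_000_000, 1); // earlier sample
    f.update(1_290_000, 2); // Startup peak demotes it instead of erasing it
    try std.testing.expect(f.e[0].round != f.e[1].round);
    try std.testing.expectEqual(@as(u64, 1_290_000), f.maxBw());
}
```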

Transfer test: passes (was ~20% flaky).
Full suite: 22/22. G: 9425 (±8) kbps.

Each connection's writeKeyLog called createFileAbsolute (which
truncates) then wrote at offset 0.  Only the last connection's
keys survived.  tshark couldn't decrypt failing connections.

Fix: accumulate all keys in a global 64KB buffer, rewrite the
entire file on each update.  50 connections × 4 lines ≈ 40KB.

ericsssan merged commit 63fb9b0 into main on Mar 28, 2026 (7 checks passed).
ericsssan deleted the feature/bbr-v3 branch on March 28, 2026 at 09:00.