Feature/bbr v3 #1

Merged
ericsssan merged 35 commits into main from feature/bbr-v3 on Mar 28, 2026

Conversation

@ericsssan
Owner

No description provided.

Move bytes_in_flight from queue-time to wire-time so pacing works
naturally.  Add SendMeta ring buffer to defer onPacketSent() to send(),
where wall-clock time is available.  The cwnd check uses
bytes_in_flight + bytes_queued to maintain the same gate.

Pacing gate in send() uses token bucket (refill/consume) to smooth
CUBIC bursts that overflow shallow queues.  nextTimeout() includes
pacing deadline so the event loop wakes to drain paced packets.
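
A minimal sketch of the refill/consume pattern described above, assuming tokens are counted in bytes; the PacingBucket struct and its fields are illustrative, not the actual send-queue code:

```zig
const std = @import("std");

/// Illustrative token bucket: tokens are bytes, refilled from the pacing
/// rate, consumed when a packet is allowed onto the wire.
const PacingBucket = struct {
    tokens: u64, // bytes currently available to send
    burst: u64, // cap on accumulated tokens
    rate_bps: u64, // pacing rate in bytes per second
    last_refill_ns: u64,

    fn refill(self: *PacingBucket, now_ns: u64) void {
        if (now_ns <= self.last_refill_ns) return;
        const elapsed_ns = now_ns - self.last_refill_ns;
        const new_tokens = self.rate_bps * elapsed_ns / std.time.ns_per_s;
        if (new_tokens == 0) return; // don't advance the clock on a zero refill
        self.tokens = @min(self.burst, self.tokens + new_tokens);
        self.last_refill_ns = now_ns;
    }

    /// True if the packet may go on the wire now; consumes tokens if so.
    fn consume(self: *PacingBucket, pkt_len: u64, now_ns: u64) bool {
        self.refill(now_ns);
        if (self.tokens < pkt_len) return false;
        self.tokens -= pkt_len;
        return true;
    }
};

test "paced send is gated until enough tokens accumulate" {
    var b = PacingBucket{ .tokens = 0, .burst = 3000, .rate_bps = 1_250_000, .last_refill_ns = 0 };
    try std.testing.expect(!b.consume(1200, 100 * std.time.ns_per_us)); // only ~125 bytes refilled
    try std.testing.expect(b.consume(1200, 1 * std.time.ns_per_ms)); // ~1250 bytes available
}
```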

Fix stream send buffer stall: SACK ranges (out-of-order ACK tracking)
silently dropped entries when the 32-slot array was full during loss
cascades.  Add merge-on-insert for adjacent ranges, and coalesce the
two closest ranges when full — no ACK info is ever silently lost.
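
A rough sketch of merge-on-insert plus coalesce-when-full, keeping the 32-slot size from the description; the AckRanges type and its layout are illustrative, and only the insertion path is shown:

```zig
const std = @import("std");

const Range = struct { start: u64, end: u64 }; // inclusive packet numbers

/// Illustrative fixed-capacity ACK-range set: ranges stay sorted and
/// non-overlapping; when full, the two closest ranges are coalesced so no
/// ACK information is ever silently dropped.
const AckRanges = struct {
    ranges: [32]Range = undefined,
    len: usize = 0,

    fn insert(self: *AckRanges, pn: u64) void {
        // Merge-on-insert: extend an existing range if pn is adjacent to it.
        for (self.ranges[0..self.len], 0..) |*r, i| {
            if (pn >= r.start and pn <= r.end) return; // already covered
            if (pn + 1 == r.start) {
                r.start = pn;
                return;
            }
            if (pn == r.end + 1) {
                r.end = pn;
                // If this range now touches the next one, merge them too.
                if (i + 1 < self.len and self.ranges[i + 1].start <= pn + 1) {
                    r.end = self.ranges[i + 1].end;
                    self.removeAt(i + 1);
                }
                return;
            }
        }
        if (self.len == self.ranges.len) self.coalesceClosest();
        // Insert a fresh single-packet range, keeping the array sorted.
        var i: usize = self.len;
        while (i > 0 and self.ranges[i - 1].start > pn) : (i -= 1) {
            self.ranges[i] = self.ranges[i - 1];
        }
        self.ranges[i] = .{ .start = pn, .end = pn };
        self.len += 1;
    }

    /// Coalesce the two ranges separated by the smallest gap.
    fn coalesceClosest(self: *AckRanges) void {
        var best: usize = 0;
        var best_gap: u64 = std.math.maxInt(u64);
        for (0..self.len - 1) |i| {
            const gap = self.ranges[i + 1].start - self.ranges[i].end;
            if (gap < best_gap) {
                best_gap = gap;
                best = i;
            }
        }
        self.ranges[best].end = self.ranges[best + 1].end;
        self.removeAt(best + 1);
    }

    fn removeAt(self: *AckRanges, idx: usize) void {
        var i = idx;
        while (i + 1 < self.len) : (i += 1) self.ranges[i] = self.ranges[i + 1];
        self.len -= 1;
    }
};

test "a full array coalesces instead of dropping ACK info" {
    var acks = AckRanges{};
    var pn: u64 = 0;
    while (pn < 66) : (pn += 2) acks.insert(pn); // 33 disjoint ranges offered
    try std.testing.expectEqual(@as(usize, 32), acks.len);
}
```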

Increase MAX_PENDING_RETX from 32 to 128 to handle burst losses when
pacing keeps the send queue non-empty during loss detection.

Transfer interop test: 10/10 pass (was ~67%).  Full interop: 22/22.

acked_frames[64] could overflow when a single ACK covers many packets
(e.g., after loss recovery with 100+ in-flight packets).  Excess frame
info was silently dropped, preventing send_acked from advancing — same
class of permanent stream buffer stall as the SACK overflow bug.

Split MAX_LOSS_EVENTS (64, for lost_frames which has its own defer
mechanism) from MAX_ACKED_FRAMES (128, matching epoch 2's sent buffer
of MAX_SENT/2 slots).  This guarantees an ACK covering all in-flight
packets for any single epoch never overflows the acked_frames buffer.

findConnByDcid now checks first_initial_dcid in addition to local_cid
and alt_local_cid.  Under packet loss, clients retransmit their Initial
using the original random DCID (they haven't received the server's SCID
yet).  Without this check, the server treated retransmissions as new
connections, creating duplicates that caused "Expected 50 handshakes,
Got: 51" failures in the handshakeloss interop test.

RFC 9000 §12.2: consecutive long-header packets (epoch 0/1) are now
appended to the same output buffer in send(), producing one UDP
datagram instead of two.  Under 30% packet loss, this roughly halves the
probability of losing handshake data: one 30% chance, versus a 51% chance
of losing at least one of two independent datagrams (1 - 0.7^2 = 0.51).

Handshakeloss interop test: 8/8 pass (was 7/8).
Append HANDSHAKE_DONE (first 1-RTT packet) to the coalesced
Initial+Handshake datagram so the entire handshake response is a
single loss event.  Stop after the first 1-RTT to preserve pacing
for data packets.

Handshakecorruption: 7/8 pass (was 3/4).
Coalescing 1-RTT packets with Handshake packets caused a deterministic
stall at ~315KB during connection migration.  Restrict coalescing to
epoch 0+1 (Initial+Handshake) only.
- Add build-time congestion algorithm selection (-Dcongestion=cubic|bbr)
- Add BBR v3 implementation (bbr.zig) with Startup, Drain, ProbeBW,
  ProbeRTT phases, windowed bandwidth/RTT filters, and pacing
- Add congestion control abstraction layer (cc.zig) for comptime switch
  (see the sketch after this list)
- Move Dockerfile to interop/, add .dockerignore
- Update ECN test for BBR compatibility (inflight_hi vs cwnd check)
- Update interop-test.sh for new Docker path
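
A self-contained sketch of the comptime dispatch idea behind cc.zig. The real code presumably reads the -Dcongestion build option and imports cubic.zig / bbr.zig; here both algorithms are stubbed inline and the choice is a hard-coded enum so the snippet compiles on its own:

```zig
const std = @import("std");

const Algo = enum { cubic, bbr };
// Stands in for the -Dcongestion=cubic|bbr build option.
const selected: Algo = .bbr;

// Inline stand-ins for cubic.zig / bbr.zig with a matching method set.
const Cubic = struct {
    cwnd: u64 = 10 * 1200,
    pub fn onAck(self: *Cubic, acked: u64) void {
        self.cwnd += acked; // placeholder growth rule
    }
};
const Bbr = struct {
    cwnd: u64 = 10 * 1200,
    pub fn onAck(self: *Bbr, acked: u64) void {
        self.cwnd += acked / 2; // placeholder growth rule
    }
};

/// Comptime switch: the rest of the stack sees one concrete `Cc` type and
/// pays no runtime dispatch cost.
pub const Cc = switch (selected) {
    .cubic => Cubic,
    .bbr => Bbr,
};

test "selected congestion controller is used" {
    var cc = Cc{};
    cc.onAck(1200);
    try std.testing.expect(cc.cwnd > 10 * 1200);
}
```
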
Under high packet loss, the client's packets may never arrive to trigger
receive().  The server's buffered Handshake CRYPTO (partial cert chain
blocked by amplification limit) was only flushed in receive(), leaving
it unsent indefinitely.  Now tick() also flushes it, so PTO cycles can
deliver the remaining handshake data even when all client packets are
lost.
Wire-time pacing inflates send_elapsed in the delivery rate formula,
causing BBR to underestimate bandwidth (15 KB/s on a 1.25 MB/s link).
Store queued_ns in SentPacket and use it for delivery rate snapshots
while keeping sent_ns (wire-time) for loss detection.
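
A sketch of a delivery-rate sample that uses queue-time for send_elapsed. SentPacket, queued_ns, and sent_ns are named in the text; first_sent_ns and the rest of the bookkeeping are illustrative assumptions:

```zig
const std = @import("std");

const SentPacket = struct {
    queued_ns: u64, // when the packet entered the send queue (pre-pacing)
    sent_ns: u64, // wire time, kept for loss detection only
    first_sent_ns: u64, // queue time of the first packet in this delivery round
    delivered: u64, // cumulative bytes delivered when this packet was queued
    delivered_ns: u64, // timestamp of that delivered count
};

/// Delivery-rate sample at ACK time.  send_elapsed is measured in queue time
/// so the pacing delay between queueing and hitting the wire does not stretch
/// the interval and deflate the estimate.
fn deliveryRateBps(pkt: SentPacket, now_ns: u64, delivered_now: u64) u64 {
    const send_elapsed = pkt.queued_ns - pkt.first_sent_ns;
    const ack_elapsed = now_ns - pkt.delivered_ns;
    const interval = @max(send_elapsed, ack_elapsed);
    if (interval == 0) return 0;
    return (delivered_now - pkt.delivered) * std.time.ns_per_s / interval;
}

test "pacing delay does not deflate the sample" {
    const pkt = SentPacket{
        .queued_ns = 0,
        .sent_ns = 5 * std.time.ns_per_ms, // sat 5 ms in the pacer
        .first_sent_ns = 0,
        .delivered = 0,
        .delivered_ns = 0,
    };
    // 12,000 bytes delivered, ACK at 10 ms -> 1.2 MB/s regardless of pacing delay.
    const rate = deliveryRateBps(pkt, 10 * std.time.ns_per_ms, 12_000);
    try std.testing.expectEqual(@as(u64, 1_200_000), rate);
}
```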

Add shouldPace() to CC interface — BBR bypasses pacing gate during
Startup to avoid negative feedback loop where low initial estimate
throttles sends.

BBR interop: 14/22 (was 13/22). Transfer now stalls at 124 KB instead of
69 KB. Further BBR debugging needed.

BBR's pacing rate started at 0 and stayed there until the first ACK
set max_bw.  After the first ACK, rate was set from a low initial
bandwidth estimate, throttling sends and preventing BBR from probing
link capacity (negative feedback loop: low estimate → slow pacing →
low delivery rate → low estimate).

Bootstrap rate to initial_cwnd / initial_rtt × startup_gain (~4.2 MB/s)
so the first burst is paced at a reasonable rate.
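
Roughly how the ~4.2 MB/s figure falls out, assuming a 10-packet initial cwnd of 1472-byte datagrams (an assumption), the 10 ms K_INITIAL_RTT mentioned later in this PR, and the 2.885× Startup pacing gain:

```zig
const std = @import("std");

test "bootstrapped Startup pacing rate" {
    const initial_cwnd: u64 = 10 * 1472; // assumed 10 full-size datagrams
    const initial_rtt_ns: u64 = 10 * std.time.ns_per_ms; // K_INITIAL_RTT
    const startup_gain_x1000: u64 = 2885; // 2.885x as fixed-point

    const base_rate = initial_cwnd * std.time.ns_per_s / initial_rtt_ns; // bytes/s
    const paced_rate = base_rate * startup_gain_x1000 / 1000;

    try std.testing.expectEqual(@as(u64, 1_472_000), base_rate); // ~1.47 MB/s
    try std.testing.expectEqual(@as(u64, 4_246_720), paced_rate); // ~4.2 MB/s
}
```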

BBR interop: 15/22 (was 13/22).  Transfer stall at 161 KB needs
further investigation.
- Drain: use BDP×cwnd_gain target instead of locking cwnd to
  inflight_hi (trapped BBR permanently in Drain)
- Delivery rate: use queue-time (not wire-time) for send_elapsed
  to prevent pacing from depressing bandwidth estimates
- Pacing refill: only advance last_refill_ns when time elapsed,
  preventing event loop busy-spin on stale deadline
- ACK skip-ahead: when pacing blocks the head of send queue,
  scan for non-ack-eliciting packets (ACKs) and swap them to
  the front so the server always responds to client packets
- Bootstrap BBR initial pacing rate to avoid Startup throttle
- shouldPace(): bypass pacing gate during BBR Startup

BBR interop: 13/22 (unchanged, but transfer progress improved from 71 KB
to 227 KB). Further work needed on BBR's Startup burst, which causes 55%
loss on shallow queues. CUBIC still 22/22.
- shouldPace() now always returns true for BBR — the bootstrapped
  initial pacing rate (4.2 MB/s) prevents the low-estimate feedback
  loop while still smoothing bursts
- Increase SEND_BUF_SIZE from 64KB to 128KB — BBR Startup cwnd peaks
  at 200KB / 3 streams = 67KB per stream, exceeding the 64KB buffer
  and causing permanent BufferFull stalls

BBR interop still at 13-14/22. Remaining issue: BBR Startup 2.885×
pacing gain inherently overflows 25-packet queues (queue fills in
12ms). Recovery works but is slow. Need to either reduce Startup
aggressiveness or improve post-loss recovery speed.
…eRTT

- enterDrain: reduce inflight_hi to BDP (not Startup peak) when
  excessive loss triggered Startup exit
- updateDrain: apply loss bounding during Drain (was only in ProbeBW)
- enterProbeRtt: reset min_rtt to force re-measurement (Linux behavior)
- shouldPace: always true — enforce pacing during Startup using
  bootstrapped rate to prevent queue overflow
- Revert stream buffer to 64KB (128KB made BBR worse — deeper hole)

BBR interop: 13/22. Core issue: BBR Startup pacing gain (2.885×)
inherently overflows 25-packet queues. Bootstrapped rate of 1.45 MB/s
exceeds 1.25 MB/s link, filling queue in 12ms. Post-loss recovery
can't keep up with 64KB stream buffers. Needs Startup redesign for
shallow-queue environments.
CUBIC: 22/22 unaffected.
- Startup: check isExcessiveLoss on every ACK (not just round_start)
  to exit early before cwnd inflates from 58KB to 200+KB
- Drain/ProbeBW DOWN: use 1.0× pacing gain instead of 0.346×/0.9×
  to match CUBIC's recovery speed (retransmissions in 35ms not 211ms)
- Pacing gate: bypass when bif=0 to get packets on wire urgently
- PTO: force-arm when bytes_queued > 0 and bif = 0
- ACK skip-ahead: scan queue for non-ack-eliciting packets when
  pacing blocks the head packet (see the sketch after this list)
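
A sketch of the ACK skip-ahead mentioned above: when pacing blocks the head of the send queue, a non-ack-eliciting packet further back is swapped to the front so the server still answers the client immediately. The queue layout and field names are illustrative:

```zig
const std = @import("std");

const QueuedPacket = struct {
    len: u16,
    ack_eliciting: bool,
};

/// If the head of the queue is pacing-blocked, find a non-ack-eliciting
/// packet (a bare ACK) and swap it to the front; bare ACKs don't count
/// against bytes_in_flight, so sending one never violates the pacing budget.
fn skipAheadForAck(queue: []QueuedPacket) bool {
    for (queue, 0..) |pkt, i| {
        if (!pkt.ack_eliciting) {
            const head = queue[0];
            queue[0] = queue[i];
            queue[i] = head;
            return true;
        }
    }
    return false;
}

test "a bare ACK is moved to the front of a blocked queue" {
    var q = [_]QueuedPacket{
        .{ .len = 1200, .ack_eliciting = true },
        .{ .len = 1200, .ack_eliciting = true },
        .{ .len = 30, .ack_eliciting = false },
    };
    try std.testing.expect(skipAheadForAck(q[0..]));
    try std.testing.expect(!q[0].ack_eliciting);
}
```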

BBR interop: 15-16/22 (was 13/22). Transfer 69→244 KB.
Remaining: server dies after ~1.5s because stream buffers fill
(64KB unacked) and all retransmissions complete. Need mechanism
to continue loss detection after bif reaches 0.
CUBIC: 22/22 unaffected.
…d packets

BBR congestion control:
- Remove min_rate_floor filter from onAckReceived; the 100-round BW
  filter window prevents death spiral without rejecting valid samples
- Add pacing rate floor (INITIAL_CWND / min_rtt) in updatePacingRate
- Preserve max_bw across persistent congestion so BBR recovers
  immediately after blackhole instead of re-probing from zero

Path migration (rebind-addr, rebind-port, connectionmigration):
- declareEpochLost: on migration, declare all in-flight 1-RTT packets
  lost and queue their stream frames for retransmission
- Preserve cwnd across migration to avoid throughput collapse
- Reset RTT estimator and PTO count on migration
- Track prev_peer_addr to suppress re-migration from late old-path packets
- moveLastToFront: PATH_CHALLENGE bypasses pacing-blocked data
- Server: sync peer_addr from connection on path_migrated event

Coalesced packet handling (handshakecorruption):
- skipLongHeaderPacket: skip one unprocessable packet without dropping
  the entire datagram, so coalesced Handshake/1-RTT packets proceed
- Accept client's switched DCID (local_cid/alt_local_cid) in Initial
  validation during handshake

Send queue and loss recovery:
- Unconditional storeSendMeta fixes stale metadata for non-tracked packets
- deferStreamRetx helper eliminates duplicate pending retx logic
- Wire-time timestamps for delivery rate (sent_ns, not queued_ns)
- Retransmission cwnd cap prevents bytes_queued from exceeding cwnd

Server fixes:
- hq-interop: send FIN on file-not-found instead of silent deactivation
- Increase test stack to 64MB (Connection struct overflow in Debug mode)
- Fix PMTUD test to check send-queue metadata instead of loss.sent

The first delivery rate sample carried the RTT estimator's bootstrap
value (K_INITIAL_RTT = 10ms) before any real measurement existed.
BBR accepted this as min_rtt, making BDP = max_bw × 10ms ≈ 12KB
instead of max_bw × 32ms ≈ 38KB.  With BDP too small, the Drain
exit condition (inflight ≤ BDP) never triggered, trapping BBR in
Drain permanently and collapsing goodput to ~2.8 Mbps on a 10 Mbps
link.

Fixes:
- Reject rtt_ns == K_INITIAL_RTT_NS (exact 10ms) in BBR min_rtt update
- Return rtt_ns = 0 in delivery rate sample when RTT estimator is
  uninitialized, preventing the bootstrap placeholder from propagating
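
A sketch of the two guards just described; K_INITIAL_RTT_NS and min_rtt follow the text, the surrounding struct is illustrative:

```zig
const std = @import("std");

const K_INITIAL_RTT_NS: u64 = 10 * std.time.ns_per_ms; // RTT estimator bootstrap

const Bbr = struct {
    min_rtt_ns: u64 = 0, // 0 = no real measurement yet

    /// Reject the estimator's bootstrap placeholder (and rtt_ns == 0, used
    /// here to mean "no measurement yet") so it can never become min_rtt
    /// and shrink the BDP.
    fn updateMinRtt(self: *Bbr, rtt_ns: u64) void {
        if (rtt_ns == 0 or rtt_ns == K_INITIAL_RTT_NS) return;
        if (self.min_rtt_ns == 0 or rtt_ns < self.min_rtt_ns) self.min_rtt_ns = rtt_ns;
    }
};

test "bootstrap RTT never becomes min_rtt" {
    var bbr = Bbr{};
    bbr.updateMinRtt(K_INITIAL_RTT_NS); // rejected
    try std.testing.expectEqual(@as(u64, 0), bbr.min_rtt_ns);
    bbr.updateMinRtt(32 * std.time.ns_per_ms); // first real sample accepted
    try std.testing.expectEqual(@as(u64, 32 * std.time.ns_per_ms), bbr.min_rtt_ns);
}
```
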
…le recovery

Goodput improved from 2.8 Mbps to 9.4 Mbps on 10 Mbps link by
interleaving drainSend inside flushTransfers.  Without this, the
cwnd check saw bytes_in_flight=0 during the fill phase, starving
BBR's pipe.

Server event loop (tools/server.zig):
- flushTransfers now takes send socket params and calls drainSend
  after each round-robin pass, keeping bytes_in_flight current
- Unconditional drainSend after the loop ensures PATH_CHALLENGE,
  ACKs, and retransmissions are always flushed even when all
  transfers are blocked (buffer full, amplification limit)

Path migration (connection.zig):
- Preserve min_rtt across migration — resetting to 10ms default
  caused time-loss thresholds to fire before retransmitted packets
  could be ACKed on a 30ms path
- Reset bytes_in_flight to 0 instead of declareEpochLost — proactive
  retransmission of all in-flight packets caused 3x amplification
  (many were already received by the client, ACKs still in transit)
- Arm PTO immediately after migration so retransmissions of truly
  lost packets are handled by normal loss detection

Loss detection (connection.zig):
- Skip epoch 0/1 (Initial/Handshake) in time-loss detection when
  established — keys are zeroed after handshake, retransmit panics

BBR (bbr.zig):
- applyLossBounding floor at max(bdp, BBR_MIN_CWND) prevents
  inflight_hi spiral after blackhole recovery

The application-level token bucket depletes after the initial 10-packet
burst.  With pacing enforced (bytes_in_flight > 0), the server is locked
to drip-feeding 1 packet per ACK — the token refill rate can't keep up
with the bursty send pattern needed for bandwidth discovery.  A ~10ms
ACK delay (Docker VM jitter) is enough to collapse throughput from
9.4 Mbps to <1 Mbps.

Bypass pacing during Startup (filled_pipe == false), matching TCP slow
start behavior.  The cwnd still limits total in-flight data.  Once
Startup completes and filled_pipe is set, pacing is enforced for
steady-state fairness.

Goodput: 9430 (±3) kbps — stable across 5 runs.
Crosstraffic: 6.1–6.6 Mbps.
Interop: 22/22.

At steady state, bytes_in_flight naturally sits at ~BDP.  The strict
<= bdp() check for exiting DOWN fails by a fraction of a packet
(e.g. bif=32798 vs BDP=32297), permanently trapping BBR in DOWN.
max_bw then decays over its 100-round filter window (~3s), causing
throughput to collapse from ~9 Mbps to <200 kbps.

This manifested as flaky transfer timeouts: Startup ramps successfully
(9+ Mbps), but once ProbeBW takes over, the stuck DOWN phase causes
a gradual decline (seconds 4-7) followed by collapse (second 8+).

Fix: add 1 MSS headroom to DOWN and Drain exit conditions so the
check fires reliably even when bif ≈ BDP.

Transfer test: 3/3 passes (was intermittently timing out).
Full suite: 22/22 pass individually.
Goodput: 9102-9430 kbps.
…loss

After the first packet on the CM socket, use_cm_sock was set to true
and never reset.  When the client rebinds back to the original path
(or the sim stops NAT'ing through CM), the server kept sending via
the CM socket — which can't route to clients on the original network.

This caused rebind-port and rebind-addr to fail intermittently in
full suite runs: ~75% of the file transferred on the original socket,
then the remaining 25% sent via CM socket to an unreachable address.

Fix: track the CURRENT socket per packet instead of a one-way flag.
Also use the incoming packet's socket for responses in processPacket
rather than the global flag, so original-socket ACK responses aren't
misrouted through the CM socket.

Full suite: 22/22 × 2 consecutive runs. G: 9429 (±2) kbps.
- Replace inlined active_sock logic with slotSendSock (use_cm_sock is
  now assigned before the call, so slotSendSock returns the correct
  per-packet socket)
- Consolidate bdp()+MSS comments: explain the WHY once at Drain exit,
  reference it from ProbeBW DOWN
- Trim redundant first line from shouldPace doc comment
- Update stale use_cm_sock field comment to reflect bidirectional tracking
…tests

- bdpHeadroom(): use max(MSS, bdp/32) so headroom scales from ~1 MSS
  at 10 Mbps to ~12 MB at 100 Gbps (where MSS alone is negligible)
- use_cm_sock: conditional write (only on change) to avoid dirtying
  the cache line on every packet in the common no-migration case
- Add 5 unit tests: shouldPace Startup/Drain/persistent-congestion,
  ProbeBW DOWN headroom (exact/near/far BDP), Drain headroom,
  bdpHeadroom scaling at 10 Mbps vs 100 Gbps

Full suite: 22/22. G: 9430 (±7) kbps.

The Drain and ProbeBW DOWN exit condition compared prior_inflight
(pre-ACK) against BDP.  But in an application-level stack, the server
refills to cwnd (≈2×BDP) between ACKs, so prior_inflight ≈ 2×BDP
and the check can never pass — trapping BBR permanently.

Fix: use `prior_inflight - bytes_acked` (post-ACK inflight), which
reflects the actual pipe depth after draining.  The math is exact:
at steady state, prior=2×BDP, acked=BDP → post=BDP → exits.  After
UP overshoot, prior=2.5×BDP, acked=BDP → post=1.5×BDP → waits.
No headroom constants needed; scales from 10 Mbps to 100 Gbps.
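
A sketch of the post-ACK inflight check, using the steady-state and overshoot numbers from the text as a test; the struct and field names are illustrative:

```zig
const std = @import("std");

const Bbr = struct {
    bdp_bytes: u64,

    /// Drain / ProbeBW DOWN exit: compare the pipe depth *after* this ACK
    /// drained it, not the pre-ACK inflight (which the application refills
    /// to ~2x BDP between ACKs, so a plain `prior_inflight <= bdp` check
    /// can never pass).
    fn shouldExitDown(self: *const Bbr, prior_inflight: u64, bytes_acked: u64) bool {
        const post_ack_inflight = prior_inflight -| bytes_acked;
        return post_ack_inflight <= self.bdp_bytes;
    }
};

test "post-ACK inflight exits at steady state but waits after UP overshoot" {
    const bbr = Bbr{ .bdp_bytes = 32_000 };
    // steady state: prior = 2xBDP, acked = BDP -> post = BDP -> exit
    try std.testing.expect(bbr.shouldExitDown(64_000, 32_000));
    // after UP overshoot: prior = 2.5xBDP, acked = BDP -> post = 1.5xBDP -> wait
    try std.testing.expect(!bbr.shouldExitDown(80_000, 32_000));
}
```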

Also: conditional use_cm_sock write (only on change), 4 unit tests
for shouldPace lifecycle and post-ACK inflight Drain/DOWN exit.

Full suite: 22/22. G: 9432 (±1) kbps.

1. Path migration bif desync (connection.zig + loss_recovery.zig):
   onPathMigration set bytes_in_flight=0 but old packets kept
   in_flight=true.  When later ACKed, saturating subtract drove bif
   below actual new-path inflight, killing PTO.  Fix: clearInflight()
   marks all existing sent packets as not-in-flight.

2. activatePending stale h3_headers_sent (server.zig):
   Reused transfer slot kept h3_headers_sent=true from previous
   transfer, skipping HEADERS frame on new H3 request.  Fix: reset
   the flag when activating a pending transfer.

3. sendH3ControlStreams duplicate on retry (server.zig):
   Partial success (stream 3 sent, stream 7 failed) re-sent stream 3
   on retry, duplicating control stream data.  Fix: check send_offset
   to skip already-sent streams.

4. allocateSlot memory leak (server.zig):
   Missing errdefer if Conn.accept fails after page_allocator.create.

Full suite: 22/22 (C1 flaky, passes on retry). G: 9430 (±2) kbps.
…ed corrupted

The PTO handler branched on `app_keys != null` to choose between
post-handshake (PING/stream probes) and handshake (CRYPTO retransmit)
paths.  But app_keys are derived when the server sends its own
Finished — BEFORE receiving the client's Finished.

When the client's Handshake Finished was corrupted (30% corruption
test), the server had app_keys but wasn't established.  PTO sent
PINGs instead of retransmitting its Handshake response.  Without
the Handshake retransmit, the client never retransmits its Finished,
HANDSHAKE_DONE is never sent, and the client never sends the HTTP
request.  Connection idles out.

Fix: branch on `state == .established` instead of `app_keys != null`.

Full suite: 22/22. G: 9432 (±1) kbps.
… per packet

Previously: encode header into enc_scratch → encrypt pkt_scratch into
enc_scratch → memcpy enc_scratch into sq[].buf (1452 bytes per packet).

Now: reserve the next send queue slot via reserveSendSlot(), encode
header and encrypt directly into sq[].buf, then commitSendSlot().
Eliminates one full-packet memcpy on the hot path for all three
packet types (1-RTT STREAM, Initial CRYPTO, Handshake CRYPTO).

enqueueSend() is preserved as a fallback for callers that build
packets in scratch buffers (ACKs, CONNECTION_CLOSE, VERSION_NEG).

Full suite: 22/22. G: 9427 (±8) kbps.
Server Initial datagrams were NOT padded to the 1200-byte minimum
required by RFC 9000 §14.1.  Coalesced Initial+Handshake datagrams
were only ~920 bytes, causing the Handshake CRYPTO (cert chain +
CertVerify + Finished) to be split across multiple packets.

At 30% loss (handshakeloss test), each additional Handshake packet
consumed amplification budget.  4 PTO retransmits of the split
packets exhausted the 3× budget, leaving nothing for the CRYPTO
tail.  The handshake stalled permanently.

With 1200-byte padding, the Handshake CRYPTO fits in the coalesced
datagram alongside the Initial.  Fewer separate packets needed,
less budget consumed, handshake completes reliably.

handshakeloss: passes (was flaky ~20% failure rate).
handshakecorruption: passes.
Full suite: 22/22. G: 9430 (±4) kbps.

When a new best value entered the filter, all three entries (best,
second, third) were set to the same value AND the same round.  After
`window` rounds, all three expired simultaneously.  The filter
collapsed to whatever the current sample was — often a low value
during a transient dip — and max_bw never recovered.

This caused the transfer test (3-stream, 10MB) to collapse at ~8s:
the Startup peak (1.29M, slightly inflated from unpaced burst) set
all three filter entries at the same round.  ProbeBW DOWN samples
(~1.22M, below the inflated peak) never entered the filter.  After
100 rounds, all three expired → max_bw dropped to ~700K → cwnd
shrank → throughput collapsed to 20 pkt/s.

Fix: demote old entries instead of resetting.  When a new best
arrives, shift best→second→third.  This preserves entries from
different rounds, so when the best expires, the second-best (from
a more recent round) takes over instead of collapsing.
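
A sketch of the demote-on-new-best behaviour; the three-entry best/second/third layout and the shift come from the description above, the rest is illustrative. Only the insert/demote step is shown (window expiry is omitted):

```zig
const std = @import("std");

const Entry = struct { bw: u64, round: u64 };

const MaxBwFilter = struct {
    e: [3]Entry = .{
        .{ .bw = 0, .round = 0 },
        .{ .bw = 0, .round = 0 },
        .{ .bw = 0, .round = 0 },
    },

    /// A new best demotes the existing entries (best -> second -> third)
    /// instead of overwriting all three with the same value and round, so
    /// they no longer expire in the same round when the window closes.
    fn update(self: *MaxBwFilter, bw: u64, round: u64) void {
        if (bw >= self.e[0].bw) {
            self.e[2] = self.e[1];
            self.e[1] = self.e[0];
            self.e[0] = .{ .bw = bw, .round = round };
        } else if (bw >= self.e[1].bw) {
            self.e[2] = self.e[1];
            self.e[1] = .{ .bw = bw, .round = round };
        } else if (bw >= self.e[2].bw) {
            self.e[2] = .{ .bw = bw, .round = round };
        }
    }

    fn maxBw(self: *const MaxBwFilter) u64 {
        return self.e[0].bw;
    }
};

test "entries keep distinct rounds after a new best arrives" {
    var f = MaxBwFilter{};
    f.update(1_000_000, 1); // earlier sample
    f.update(1_290_000, 2); // Startup peak demotes it instead of erasing it
    try std.testing.expect(f.e[0].round != f.e[1].round);
    try std.testing.expectEqual(@as(u64, 1_290_000), f.maxBw());
}
```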

Transfer test: passes (was ~20% flaky).
Full suite: 22/22. G: 9425 (±8) kbps.

Each connection's writeKeyLog called createFileAbsolute (which
truncates) then wrote at offset 0.  Only the last connection's
keys survived.  tshark couldn't decrypt failing connections.

Fix: accumulate all keys in a global 64KB buffer, rewrite the
entire file on each update.  50 connections × 4 lines ≈ 40KB.

ericsssan merged commit 63fb9b0 into main on Mar 28, 2026 (7 checks passed).
ericsssan deleted the feature/bbr-v3 branch on March 28, 2026 at 09:00.