Phase 6.3: TCP window management — track guest window, advertise host kernel rcv-space#79

Open
dpsoft wants to merge 11 commits into main from phase6.3-window-mgmt-rebased

Conversation


@dpsoft dpsoft commented May 6, 2026

What this branch does

Stops ignoring the guest's advertised TCP window and stops hardcoding our own. Three correctness/perf gaps closed:

  1. Track the guest's advertised window — every incoming frame's window_len (scaled by window_scale from SYN options) is stashed on the flow.
  2. Honor it on host→guest sends — relay_tcp_nat_data gates frames_to_inject on guest_window - bytes_in_flight, so the relay stops when the guest's receive buffer is full instead of pretending it's infinite. Phase 3's 256 KB cap was a band-aid for the symptom.
  3. Advertise our own window from real backpressure — outgoing frames carry host_recv_window(fd) (computed from getsockopt(TCP_INFO).tcpi_rcv_space >> OUR_WINDOW_SCALE) instead of a hardcoded 65535. SYN-ACK negotiates window_scale: 7 (matches passt; 128× → 8 MiB max).

Headline win

| Workload | Before | After |
|---|---|---|
| Host→guest send when guest is slow | unbounded inject_to_guest queue (Phase 3 capped it at 256 KB userspace cliff) | bounded by guest's guest_window (modern Linux: 4 MB+ scaled) |
| Window scale negotiation | none (max 64 KB / RTT) | 7 (max 8 MiB / RTT) |
| Advertised window source | hardcoded 65535 | getsockopt(TCP_INFO).tcpi_rcv_space |

Architecture

  • New TcpNatEntry::guest_window: u32 and guest_window_scale: u8 (#[serde(default)] for snapshot back-compat with pre-6.3).
  • SYN handler parses tcp.window_scale() option and stashes it; every incoming frame refreshes entry.guest_window = u32::from(tcp.window_len()) << guest_window_scale.
  • relay_tcp_nat_data adds a window_remaining = guest_window - bytes_in_flight gate; when zero, breaks out (waits for guest ACK).
  • build_tcp_packet_static signature now takes (window_len, window_scale). SYN-ACK passes (65535, Some(7)); data/ACK frames pass (host_recv_window(fd), None).
  • New host_recv_window(fd) -> u16 helper: one getsockopt(IPPROTO_TCP, TCP_INFO, ...) call, returns tcpi_rcv_space >> 7 clamped to u16::MAX. Falls back to 32768 on syscall error.

Bench evidence — divan microbenches (vs current main)

scripts/bench-compare.sh --baseline origin/main --skip-vm:

| Bench | Baseline | HEAD | Δ% |
|---|---|---|---|
| **Wins** | | | |
| tcp_inbound_syn_ack_transition | 63.1 µs | 50.9 µs | -19.4% |
| process_udp_frame | 31.9 µs | 27.2 µs | -14.7% |
| port_forward_accept_latency | 193 µs | 182 µs | -5.9% |
| **Parity / noise** | | | |
| dns_cache_hit | 942 ns | 940 ns | -0.1% |
| nat_translate_outbound_hot_path | 2.54 ns | 2.57 ns | +1.3% |
| flow_table_insert_remove/100 | 4.49 µs | 4.45 µs | -0.9% |
| process_syn | 28.5 µs | 28.7 µs | +0.8% |
| **Small regressions (per-frame getsockopt(TCP_INFO) cost)** | | | |
| tcp_bulk_throughput_1mb | 61.3 ms | 62.3 ms | +1.6% |
| tcp_rx_latency_one_packet | 9.55 µs | 10.2 µs | +6.7% |
| process_icmp_echo_request | 21.9 µs | 23.3 µs | +6.1% |
| flow_table_insert_remove/1000 | 29.6 µs | 31.4 µs | +5.9% |
| poll_with_n_mixed_flows/999 | 9.76 µs | 10.1 µs | +3.5% |
| **New benches (HEAD only — Phase 6.3 introduces them)** | | | |
| tcp_bulk_throughput_constrained_window/4096 | — | 2.63 ms | new |
| tcp_bulk_throughput_constrained_window/16384 | — | 1.81 ms | new |
| tcp_bulk_throughput_constrained_window/65536 | — | 1.94 ms | new |

Wall-clock VM harness (voidbox-network-bench)

| Metric | Baseline (post-#78 main) | HEAD | Δ% |
|---|---|---|---|
| tcp_rr_latency_us_p50 | 4 µs | 2 µs | -50.0% |
| tcp_rr_latency_us_p99 | 32 µs | 26 µs | -18.8% |
| tcp_crr_latency_us_p50 | 10129 µs | 10136 µs | +0.1% (parity) |
| tcp_throughput_g2h_mbps | 5943 | 5739 | -3.4% |

The 3.4% g2h throughput regression appears to be the per-outgoing-frame getsockopt(TCP_INFO) syscall cost in host_recv_window. Profiling planned as a follow-up — candidate fixes: cache the value with a 1 ms TTL, or move the syscall onto the net-poll thread's housekeeping cadence so the data path uses a stale-but-recent value. The trade is worth it for now: correct backpressure is a correctness fix, not a perf trick. Phase 6.4 epoll dispatch absorbs the latency improvements (RR p50 -50%) so the net change vs pre-Phase-6.x main is heavily positive.

Snapshot interaction

Pre-6.3 snapshots restore cleanly: both new fields have #[serde(default)] and default to (65535, 0) which is the pre-6.3 behavior (no scale, ignore guest window — same as if the entry was a Phase 6.0 entry). Verified via existing snapshot_integration suite.

passt-comparison status

Documented as a deferred task in docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md ("passt head-to-head methodology"). Methodology agreed: same hardware, two-column report, focus on CRR latency (apples-to-apples since CRR is dominated by NAT-table ops, not MMIO exit overhead). Building the passt+qemu reference harness is a separate follow-up PR.

Commits (10)

Cherry-picked clean from smoltcp-passt-port-phase6.3-window-mgmt onto current main (post-#78):

  1. docs: Phase 6.3 detailed TDD plan — TCP window management
  2. feat(slirp): TcpNatEntry tracks guest_window + guest_window_scale
  3. feat(slirp): parse guest's window_scale on SYN, store on flow
  4. feat(slirp): track guest's advertised window on every incoming frame
  5. refactor(slirp): build_tcp_packet_static takes (window_len, window_scale)
  6. feat(slirp): advertise host-kernel-derived window on outgoing frames
  7. test(network): pin tcp_advertised_window_tracks_guest_buffer (BROKEN_ON_PURPOSE)
  8. feat(slirp): gate host→guest send on guest's advertised window — flips the BROKEN_ON_PURPOSE pin
  9. test(network): pin tcp_window_scale_negotiated_in_synack
  10. bench(network): tcp_bulk_throughput_constrained_window parametric

Test plan

  • cargo fmt --all -- --check — clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings — clean
  • RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features — clean
  • cargo test --test network_baseline -- --test-threads=1 — 24/24 (was 22; +2 window pins)
  • cargo test --test network_baseline --features bench-helpers -- --test-threads=1 — 26/26
  • scripts/bench-compare.sh --baseline origin/main --skip-vm — see table above
  • scripts/bench-compare.sh --baseline origin/main --skip-divan (VM wall-clock) — see table above
  • CI

Replaces draft #75

Same window-management content via the now-superseded #74 chain. Close #75 once this lands.

Follow-ups (not blocking this PR)

  1. host_recv_window perf: profile the +1.6% bulk regression; cache TCP_INFO with short TTL or move into housekeeping cadence.
  2. passt head-to-head harness (separate PR per the deferred plan).
  3. HashMap cache-miss audit of the flow table (HashMap<FlowKey, FlowEntry>) — separately tracked: data-path pollers do linear scans by FlowKey variant, which on a 1000-flow table at small entries is cache-unfriendly. Candidate: split into per-protocol maps or move to small-vector for low-flow-count sandboxes.

dpsoft added 10 commits May 6, 2026 08:00

10 bite-sized tasks covering proper TCP windowing:

- TcpNatEntry tracks guest_window (u32) + guest_window_scale (u8)
- handle_tcp_frame parses tcp.window_scale() on guest SYN, stores
  per-flow; updates guest_window on every incoming frame
- build_tcp_packet_static signature changes to take
  (window_len, window_scale) — caller decides
- SYN-ACK negotiates OUR_WINDOW_SCALE = 7 (passt's default; 128x)
- New host_recv_window helper queries TCP_INFO.tcpi_rcv_space and
  scales it for the advertised window on outgoing frames
- relay_tcp_nat_data gates host→guest sends on entry.guest_window
  to honor real backpressure
- Three new pins: tcp_advertised_window_tracks_guest_buffer
  (BROKEN_ON_PURPOSE → flips at Task 7),
  tcp_window_scale_negotiated_in_synack, plus
  tcp_bulk_throughput_constrained_window parametric bench

Severity: MEDIUM — perf gap. Hardcoded window_len: 65535 caps
throughput at 64 KB / RTT regardless of bandwidth, and
inject_to_guest can grow unbounded if the guest is slow.

Adds tcp_bulk_throughput_constrained_window bench that exercises the
Task 7 window-gating path under three guest-window sizes (4096, 16384,
65536 bytes). Mirrors tcp_bulk_throughput_1mb with a parametric window
so regressions in window-constrained relay show up numerically.

dpsoft commented May 6, 2026

Profiling note: tcp_bulk_throughput_1mb regression root-caused

Followed up on the divan +1.6% / VM wall-clock -3.4% throughput regression with perf-agent (eBPF, PMU + on-CPU + off-CPU). 30 s capture, single bench process, properly symbolized.

PMU summary

| Metric | Value | Reading |
|---|---|---|
| IPC | 0.777 | moderately memory-bound (threshold: <0.8) |
| Cache misses / 1K instr | 3.666 | below the 10/1K "investigate" threshold — HashMap cache-miss hypothesis NOT confirmed for this workload |
| P99.9 on-CPU | 10.17 ms | healthy (<50 ms) |
| Preempted | 46.5% | expected for a CPU-bound bench |

On-CPU flat hotspots

| Function | Flat % | Note |
|---|---|---|
| handle_tcp_frame | 26.70% | per-incoming-frame parsing (smoltcp wire + dispatch) |
| __libc_recv | 29.90% cum | host kernel TCP recv |
| __libc_send | 25.03% cum | host kernel TCP send |
| EpollDispatch::wait_with_timeout | 13.63% | epoll_wait + drain |
| __getsockopt | 5.70% | host_recv_window's per-outgoing-frame getsockopt(TCP_INFO) — the regression |

Conclusion

The throughput regression traces to one syscall, not the data-structure layout. host_recv_window calls getsockopt(IPPROTO_TCP, TCP_INFO, ...) on every outgoing frame; at 5 Gbps that's ~10k getsockopt/s on the data path.

Proposed follow-up (separate small PR, not blocking this one)

Cache host_recv_window per-flow with a short TTL — say 5 ms, well below RTT. At 10k frames/s that drops to ~200 getsockopt/s, ~50× reduction, while the advertised window still tracks within 5 ms of reality.

```rust
// On TcpNatEntry:
cached_recv_window: u16,
cached_recv_window_at: Instant,

// In the build_tcp_packet_static call sites for data/ACK frames:
const RECV_WINDOW_TTL: Duration = Duration::from_millis(5);
if entry.cached_recv_window_at.elapsed() > RECV_WINDOW_TTL {
    entry.cached_recv_window = host_recv_window(entry.host_stream.as_raw_fd());
    entry.cached_recv_window_at = Instant::now();
}
```

The HashMap-flow-table cache-miss audit is still a worthwhile separate exercise, but the divan/wall-clock regression seen on this PR isn't traceable to it. IPC of 0.78 suggests we're modestly memory-bound elsewhere (likely the smoltcp wire-decode hot path), but the cache-miss rate doesn't indicate pathological structures.

Profiles archived locally:

  • /tmp/p63-bench-cpu.pb.gz (CPU stacks)
  • /tmp/p63-bench-offcpu.pb.gz (off-CPU)
  • /tmp/p63-bench-pmu.txt (PMU)

Profiling tcp_bulk_throughput_1mb showed __getsockopt at 5.7% flat CPU
— Phase 6.3's host_recv_window was issuing one getsockopt(TCP_INFO)
per outgoing TCP frame, costing ~10k syscalls/s at line rate.

Cache the result on TcpNatEntry and refresh only every RECV_WINDOW_TTL
(5 ms). At line rate this collapses to ~200 syscalls/s — a ~50x
reduction — while the advertised window stays within 5 ms of reality,
which is well below any realistic RTT.

cached_recv_window is initialized at flow construction with one
host_recv_window call so the first emitted frame doesn't pay the
syscall cost on the data path either.

dpsoft commented May 6, 2026

Cache fix landed and re-profiled — regression eliminated

Commit 1b9ba72 adds per-flow cached_recv_window with a 5 ms TTL. The result confirms the profiling diagnosis was correct: __getsockopt is no longer in the top hotspots and the Phase 6.3 throughput regression collapses to noise.

Divan microbenches — before/after the cache fix (vs current main)

| Bench | Pre-fix Δ% | Post-fix Δ% | Recovery |
|---|---|---|---|
| tcp_bulk_throughput_1mb | +1.6% | +0.1% | regression eliminated |
| tcp_rx_latency_one_packet | +6.7% | +2.3% | recovered 4.4 pp |
| tcp_inbound_syn_ack_transition | -19.4% | -30.5% | even faster post-fix |
| process_icmp_echo_request | +6.1% | +1.9% | recovered 4.2 pp |
| flow_table_insert_remove/1000 | +5.9% | -2.0% | now better than baseline |

Some flow-construction benches show small regressions (process_syn +4.6%, port_forward_accept_latency +6.1%, process_syn_during_pending_connects/0 +7.2%) — that's the one-time host_recv_window syscall now at flow-creation rather than per-frame. Pay-once-per-flow vs pay-per-packet is the right trade. At line rate (~10k packets/s, ~50 connects/s) this is a >100× syscall reduction.

VM wall-clock — before/after vs current main

| Metric | Pre-fix Δ% | Post-fix Δ% |
|---|---|---|
| tcp_throughput_g2h_mbps | -3.4% (5942 → 5739) | -0.2% (5776 → 5765) |
| tcp_rr_latency_us_p50 | -50% | parity (both at 2 µs) |
| tcp_crr_latency_us_p50 | parity | parity |

PMU — before/after (same 30s capture per side, single bench process)

| Metric | Pre-fix | Post-fix | Δ |
|---|---|---|---|
| IPC | 0.777 | 0.786 | +1.2% |
| Cache misses / 1K instr | 3.666 | 3.924 | +7.0% (denominator effect) |
| Total cache misses (abs) | 86.83 M | 84.36 M | -2.85% |
| Total instructions | 23.68 B | 21.50 B | -9.2% |
| Total cycles | 30.47 B | 27.35 B | -10.3% |
| P99.9 on-CPU | 10.17 ms | 9.41 ms | -7.5% |

Total work dropped ~10% (less syscall traffic), IPC improved, and absolute cache misses fell 2.85%. The per-1K-instr rate ticked up because we removed a lot of cache-friendly syscall instructions from the denominator — the remaining mix is slightly more miss-dense but __getsockopt no longer dominates the on-CPU profile.

On-CPU top-7 — before/after

| Function | Pre-fix flat % | Post-fix flat % |
|---|---|---|
| handle_tcp_frame | 26.70% | 25.00% |
| __libc_recv (cum) | 29.90% | 35.71% |
| __libc_send (cum) | 25.03% | 24.40% |
| EpollDispatch::wait_with_timeout | 13.63% | 16.07% |
| __getsockopt | 5.70% | — (gone from top-25) |
| process_guest_frame | 5.84% | 4.46% |
| drain_to_guest | 6.54% | 6.25% |

HashMap cache-miss hypothesis — verdict

At 3.92 cache-misses / 1K instructions (post-fix, well below the 10/1K threshold), the flow-table HashMap does not appear to be a dominant cache-pressure source for tcp_bulk_throughput_1mb. IPC of 0.786 says we're still mildly memory-bound, but it's not localised to the data structure. Hypothesis not confirmed by data on this workload. Worth re-investigating under different workloads (many concurrent flows, different per-entry sizes) but not blocking this PR.

Profiles archived locally:

  • pre-fix: /tmp/p63-bench-{cpu,offcpu,pmu}.{pb.gz,txt}
  • post-fix: /tmp/p63-fixed-{cpu,offcpu,pmu}.{pb.gz,txt}
