Phase 6.3: TCP window management — track guest window, advertise host kernel rcv-space#79

Open
dpsoft wants to merge 11 commits into main from phase6.3-window-mgmt-rebased

Conversation


@dpsoft dpsoft commented May 6, 2026

What this branch does

Stops ignoring the guest's advertised TCP window and stops hardcoding our own. Three correctness/perf gaps closed:

  1. Track the guest's advertised window — every incoming frame's window_len (scaled by window_scale from SYN options) is stashed on the flow.
  2. Honor it on host→guest sends — relay_tcp_nat_data gates frames_to_inject on guest_window - bytes_in_flight, so the relay stops when the guest's receive buffer is full instead of pretending it's infinite. Phase 3's 256 KB cap was a band-aid for the symptom.
  3. Advertise our own window from real backpressure — outgoing frames carry host_recv_window(fd) (computed from getsockopt(TCP_INFO).tcpi_rcv_space >> OUR_WINDOW_SCALE) instead of a hardcoded 65535. SYN-ACK negotiates window_scale: 7 (matches passt; 128× → 8 MiB max).

Headline win

| Workload | Before | After |
|---|---|---|
| Host→guest send when guest is slow | unbounded inject_to_guest queue (Phase 3 capped it at 256 KB userspace cliff) | bounded by guest's guest_window (modern Linux: 4 MB+ scaled) |
| Window scale negotiation | none (max 64 KB / RTT) | 7 (max 8 MiB / RTT) |
| Advertised window source | hardcoded 65535 | getsockopt(TCP_INFO).tcpi_rcv_space |

Architecture

  • New TcpNatEntry::guest_window: u32 and guest_window_scale: u8 (#[serde(default)] for snapshot back-compat with pre-6.3).
  • SYN handler parses tcp.window_scale() option and stashes it; every incoming frame refreshes entry.guest_window = u32::from(tcp.window_len()) << guest_window_scale.
  • relay_tcp_nat_data adds a window_remaining = guest_window - bytes_in_flight gate; when zero, breaks out (waits for guest ACK).
  • build_tcp_packet_static signature now takes (window_len, window_scale). SYN-ACK passes (65535, Some(7)); data/ACK frames pass (host_recv_window(fd), None).
  • New host_recv_window(fd) -> u16 helper: one getsockopt(IPPROTO_TCP, TCP_INFO, ...) call, returns tcpi_rcv_space >> 7 clamped to u16::MAX. Falls back to 32768 on syscall error.

Bench evidence — divan microbenches (vs current main)

scripts/bench-compare.sh --baseline origin/main --skip-vm:

| Bench | Baseline | HEAD | Δ% |
|---|---|---|---|
| **Wins** | | | |
| tcp_inbound_syn_ack_transition | 63.1 µs | 50.9 µs | -19.4% |
| process_udp_frame | 31.9 µs | 27.2 µs | -14.7% |
| port_forward_accept_latency | 193 µs | 182 µs | -5.9% |
| **Parity / noise** | | | |
| dns_cache_hit | 942 ns | 940 ns | -0.1% |
| nat_translate_outbound_hot_path | 2.54 ns | 2.57 ns | +1.3% |
| flow_table_insert_remove/100 | 4.49 µs | 4.45 µs | -0.9% |
| process_syn | 28.5 µs | 28.7 µs | +0.8% |
| **Small regressions (per-frame getsockopt(TCP_INFO) cost)** | | | |
| tcp_bulk_throughput_1mb | 61.3 ms | 62.3 ms | +1.6% |
| tcp_rx_latency_one_packet | 9.55 µs | 10.2 µs | +6.7% |
| process_icmp_echo_request | 21.9 µs | 23.3 µs | +6.1% |
| flow_table_insert_remove/1000 | 29.6 µs | 31.4 µs | +5.9% |
| poll_with_n_mixed_flows/999 | 9.76 µs | 10.1 µs | +3.5% |
| **New benches (HEAD only — Phase 6.3 introduces them)** | | | |
| tcp_bulk_throughput_constrained_window/4096 | — | 2.63 ms | new |
| tcp_bulk_throughput_constrained_window/16384 | — | 1.81 ms | new |
| tcp_bulk_throughput_constrained_window/65536 | — | 1.94 ms | new |

Wall-clock VM harness (voidbox-network-bench)

| Metric | Baseline (post-#78 main) | HEAD | Δ% |
|---|---|---|---|
| tcp_rr_latency_us_p50 | 4 µs | 2 µs | -50.0% |
| tcp_rr_latency_us_p99 | 32 µs | 26 µs | -18.8% |
| tcp_crr_latency_us_p50 | 10129 µs | 10136 µs | +0.1% (parity) |
| tcp_throughput_g2h_mbps | 5943 | 5739 | -3.4% |

The 3.4% g2h throughput regression appears to be the per-outgoing-frame getsockopt(TCP_INFO) syscall cost in host_recv_window. Profiling planned as a follow-up — candidate fixes: cache the value with a 1 ms TTL, or move the syscall onto the net-poll thread's housekeeping cadence so the data path uses a stale-but-recent value. The trade is worth it for now: correct backpressure is a correctness fix, not a perf trick. Phase 6.4 epoll dispatch absorbs the latency improvements (RR p50 -50%) so the net change vs pre-Phase-6.x main is heavily positive.

Snapshot interaction

Pre-6.3 snapshots restore cleanly: both new fields have #[serde(default)] and default to (65535, 0) which is the pre-6.3 behavior (no scale, ignore guest window — same as if the entry was a Phase 6.0 entry). Verified via existing snapshot_integration suite.

passt-comparison status

Documented as a deferred task in docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md ("passt head-to-head methodology"). Methodology agreed: same hardware, two-column report, focus on CRR latency (apples-to-apples since CRR is dominated by NAT-table ops, not MMIO exit overhead). Building the passt+qemu reference harness is a separate follow-up PR.

Commits (10)

Cherry-picked clean from smoltcp-passt-port-phase6.3-window-mgmt onto current main (post-#78):

  1. docs: Phase 6.3 detailed TDD plan — TCP window management
  2. feat(slirp): TcpNatEntry tracks guest_window + guest_window_scale
  3. feat(slirp): parse guest's window_scale on SYN, store on flow
  4. feat(slirp): track guest's advertised window on every incoming frame
  5. refactor(slirp): build_tcp_packet_static takes (window_len, window_scale)
  6. feat(slirp): advertise host-kernel-derived window on outgoing frames
  7. test(network): pin tcp_advertised_window_tracks_guest_buffer (BROKEN_ON_PURPOSE)
  8. feat(slirp): gate host→guest send on guest's advertised window — flips the BROKEN_ON_PURPOSE pin
  9. test(network): pin tcp_window_scale_negotiated_in_synack
  10. bench(network): tcp_bulk_throughput_constrained_window parametric

Test plan

  • cargo fmt --all -- --check — clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings — clean
  • RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features — clean
  • cargo test --test network_baseline -- --test-threads=1 — 24/24 (was 22; +2 window pins)
  • cargo test --test network_baseline --features bench-helpers -- --test-threads=1 — 26/26
  • scripts/bench-compare.sh --baseline origin/main --skip-vm — see table above
  • scripts/bench-compare.sh --baseline origin/main --skip-divan (VM wall-clock) — see table above
  • CI

Replaces draft #75

Same window-management content via the now-superseded #74 chain. Close #75 once this lands.

Follow-ups (not blocking this PR)

  1. host_recv_window perf: profile the +1.6% bulk regression; cache TCP_INFO with short TTL or move into housekeeping cadence.
  2. passt head-to-head harness (separate PR per the deferred plan).
  3. HashMap cache-miss audit of the flow table (HashMap<FlowKey, FlowEntry>) — separately tracked: data-path pollers do linear scans by FlowKey variant, which on a 1000-flow table at small entries is cache-unfriendly. Candidate: split into per-protocol maps or move to small-vector for low-flow-count sandboxes.

dpsoft added 10 commits May 6, 2026 08:00

10 bite-sized tasks covering proper TCP windowing:

- TcpNatEntry tracks guest_window (u32) + guest_window_scale (u8)
- handle_tcp_frame parses tcp.window_scale() on guest SYN, stores
  per-flow; updates guest_window on every incoming frame
- build_tcp_packet_static signature changes to take
  (window_len, window_scale) — caller decides
- SYN-ACK negotiates OUR_WINDOW_SCALE = 7 (passt's default; 128x)
- New host_recv_window helper queries TCP_INFO.tcpi_rcv_space and
  scales it for the advertised window on outgoing frames
- relay_tcp_nat_data gates host→guest sends on entry.guest_window
  to honor real backpressure
- Three new pins: tcp_advertised_window_tracks_guest_buffer
  (BROKEN_ON_PURPOSE → flips at Task 7),
  tcp_window_scale_negotiated_in_synack, plus
  tcp_bulk_throughput_constrained_window parametric bench

Severity: MEDIUM — perf gap. Hardcoded window_len: 65535 caps
throughput at 64 KB / RTT regardless of bandwidth, and
inject_to_guest can grow unbounded if the guest is slow.

Adds tcp_bulk_throughput_constrained_window bench that exercises the
Task 7 window-gating path under three guest-window sizes (4096, 16384,
65536 bytes). Mirrors tcp_bulk_throughput_1mb with a parametric window
so regressions in window-constrained relay show up numerically.

dpsoft commented May 6, 2026

Profiling note: tcp_bulk_throughput_1mb regression root-caused

Followed up on the divan +1.6% / VM wall-clock -3.4% throughput regression with perf-agent (eBPF, PMU + on-CPU + off-CPU). 30 s capture, single bench process, properly symbolized.

PMU summary

| Metric | Value | Reading |
|---|---|---|
| IPC | 0.777 | moderately memory-bound (threshold: <0.8) |
| Cache misses / 1K instr | 3.666 | below the 10/1K "investigate" threshold — HashMap cache-miss hypothesis NOT confirmed for this workload |
| P99.9 on-CPU | 10.17 ms | healthy (<50 ms) |
| Preempted | 46.5% | expected for a CPU-bound bench |

On-CPU flat hotspots

| Function | Flat % | Note |
|---|---|---|
| handle_tcp_frame | 26.70% | per-incoming-frame parsing (smoltcp wire + dispatch) |
| __libc_recv | 29.90% cum | host kernel TCP recv |
| __libc_send | 25.03% cum | host kernel TCP send |
| EpollDispatch::wait_with_timeout | 13.63% | epoll_wait + drain |
| __getsockopt | 5.70% | host_recv_window's per-outgoing-frame getsockopt(TCP_INFO) — the regression |

Conclusion

The throughput regression traces to one syscall, not the data-structure layout. host_recv_window calls getsockopt(IPPROTO_TCP, TCP_INFO, ...) on every outgoing frame; at 5 Gbps that's ~10k getsockopt/s on the data path.

Proposed follow-up (separate small PR, not blocking this one)

Cache host_recv_window per-flow with a short TTL — say 5 ms, well below RTT. At 10k frames/s that drops to ~200 getsockopt/s, ~50× reduction, while the advertised window still tracks within 5 ms of reality.

```rust
// On TcpNatEntry:
cached_recv_window: u16,
cached_recv_window_at: Instant,

// In the build_tcp_packet_static call sites for data/ACK frames:
const RECV_WINDOW_TTL: Duration = Duration::from_millis(5);
if entry.cached_recv_window_at.elapsed() > RECV_WINDOW_TTL {
    entry.cached_recv_window = host_recv_window(entry.host_stream.as_raw_fd());
    entry.cached_recv_window_at = Instant::now();
}
```

The HashMap-flow-table cache-miss audit is still a worthwhile separate exercise, but the divan/wall-clock regression seen on this PR isn't traceable to it. IPC of 0.78 suggests we're modestly memory-bound elsewhere (likely the smoltcp wire-decode hot path), but the cache-miss rate doesn't indicate pathological structures.

Profiles archived locally:

  • /tmp/p63-bench-cpu.pb.gz (CPU stacks)
  • /tmp/p63-bench-offcpu.pb.gz (off-CPU)
  • /tmp/p63-bench-pmu.txt (PMU)

Profiling tcp_bulk_throughput_1mb showed __getsockopt at 5.7% flat CPU
— Phase 6.3's host_recv_window was issuing one getsockopt(TCP_INFO)
per outgoing TCP frame, costing ~10k syscalls/s at line rate.

Cache the result on TcpNatEntry and refresh only every RECV_WINDOW_TTL
(5 ms). At line rate this collapses to ~200 syscalls/s — a ~50x
reduction — while the advertised window stays within 5 ms of reality,
which is well below any realistic RTT.

cached_recv_window is initialized at flow construction with one
host_recv_window call so the first emitted frame doesn't pay the
syscall cost on the data path either.

dpsoft commented May 6, 2026

Cache fix landed and re-profiled — regression eliminated

Commit 1b9ba72 adds per-flow cached_recv_window with a 5 ms TTL. The result confirms the profiling diagnosis was correct: __getsockopt is no longer in the top hotspots and the Phase 6.3 throughput regression collapses to noise.

Divan microbenches — before/after the cache fix (vs current main)

| Bench | Pre-fix Δ% | Post-fix Δ% | Recovery |
|---|---|---|---|
| tcp_bulk_throughput_1mb | +1.6% | +0.1% | regression eliminated |
| tcp_rx_latency_one_packet | +6.7% | +2.3% | recovered 4.4 pp |
| tcp_inbound_syn_ack_transition | -19.4% | -30.5% | even faster post-fix |
| process_icmp_echo_request | +6.1% | +1.9% | recovered 4.2 pp |
| flow_table_insert_remove/1000 | +5.9% | -2.0% | now better than baseline |

Some flow-construction benches show small regressions (process_syn +4.6%, port_forward_accept_latency +6.1%, process_syn_during_pending_connects/0 +7.2%) — that's the one-time host_recv_window syscall now at flow-creation rather than per-frame. Pay-once-per-flow vs pay-per-packet is the right trade. At line rate (~10k packets/s, ~50 connects/s) this is a >100× syscall reduction.

VM wall-clock — before/after vs current main

| Metric | Pre-fix Δ% | Post-fix Δ% |
|---|---|---|
| tcp_throughput_g2h_mbps | -3.4% (5942 → 5739) | -0.2% (5776 → 5765) |
| tcp_rr_latency_us_p50 | -50% | parity (both at 2 µs) |
| tcp_crr_latency_us_p50 | parity | parity |

PMU — before/after (same 30s capture per side, single bench process)

| Metric | Pre-fix | Post-fix | Δ |
|---|---|---|---|
| IPC | 0.777 | 0.786 | +1.2% |
| Cache misses / 1K instr | 3.666 | 3.924 | +7.0% (denominator effect) |
| Total cache misses (abs) | 86.83 M | 84.36 M | -2.85% |
| Total instructions | 23.68 B | 21.50 B | -9.2% |
| Total cycles | 30.47 B | 27.35 B | -10.3% |
| P99.9 on-CPU | 10.17 ms | 9.41 ms | -7.5% |

Total work dropped ~10% (less syscall traffic), IPC improved, and absolute cache misses fell 2.85%. The per-1K-instr rate ticked up because we removed a lot of cache-friendly syscall instructions from the denominator — the remaining mix is slightly more miss-dense but __getsockopt no longer dominates the on-CPU profile.

On-CPU top-7 — before/after

| Function | Pre-fix flat % | Post-fix flat % |
|---|---|---|
| handle_tcp_frame | 26.70% | 25.00% |
| __libc_recv (cum) | 29.90% | 35.71% |
| __libc_send (cum) | 25.03% | 24.40% |
| EpollDispatch::wait_with_timeout | 13.63% | 16.07% |
| __getsockopt | 5.70% | — (gone from top-25) |
| process_guest_frame | 5.84% | 4.46% |
| drain_to_guest | 6.54% | 6.25% |

HashMap cache-miss hypothesis — verdict

At 3.92 cache-misses / 1K instructions (post-fix, well below the 10/1K threshold), the flow-table HashMap does not appear to be a dominant cache-pressure source for tcp_bulk_throughput_1mb. IPC of 0.786 says we're still mildly memory-bound, but it's not localised to the data structure. Hypothesis not confirmed by data on this workload. Worth re-investigating under different workloads (many concurrent flows, different per-entry sizes) but not blocking this PR.

Profiles archived locally:

  • pre-fix: /tmp/p63-bench-{cpu,offcpu,pmu}.{pb.gz,txt}
  • post-fix: /tmp/p63-fixed-{cpu,offcpu,pmu}.{pb.gz,txt}
