perf(slirp): handle_tcp_frame followups — extract SYN, FxHash, cold paths #80
Merged
handle_tcp_frame was 595 lines, with the SYN handler taking ~265 of them. Profiling shows the function at ~25% flat CPU on tcp_bulk_throughput_1mb but with no callee attribution — the SYN setup path is folded into the hot data path's profile, masking which arm is the actual bottleneck. Extract the outbound-SYN handler into its own method. Pure refactor, no behavior change, no API surface change. The dispatcher remains a single if-tail-call so the data-path body is down to ~330 lines and the profiler can attribute SYN setup separately when running flow-creation benches.
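As a rough sketch of the extraction shape (Slirp and TcpSegment below are placeholder names, not the crate's actual types), the dispatcher keeps a single if-tail-call into the extracted method:

```rust
// Placeholder types for illustration only; the real SLIRP state and parsed
// TCP segment structures in this crate are richer than this.
struct TcpSegment {
    syn: bool,
    ack: bool,
}

struct Slirp;

impl Slirp {
    fn handle_tcp_frame(&mut self, seg: &TcpSegment) {
        // Single if-tail-call dispatcher: the outbound-SYN arm exits via
        // `return`, leaving the rest of the body as the pure data path that
        // the profiler attributes to handle_tcp_frame.
        if seg.syn && !seg.ack {
            return self.handle_tcp_syn_outbound(seg);
        }
        // ~330 lines of Established + payload + ACK handling continue here.
    }

    // The extracted ~265-line outbound-SYN setup (flow creation, host socket
    // connect, SYN-ACK reply) now lives in its own method so flow-creation
    // benches attribute it separately.
    fn handle_tcp_syn_outbound(&mut self, _seg: &TcpSegment) {}
}
```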
The std HashMap uses SipHash for DoS resistance against attacker-influenced keys. The SLIRP flow table is keyed by (guest_src_port, dst_ip, dst_port) — values the guest itself chooses on its own networking — and lives behind the same trust boundary as the guest kernel. SipHash buys nothing here. rustc-hash's FxHash is a fast non-cryptographic hash designed for short integer-tuple keys. Same API as std HashMap (Entry-API, Default), drop-in replacement. Expected win: a few ns per probe, multiplied by every flow_table lookup on the data path (one per incoming TCP/UDP/ICMP frame). Production change scoped to the two SLIRP maps: dns_cache stays on std HashMap because its keys are attacker-influenced (raw DNS query bytes) and DoS resistance matters there.
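A minimal sketch of the map swap, assuming a tuple flow key and a placeholder Flow value type (the real field and type names will differ); rustc_hash::FxHashMap keeps the Entry API and Default, so call sites stay untouched:

```rust
use std::net::Ipv4Addr;

use rustc_hash::FxHashMap; // non-cryptographic FxHash, tuned for short integer keys

// Assumed key layout per the description: (guest_src_port, dst_ip, dst_port).
type FlowKey = (u16, Ipv4Addr, u16);

struct Flow; // placeholder for the per-flow state

struct FlowTable {
    // Previously std::collections::HashMap<FlowKey, Flow>; only the hasher changes.
    flows: FxHashMap<FlowKey, Flow>,
}

impl FlowTable {
    fn new() -> Self {
        Self {
            flows: FxHashMap::default(),
        }
    }

    fn lookup_or_create(&mut self, key: FlowKey) -> &mut Flow {
        // Entry API is identical to the std map, so this is a drop-in swap.
        self.flows.entry(key).or_insert_with(|| Flow)
    }
}
```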
Profiling shows handle_tcp_frame at ~25% flat CPU on tcp_bulk_throughput_1mb, with the hot path being Established + payload + ACK and the FIN/RST/error-close arms firing on a tiny fraction of frames. Hint the compiler so it keeps cold arms out of the hot inline window:

- handle_tcp_syn_outbound: marked #[cold]; SYN setup is one syscall per connection vs ~10k frames/s on bulk.
- LastAck → Closed, FIN-from-guest, RST-from-guest, ACK-driven read failure, write-to-host failure: tagged via cold_branch() so the optimiser spills these arms.

cold_branch() is a #[cold] #[inline(never)] no-op stub. std::hint::cold_path() is still unstable on rustc 1.93, so the stub is the portable equivalent: the call site is treated as a rare branch and the surrounding code is laid out with the cold arm spilled away.
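A sketch of the cold_branch() stub and one possible call-site shape; the TcpState enum and on_segment function below are illustrative placeholders, not the crate's real state machine:

```rust
// Portable stand-in for the still-unstable std::hint::cold_path(): calling a
// #[cold] #[inline(never)] no-op marks the enclosing branch as rarely taken,
// so the optimiser lays out that arm away from the hot path.
#[cold]
#[inline(never)]
fn cold_branch() {}

enum TcpState {
    Established,
    LastAck,
    Closed,
}

fn on_segment(state: &mut TcpState, fin_from_guest: bool) {
    if fin_from_guest {
        cold_branch(); // FIN from guest: fires on a tiny fraction of frames
        *state = TcpState::LastAck;
        return;
    }
    if matches!(state, TcpState::LastAck) {
        cold_branch(); // LastAck -> Closed: once per flow teardown
        *state = TcpState::Closed;
        return;
    }
    // Hot arm: Established + payload + ACK handling continues here.
}
```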
After the SYN extraction shrunk handle_tcp_frame by ~265 lines, the optimiser started inlining it into handle_ipv4_frame, hiding the data-path cost behind the upstream caller's frame. Mark it #[inline(never)] so it stays a separate frame in profiles. The cost is ~2-3 ns of call overhead per incoming TCP frame, which at ~10k frames/s on bulk throughput is ~0.003% of CPU — well below noise.
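The shape of that change is just the attribute placement; the free functions below are a stand-in, not the crate's actual call graph:

```rust
struct Frame; // placeholder for a parsed IPv4/TCP frame

// Keep a dedicated stack frame for profile attribution; costs roughly one
// call per incoming TCP frame, which the description pegs at ~0.003% of CPU.
#[inline(never)]
fn handle_tcp_frame(_frame: &Frame) {
    // Established + payload + ACK data path.
}

fn handle_ipv4_frame(frame: &Frame) {
    // Without #[inline(never)] above, the now ~330-line callee would be
    // inlined here and its cost would show up under handle_ipv4_frame.
    handle_tcp_frame(frame);
}
```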
Summary
Three follow-ups identified by profiling on PR #79's tcp_bulk_throughput_1mb flame graph. Each targets a specific bottleneck in handle_tcp_frame and the surrounding NAT path. Stacked on #79. Land that first, then this rebases onto post-#79 main as a small standalone PR.
What changed
1a — Extract handle_tcp_syn_outbound from handle_tcp_frame

handle_tcp_frame was 595 lines with the outbound-SYN handler taking ~265 of them. Profiler attribution was muddy because the SYN setup was folded into the hot data path's cost. Extract the SYN handler as a #[cold] method and reduce handle_tcp_frame to ~330 lines. Pure refactor, no behavior change. #[inline(never)] added to handle_tcp_frame itself so the now-smaller function isn't inlined into handle_ipv4_frame (which was masking attribution again post-extraction).

1b — Switch flow_table + token_to_key to FxHash
std::HashMap defaults to SipHash for DoS resistance against attacker-influenced keys. The SLIRP flow table is keyed by guest-side ports the guest itself chooses on its own networking — same trust boundary as the guest kernel. SipHash buys nothing. rustc_hash::FxHashMap: same API (Entry, Default), drop-in replacement. The dns_cache stays on std::HashMap because its keys are attacker-influenced (raw DNS query bytes).

1c — Mark cold paths
Added #[cold] on handle_tcp_syn_outbound and cold_branch() calls (a #[cold] #[inline(never)] no-op stub) at the rarely-taken arms: LastAck → Closed, FIN-from-guest, RST-from-guest, ACK-driven read failure, and write-to-host failure. std::hint::cold_path() is still unstable on rustc 1.93, so the stub is the portable equivalent.

Bench evidence — divan vs current main
scripts/bench-compare.sh --baseline origin/main --skip-vm:

- flow_table_insert_remove/1000
- flow_table_insert_remove/100
- nat_translate_outbound_hot_path
- tcp_rx_latency_one_packet
- poll_with_n_flows/1000
- process_syn_during_pending_connects/100

Several other benches showed run-to-run noise of ±5–20% in the same direction across multiple measurements — bench machine variance, not signal.
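For orientation on what the flow_table_insert_remove entries exercise, a hypothetical divan bench of roughly that shape might look like the following; the key layout, counts, addresses, and file location are assumptions, not the repository's actual bench code:

```rust
// Hypothetical benches/flow_table.rs sketch (bench target with harness = false);
// illustrative only.
use std::net::Ipv4Addr;

use divan::black_box;
use rustc_hash::FxHashMap;

fn main() {
    divan::main();
}

// Insert and then remove `n` flows keyed by (guest_src_port, dst_ip, dst_port).
#[divan::bench(args = [100, 1000])]
fn flow_table_insert_remove(n: usize) {
    let mut table: FxHashMap<(u16, Ipv4Addr, u16), u64> = FxHashMap::default();
    let dst = Ipv4Addr::new(10, 0, 2, 2); // example address only
    for port in 0..n as u16 {
        table.insert(black_box((port, dst, 443)), u64::from(port));
    }
    for port in 0..n as u16 {
        table.remove(&black_box((port, dst, 443)));
    }
}
```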
PMU comparison — Phase 6.3 baseline (post-cache-fix) vs followups
Same 30s capture, same workload, single bench process, perf-agent eBPF:
Modest, consistent direction across all PMU dimensions: more work per cycle, slightly fewer cache misses, less total work. The HashMap → FxHash swap accounts for most of the instruction-count drop.
Profile attribution restored
Pre-1a: handle_tcp_frame at 25% flat — the function was a 595-line monolith and the profiler couldn't attribute internal cost.

Post-followups: handle_tcp_frame at 27.59% flat (the +2.6 pp delta is the #[inline(never)] cost), but the SYN setup path now shows up as handle_tcp_syn_outbound in flow-creation benches, separately attributed.

Test plan
- cargo fmt --all -- --check
- cargo clippy --workspace --all-targets --all-features -- -D warnings
- RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features
- cargo test --test network_baseline -- --test-threads=1 — 24/24
- cargo test --test network_baseline --features bench-helpers -- --test-threads=1 — 26/26 (3/3 stable runs)
- scripts/bench-compare.sh --baseline origin/main --skip-vm — results above

Follow-ups (not in this PR)
- docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md (§ "passt head-to-head methodology").
- Further work on handle_tcp_frame if profiling against passt shows we still trail meaningfully on CRR.