perf(slirp): handle_tcp_frame followups — extract SYN, FxHash, cold paths #80
Merged
handle_tcp_frame was 595 lines, with the SYN handler taking ~265 of them. Profiling shows the function at ~25% flat CPU on tcp_bulk_throughput_1mb but with no callee attribution — the SYN setup path is folded into the hot data path's profile, masking which arm is the actual bottleneck. Extract the outbound-SYN handler into its own method. Pure refactor, no behavior change, no API surface change. The dispatcher remains a single if-tail-call so the data-path body is down to ~330 lines and the profiler can attribute SYN setup separately when running flow-creation benches.
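As a rough sketch of the extraction shape (Slirp and TcpSegment below are placeholder names, not the crate's actual types), the dispatcher keeps a single if-tail-call into the extracted method:

```rust
// Placeholder types for illustration only; the real SLIRP state and parsed
// TCP segment structures in this crate are richer than this.
struct TcpSegment {
    syn: bool,
    ack: bool,
}

struct Slirp;

impl Slirp {
    fn handle_tcp_frame(&mut self, seg: &TcpSegment) {
        // Single if-tail-call dispatcher: the outbound-SYN arm exits via
        // `return`, leaving the rest of the body as the pure data path that
        // the profiler attributes to handle_tcp_frame.
        if seg.syn && !seg.ack {
            return self.handle_tcp_syn_outbound(seg);
        }
        // ~330 lines of Established + payload + ACK handling continue here.
    }

    // The extracted ~265-line outbound-SYN setup (flow creation, host socket
    // connect, SYN-ACK reply) now lives in its own method so flow-creation
    // benches attribute it separately.
    fn handle_tcp_syn_outbound(&mut self, _seg: &TcpSegment) {}
}
```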
The std HashMap uses SipHash for DoS resistance against attacker-influenced keys. The SLIRP flow table is keyed by (guest_src_port, dst_ip, dst_port) — values the guest itself chooses on its own networking — and lives behind the same trust boundary as the guest kernel. SipHash buys nothing here. rustc-hash's FxHash is a fast non-cryptographic hash designed for short integer-tuple keys. Same API as std HashMap (Entry-API, Default), drop-in replacement. Expected win: a few ns per probe, multiplied by every flow_table lookup on the data path (one per incoming TCP/UDP/ICMP frame). Production change scoped to the two SLIRP maps: dns_cache stays on std HashMap because its keys are attacker-influenced (raw DNS query bytes) and DoS resistance matters there.
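A minimal sketch of the map swap, assuming a tuple flow key and a placeholder Flow value type (the real field and type names will differ); rustc_hash::FxHashMap keeps the Entry API and Default, so call sites stay untouched:

```rust
use std::net::Ipv4Addr;

use rustc_hash::FxHashMap; // non-cryptographic FxHash, tuned for short integer keys

// Assumed key layout per the description: (guest_src_port, dst_ip, dst_port).
type FlowKey = (u16, Ipv4Addr, u16);

struct Flow; // placeholder for the per-flow state

struct FlowTable {
    // Previously std::collections::HashMap<FlowKey, Flow>; only the hasher changes.
    flows: FxHashMap<FlowKey, Flow>,
}

impl FlowTable {
    fn new() -> Self {
        Self {
            flows: FxHashMap::default(),
        }
    }

    fn lookup_or_create(&mut self, key: FlowKey) -> &mut Flow {
        // Entry API is identical to the std map, so this is a drop-in swap.
        self.flows.entry(key).or_insert_with(|| Flow)
    }
}
```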
Profiling shows handle_tcp_frame at ~25% flat CPU on tcp_bulk_throughput_1mb, with the hot path being Established + payload + ACK and the FIN/RST/error-close arms firing on a tiny fraction of frames. Hint the compiler so it keeps cold arms out of the hot inline window:

- handle_tcp_syn_outbound: marked #[cold]; SYN setup is one syscall per connection vs ~10k frames/s on bulk.
- LastAck → Closed, FIN-from-guest, RST-from-guest, ACK-driven read failure, write-to-host failure: tagged via cold_branch() so the optimiser spills these arms.

cold_branch() is a #[cold] #[inline(never)] no-op stub. std::hint::cold_path() is still unstable on rustc 1.93, so the stub is the portable equivalent: the call site is treated as a rare branch and the surrounding code is laid out with the cold arm spilled away.
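A sketch of the cold_branch() stub and one possible call-site shape; the TcpState enum and on_segment function below are illustrative placeholders, not the crate's real state machine:

```rust
// Portable stand-in for the still-unstable std::hint::cold_path(): calling a
// #[cold] #[inline(never)] no-op marks the enclosing branch as rarely taken,
// so the optimiser lays out that arm away from the hot path.
#[cold]
#[inline(never)]
fn cold_branch() {}

enum TcpState {
    Established,
    LastAck,
    Closed,
}

fn on_segment(state: &mut TcpState, fin_from_guest: bool) {
    if fin_from_guest {
        cold_branch(); // FIN from guest: fires on a tiny fraction of frames
        *state = TcpState::LastAck;
        return;
    }
    if matches!(state, TcpState::LastAck) {
        cold_branch(); // LastAck -> Closed: once per flow teardown
        *state = TcpState::Closed;
        return;
    }
    // Hot arm: Established + payload + ACK handling continues here.
}
```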
After the SYN extraction shrunk handle_tcp_frame by ~265 lines, the optimiser started inlining it into handle_ipv4_frame, hiding the data-path cost behind the upstream caller's frame. Mark it #[inline(never)] so it stays a separate frame in profiles. The cost is ~2-3 ns of call overhead per incoming TCP frame, which at ~10k frames/s on bulk throughput is ~0.003% of CPU — well below noise.
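The shape of that change is just the attribute placement; the free functions below are a stand-in, not the crate's actual call graph:

```rust
struct Frame; // placeholder for a parsed IPv4/TCP frame

// Keep a dedicated stack frame for profile attribution; costs roughly one
// call per incoming TCP frame, which the description pegs at ~0.003% of CPU.
#[inline(never)]
fn handle_tcp_frame(_frame: &Frame) {
    // Established + payload + ACK data path.
}

fn handle_ipv4_frame(frame: &Frame) {
    // Without #[inline(never)] above, the now ~330-line callee would be
    // inlined here and its cost would show up under handle_ipv4_frame.
    handle_tcp_frame(frame);
}
```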
Summary
Three follow-ups identified by profiling on PR #79's tcp_bulk_throughput_1mb flame graph. Each targets a specific bottleneck in handle_tcp_frame and the surrounding NAT path. Stacked on #79. Land that first, then this rebases onto post-#79 main as a small standalone PR.
What changed
1a — Extract handle_tcp_syn_outbound from handle_tcp_frame

handle_tcp_frame was 595 lines with the outbound-SYN handler taking ~265 of them. Profiler attribution was muddy because the SYN setup was folded into the hot data path's cost. Extract the SYN handler as a #[cold] method and reduce handle_tcp_frame to ~330 lines. Pure refactor, no behavior change. #[inline(never)] added to handle_tcp_frame itself so the now-smaller function isn't inlined into handle_ipv4_frame (which was masking attribution again post-extraction).

1b — Switch flow_table + token_to_key to FxHash
std::HashMap defaults to SipHash for DoS resistance against attacker-influenced keys. The SLIRP flow table is keyed by guest-side ports the guest itself chooses on its own networking — same trust boundary as the guest kernel. SipHash buys nothing. rustc_hash::FxHashMap: same API (Entry, Default), drop-in replacement. The dns_cache stays on std::HashMap because its keys are attacker-influenced (raw DNS query bytes).

1c — Mark cold paths
Added #[cold] on handle_tcp_syn_outbound and cold_branch() calls (a #[cold] #[inline(never)] no-op stub) at the rarely-taken arms: LastAck → Closed, FIN-from-guest, RST-from-guest, ACK-driven read failure, and write-to-host failure. std::hint::cold_path() is still unstable on rustc 1.93, so the stub is the portable equivalent.

Bench evidence — divan vs current main
scripts/bench-compare.sh --baseline origin/main --skip-vm:

- flow_table_insert_remove/1000
- flow_table_insert_remove/100
- nat_translate_outbound_hot_path
- tcp_rx_latency_one_packet
- poll_with_n_flows/1000
- process_syn_during_pending_connects/100

Several other benches showed run-to-run noise of ±5–20% in the same direction across multiple measurements — bench machine variance, not signal.
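For orientation on what the flow_table_insert_remove entries exercise, a hypothetical divan bench of roughly that shape might look like the following; the key layout, counts, addresses, and file location are assumptions, not the repository's actual bench code:

```rust
// Hypothetical benches/flow_table.rs sketch (bench target with harness = false);
// illustrative only.
use std::net::Ipv4Addr;

use divan::black_box;
use rustc_hash::FxHashMap;

fn main() {
    divan::main();
}

// Insert and then remove `n` flows keyed by (guest_src_port, dst_ip, dst_port).
#[divan::bench(args = [100, 1000])]
fn flow_table_insert_remove(n: usize) {
    let mut table: FxHashMap<(u16, Ipv4Addr, u16), u64> = FxHashMap::default();
    let dst = Ipv4Addr::new(10, 0, 2, 2); // example address only
    for port in 0..n as u16 {
        table.insert(black_box((port, dst, 443)), u64::from(port));
    }
    for port in 0..n as u16 {
        table.remove(&black_box((port, dst, 443)));
    }
}
```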
PMU comparison — Phase 6.3 baseline (post-cache-fix) vs followups
Same 30s capture, same workload, single bench process, perf-agent eBPF:
Modest, consistent direction across all PMU dimensions: more work per cycle, slightly fewer cache misses, less total work. The HashMap → FxHash swap accounts for most of the instruction-count drop.
Profile attribution restored
Pre-1a: handle_tcp_frame at 25% flat — the function was a 595-line monolith and the profiler couldn't attribute internal cost.

Post-followups: handle_tcp_frame at 27.59% flat (the +2.6 pp delta is the #[inline(never)] cost), but the SYN setup path now shows up as handle_tcp_syn_outbound in flow-creation benches, separately attributed.

Test plan
- cargo fmt --all -- --check
- cargo clippy --workspace --all-targets --all-features -- -D warnings
- RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features
- cargo test --test network_baseline -- --test-threads=1 — 24/24
- cargo test --test network_baseline --features bench-helpers -- --test-threads=1 — 26/26 (3/3 stable runs)
- scripts/bench-compare.sh --baseline origin/main --skip-vm — results above

Follow-ups (not in this PR)
- docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md (§ "passt head-to-head methodology").
- Further work on handle_tcp_frame if profiling against passt shows we still trail meaningfully on CRR.