
SLIRP refactor — lift passt patterns into our smoltcp stack (Phases 0–5 + 5.5b)#68

Merged: dpsoft merged 92 commits into main from smoltcp-passt-port-phase0 on May 6, 2026

Conversation

Contributor

@dpsoft dpsoft commented Apr 29, 2026

Status: Ready for review. Phases 0–5 + 5.5b of docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md complete on this branch. Phase 6 (TCP half-close, async connect, window mgmt, event-driven polling) is scoped as an overview plan in docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.md and shipped as separate PRs.

Supersedes the design in docs/superpowers/plans/2026-04-12-network-backend-abstraction.md — instead of adopting passt as an external backend, this branch lifts its design patterns into our existing all-Rust smoltcp-based SLIRP stack. Keeps the all-Rust differentiator (full observability via tracing, debugger works, no opaque process boundary) while picking up passt's correctness wins (ICMP, full UDP, sequence-mirroring TCP, port-forwarding, stateless NAT).

What's in this branch

Phase 0 — Baseline + NetworkBackend trait

  • tests/network_baseline.rs — unit-level pins driving SlirpBackend directly (no VM), including BROKEN_ON_PURPOSE assertions that flip in later phases.
  • benches/network.rs — divan microbenches for SLIRP hot paths.
  • voidbox-network-bench — wall-clock harness with metric names matching passt's published table.
  • NetworkBackend trait extracted; SlirpStack → SlirpBackend rename (role-based name).
  • Per-poll Vec<Vec<u8>> allocation dropped (drain_to_guest(&mut Vec) out-param).

Phase 1 — ICMP echo

  • SOCK_DGRAM IPPROTO_ICMP host socket (unprivileged). Sysctl-restricted hosts get warn-once + drop fallback. BROKEN_ON_PURPOSE pin flipped: icmp_echo_returns_reply positively asserts the round-trip.

Phase 2 — Generalized UDP

  • Per-flow connected-socket NAT for any UDP destination (was port-53-only). DNS retains its dedicated cached fast-path. 60 s idle reap. BROKEN_ON_PURPOSE pin flipped.

Phase 3 — TCP relay rewrite

  • passt-style sequence mirroring: recv(MSG_PEEK) for host→guest, no userspace buffer; kernel-buffer consume is ACK-driven. guest→host now uses TCP backpressure (don't-ACK-on-EAGAIN) instead of a 256 KB userspace cap. to_guest/to_host/to_host_pending_ack/MAX_TO_HOST_BUFFER all deleted. BROKEN_ON_PURPOSE pin flipped: tcp_writes_more_than_256kb_succeed pushes 1 MB through the relay.

Phase 4 — Unified flow table

  • FlowKey / FlowEntry enums collapse the parallel tcp_nat, udp_flows, icmp_echo tables into a single HashMap<FlowKey, FlowEntry>. Each protocol's relay loop iterates the unified table with a matches! filter. Establishes the substrate for Phase 5+ stateless NAT and Phase 6 epoll dispatch.

Phase 5 — Stateless NAT translation refactor

  • New src/network/nat.rs with translate_outbound(rules, dst, port, gateway) — pure function, no state, no allocation in the hot path.
  • Rules carries gateway_loopback, deny_cidrs, and port_forwards. The TCP and UDP guest-frame handlers consume it through the same code path.
  • deny_list field dropped from SlirpBackend (subsumed by nat::Rules.deny_cidrs).

Phase 5.5b — Inbound port-forwarding

  • SlirpBackend::with_security spawns one host listener thread per TCP port-forward rule. On accept, the listener pushes through an mpsc channel; process_pending_inbound_accepts (called at the top of drain_to_guest) seeds a SynSent flow-table entry and synthesizes the inbound SYN to the guest. New TcpNatState::SynSent handles the resulting SYN-ACK transition to Established.
  • Wall-clock latency bench port_forward_accept_latency (~50 ms ceiling, bounded by PORT_FORWARD_POLL_INTERVAL).

Bench harness + comparison

  • cargo bench --bench network --features bench-helpers — divan microbenches.
  • voidbox-network-bench --output report.json [--bulk-mb N] — wall-clock harness, JSON report.
  • scripts/bench-compare.sh --baseline <ref> [--skip-vm] [--skip-divan] — markdown comparison of HEAD vs any baseline ref. Falls back gracefully when the baseline ref pre-dates bench-helpers (Phase 5.5b).

Reviewer-driven fixes (this PR)

Bundled correctness improvements from the Copilot review on this PR:

  • C1.2 — icmp_echo_returns_reply now probes socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once and asserts the round-trip; it skips only on EPERM/EACCES (sysctl restriction).
  • C1.3 — measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now continue on guest non-zero exit so failed iterations don't skew the mean.
  • C2.1 / C2.5 / C2.6 — three accept-thread leak patterns in voidbox-network-bench bounded with accept_with_deadline so spawned echo/drain threads exit cleanly when the guest never connects.
  • C2.2 — measure_icmp_rr_latency dropped its guest-side ping invocation (the test/production images intentionally omit /bin/ping); the metric is now an explicit None with a tracing warning until a host-driven measurement path lands in Phase 6.
  • C2.4 — benches/network.rs migrated from the deprecated SlirpBackend::poll() (per-call Vec<Vec<u8>> alloc) to drain_to_guest(&mut Vec<Vec<u8>>) with a reused buffer; #![allow(deprecated)] removed.

Bonus fixes (caught while debugging)

  • setuid busybox in test image (commit 77dfc67): chmod u+s on the user-packed cpio dropped PID 1's euid to uid 1000, breaking the CAP_NET_ADMIN-requiring setup_network. Reverted.
  • vsock backend snapshot mismatch (commit 0d0ab20): voidbox-startup-bench's capture_snapshot was missing .enable_snapshots(true), so cold-boot used Vhost and warm-restore (always Userspace) failed. Fixed; continue-on-error: true removed from the bench step (was masking this on main).

Non-negotiable invariants (preserved through every phase)

  1. All-Rust — no opaque process in the data path.
  2. Full observability via tracing — every state transition emits a structured event.
  3. cargo test-driveable — every behavior change exercised by tests/network_baseline.rs without a VM.
  4. Standard Rust tooling — LSP, clippy, profiler keep working.

CI changes

  • voidbox-network-bench --iterations 3 step in startup-bench.yml, no continue-on-error.
  • continue-on-error: true removed from the voidbox-startup-bench wall-clock step.
  • Divan network microbenches added to the regression gate (Phase 0 task 0B.4).

Validation

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test --workspace --all-features
  • cargo test --test network_baseline
  • cargo bench --bench network --features bench-helpers
  • cargo test --test snapshot_integration -- --ignored
  • cargo test --test e2e_mount -- --ignored
  • voidbox-network-bench --iterations 3 — ✅ TCP g2h ~1885 Mbps, RR p50 2 µs, CRR p50 ~10 ms
  • voidbox-startup-bench --iters 3 --breakdown — ✅ cold p50 ~236 ms, warm p50 ~76 ms

Bench comparison

scripts/bench-compare.sh --baseline <Phase-0B.1-tip> --skip-vm (the earliest commit that has the bench file):

| Bench | Phase 0 baseline | Phase 5.5b HEAD | Δ |
| --- | --- | --- | --- |
| process_syn | 23.6 µs | 21.7 µs | -8% |
| process_arp_request | 5.06 µs | 4.97 µs | -1.9% |
| poll_idle | 69.8 ns | 170 ns | +100 ns (sub-µs absolute; flow_table iteration + accept-channel drain at top of drain_to_guest) |

New benches (e.g. tcp_bulk_throughput_1mb, port_forward_accept_latency, nat_translate_outbound_hot_path, process_udp_frame, process_icmp_echo_request, …) have no Phase 0 numerical baseline — new capability; see the report file.

The work's wins are primarily capability (ICMP/UDP/inbound port-forwarding work where they didn't before; the 256 KB TCP cliff is gone) and architecture (unified flow table, stateless NAT). Raw guest→host throughput is roughly flat (~1885 Mbps), which is expected — passt-style refactoring buys correctness and observability, not headline bytes/sec.

What's NOT in this branch (deferred)

  • Phase 6 — docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.md — overview of TCP half-close, async connect, window mgmt, event-driven polling. Per-subsystem TDD plans deferred to when each kicks off.
  • IPv6 dual-stack — Phase 7, separately scoped.
  • Pre-existing conformance failures — 3 tests fail (they use /tmp paths rejected by the fs_guard hardening from "openat2-based symlink-escape guard for privileged FS RPCs (R-B2.1 + R-B2.2)" #56); confirmed pre-existing on main, not regressions from this branch.
  • passt head-to-head benchmark — needs passt+qemu reference env; separate task.

Test plan for review

  • Run cargo test --test network_baseline — pins pass, the three flipped BROKEN_ON_PURPOSE pins are now positive assertions.
  • Run cargo bench --bench network --features bench-helpers tcp_bulk_throughput_1mb — should produce a measurement.
  • Run voidbox-network-bench --iterations 3 --bulk-mb 10 — both throughput numbers (constrained vs unconstrained) without errors.
  • Run voidbox-startup-bench --iters 3 --breakdown — warm phase exits 0.
  • Run scripts/bench-compare.sh --baseline 41c8382 --skip-vm — validates the comparison harness.
  • Skim docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md and 2026-04-30-smoltcp-passt-port-phase6.md for the design rationale and follow-up scope.

dpsoft added 30 commits April 27, 2026 17:52
Adds two planning docs under docs/superpowers/plans/:

- 2026-04-27-smoltcp-passt-port.md (spec)
  Supersedes the 2026-04-12 network-backend-abstraction design.
  Replaces "add passt as opt-in backend" with "lift passt's design
  patterns into our smoltcp stack" — keeps observability, all-Rust
  path, single binary, cross-platform parity. Lists required skills
  for execution (rust-style, rustdoc, rust-analyzer-ssr,
  superpowers TDD/verification, repo verify/profile). Maps the work
  into 5+1 phases with per-phase plan-doc placeholders.

- 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan)
  25 bite-sized TDD tasks: correctness baseline pins, divan
  microbenches, wall-clock e2e harness, NetworkBackend trait
  extraction, SlirpStack → SmoltcpBackend rename. Includes three
  BROKEN_ON_PURPOSE assertions that flip in later phases.
Add two baseline tests for the smoltcp DNS proxy:
- dns_query_resolves: sends a query for example.com, polls ≤20×100ms,
  asserts reply XID matches.
- dns_cache_keys_by_question_not_xid: warms cache with xid=1, then
  queries with xid=2 and asserts the stack rewrites the reply XID.

Both tests skip gracefully (eprintln + early return) when the upstream
resolver is unreachable, making them safe in offline CI.

Also adds QNAME_EXAMPLE_COM const and two module-scope helpers:
build_dns_query (builds a correct UDP DNS frame with proper payload_len)
and parse_dns_reply_xid.

SLIRP_DNS_IP added to the existing module-scope slirp import.
Implement measure_tcp_throughput_g2h: binds a host-side TCP listener,
boots a VM, execs dd|nc in the guest, drains to EOF on the host, and
computes Mbps from bytes_received / elapsed.  h2g left None with a TODO.
Implements measure_rr_latency and measure_crr_latency in
voidbox-network-bench, reusing the single shared VM booted for
throughput measurements.

RR: guest pipes N bytes over one persistent nc connection; host times
each read+write pair (first sample discarded to absorb connect jitter).
CRR: guest runs N independent nc invocations; host times each full
accept+read+write+close cycle. Both use the existing percentile()
helper (dead_code attribute removed). Latency measurements always run
regardless of --no-throughput.
Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp"
is the underlying library. Role-based naming keeps the public type
surface stable across library swaps and matches the symmetry of
future TapBackend / VhostNetBackend siblings.

Module file src/network/slirp.rs keeps its name (already aligned with
the new type, matches src/devices/virtio_net.rs convention).
The actual polling logic now lives in drain_to_guest, which writes
directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh
allocation on every tick. poll becomes a #[deprecated] shim:

    #[deprecated(note = "use drain_to_guest")]
    pub fn poll(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        self.drain_to_guest(&mut out);
        out
    }

Existing call sites (virtio_net.rs, tests/network_baseline.rs,
benches/network.rs) are annotated with #[allow(deprecated)] and a
TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks,
after which the allow attributes can be removed.
Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to
Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call
in get_rx_frames with drain_to_guest into a reused rx_scratch buffer.

Update both VMM cold-boot and snapshot-restore construction sites to
coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline
tests pass; fmt and clippy clean.
Type rename only — the slirp.rs module file keeps its name.
SlirpBackend reflects the user-mode-NAT role rather than the
underlying smoltcp library, keeping naming symmetric with future
TapBackend / VhostNetBackend siblings.
Introduces the types and helper needed for ICMP echo NAT (Phase 1):

- IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table.
- IcmpEchoEntry {sock, guest_id, last_activity}: per-request state.
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW).
- icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend,
  initialized to HashMap::new() in with_security() (the canonical ctor;
  new() and Default both delegate through it).

No behavior change — handle_ipv4_frame is untouched, the map stays
empty. Dead-code allowances are scoped to the new items and will be
removed once tasks 1.2/1.3 wire them in.
Add the SynSent state to TcpNatState for host-initiated (port-forward)
connections.  When handle_tcp_frame sees SYN+ACK on a SynSent entry it
sends an ACK to the guest, advances our_seq, records guest_ack, and
transitions to Established — completing the inbound 3-way handshake.

Add #[cfg(test)] helpers on SlirpBackend (insert_synthetic_synsent_entry,
tcp_flow_state, injected_plain_ack_count) and a unit test
tcp_inbound_syn_ack_completes_handshake that seeds a SynSent entry,
feeds a guest SYN-ACK, and asserts (a) state → Established and (b)
one plain ACK queued for injection.  The full E2E contract is deferred
to task 5.5b.5 (tcp_port_forward_inbound in tests/network_baseline.rs).

build_tcp_packet_static signature: (src_ip, dst_ip, src_port, dst_port,
seq, ack, control, payload).  The inbound ACK uses src=SLIRP_GATEWAY_IP,
dst=SLIRP_GUEST_IP, src_port=key.dst_port (high port), dst_port=
key.guest_src_port.
Adds a divan bench for the SynSent → Established state-machine path
introduced in 5.5b.1.  The bench seeds one synthetic SynSent entry,
feeds a SYN-ACK frame to process_guest_frame, and measures the
transition cost (~42 µs median, same order as process_syn).

Approach (option a): widen the three #[cfg(test)] helpers on
SlirpBackend to #[cfg(any(test, feature = "bench-helpers"))].
insert_synthetic_synsent_entry is promoted to `pub` within the gated
impl block so the bench binary (a separate compilation unit) can call
it.  The feature is never enabled in production builds.  All helpers
in benches/network.rs that are only needed under bench-helpers are
gated with #[cfg(feature = "bench-helpers")] to keep the default bench
binary warning-free.

Copilot AI left a comment


Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.



Comment thread benches/network.rs
Comment on lines +673 to +680
let mut table: HashMap<usize, u32> = HashMap::with_capacity(n);
// Insert phase
for (i, _key) in keys.iter().enumerate() {
table.insert(i, i as u32);
}
// Remove phase
for i in 0..n {
divan::black_box(table.remove(&i));

Copilot AI Apr 30, 2026


flow_table_insert_remove pre-builds a keys vector but never uses it; the bench actually measures HashMap<usize, u32> insert/remove with integer keys. That doesn’t reflect the stated intent (unified flow-table key hashing/dispatch) and makes the result hard to interpret. Either remove the unused keys construction or use a key type representative of the real flow table (e.g., the actual FlowKey or a synthetic struct with similar hashing cost).

Suggested change:

    // before
    let mut table: HashMap<usize, u32> = HashMap::with_capacity(n);
    // Insert phase
    for (i, _key) in keys.iter().enumerate() {
        table.insert(i, i as u32);
    }
    // Remove phase
    for i in 0..n {
        divan::black_box(table.remove(&i));

    // after
    let mut table: HashMap<smoltcp::wire::IpAddress, u32> = HashMap::with_capacity(n);
    // Insert phase
    for (i, key) in keys.iter().copied().enumerate() {
        table.insert(key, i as u32);
    }
    // Remove phase
    for key in keys.iter() {
        divan::black_box(table.remove(key));

Comment thread src/devices/virtio_net.rs
Comment on lines +663 to 670
// Drain backend frames into the reused scratch buffer.
self.rx_scratch.clear();
{
let mut backend = self.slirp.lock().unwrap();
backend.drain_to_guest(&mut self.rx_scratch);
}
let frames = std::mem::take(&mut self.rx_scratch);


Copilot AI Apr 30, 2026


get_rx_frames clears rx_scratch, fills it, then mem::takes it. That replaces the field with a fresh empty Vec (capacity 0), so the next poll re-allocates and the scratch buffer is not actually reused as the comment claims. Consider draining from rx_scratch into frames/result (e.g., drain(..)), or swapping while preserving capacity, so the allocation is retained across calls.

Comment thread tests/network_baseline.rs
Comment on lines +647 to +650
/// Unlike `build_udp_frame` (which carries a pre-existing off-by-one in
/// the `payload_len` argument passed to `udp_repr.emit`), this helper
/// passes only the DNS payload length so the UDP `len` field is correct
/// and the stack's smoltcp parser accepts the frame.

Copilot AI Apr 30, 2026


The comment claims build_udp_frame has an off-by-one bug in the payload_len passed to udp_repr.emit, but build_udp_frame currently passes payload.len() (which is the correct smoltcp API expectation). This mismatch is likely to confuse future refactors—either fix the comment or adjust build_udp_frame if it’s actually wrong.

Suggested change:

    // before
    /// Unlike `build_udp_frame` (which carries a pre-existing off-by-one in
    /// the `payload_len` argument passed to `udp_repr.emit`), this helper
    /// passes only the DNS payload length so the UDP `len` field is correct
    /// and the stack's smoltcp parser accepts the frame.

    // after
    /// This helper passes the DNS payload length expected by `udp_repr.emit`,
    /// so the UDP `len` field is correct and the stack's smoltcp parser
    /// accepts the frame.

Comment thread tests/network_baseline.rs
Comment on lines +315 to +322
unsafe {
libc::setsockopt(
listener.as_raw_fd(),
libc::SOL_SOCKET,
libc::SO_RCVBUF,
&val as *const libc::c_int as *const libc::c_void,
std::mem::size_of::<libc::c_int>() as libc::socklen_t,
);

Copilot AI Apr 30, 2026


The setsockopt(SO_RCVBUF) call’s return value is ignored. If this fails (e.g., unsupported or restricted), the test won’t actually exercise the intended backpressure scenario but will still pass/fail based on unrelated behavior. Please check the return code (and/or last_os_error) and either assert or skip the test when the option can’t be applied.

Suggested change:

    // before
    unsafe {
        libc::setsockopt(
            listener.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_RCVBUF,
            &val as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        );

    // after
    let rc = unsafe {
        libc::setsockopt(
            listener.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_RCVBUF,
            &val as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        )
    };
    if rc != 0 {
        eprintln!(
            "skipping tcp_writes_more_than_256kb_succeed: failed to set SO_RCVBUF: {}",
            std::io::Error::last_os_error()
        );
        return;
Comment thread src/bin/voidbox-network-bench/main.rs Outdated
dpsoft added 6 commits April 30, 2026 11:19
Adds the host listener thread infrastructure for Phase 5.5b inbound
TCP port-forwarding. No listener is spawned yet (that is task 5.5b.4),
so there is no behavior change in this commit.

New items in src/network/slirp.rs:
- `InboundAccept` struct (pub(crate)) — channel payload from listener to net-poll
- `port_forward_listeners: Vec<JoinHandle<()>>` field — holds spawn handles
- `port_forward_shutdown: Arc<AtomicBool>` field — graceful-shutdown signal
- `pending_inbound_accepts: mpsc::Receiver<InboundAccept>` field — accept channel rx
- `accept_sender: mpsc::Sender<InboundAccept>` field — keeps channel open + test helper
- `process_pending_inbound_accepts()` — drains channel, inserts SynSent entries, queues SYNs
- `run_port_forward_listener()` — module-scope thread fn, nonblocking accept loop
- `spawn_port_forward_listeners()` — pub(crate) factory, not called until 5.5b.4
- `Drop for SlirpBackend` — sets shutdown flag, joins all listener handles
- `push_inbound_accept()` on test-only impl — injects accepts for unit tests
- `drain_to_guest` now calls `process_pending_inbound_accepts()` as step 0

TDD: test `process_pending_inbound_accepts_seeds_synsent_and_queues_syn`
written and watched fail before implementation; passes in GREEN.
17/17 network_baseline integration tests unchanged.
Call `spawn_port_forward_listeners` in `SlirpBackend::with_security`
so host listener threads are actually spawned when `nat.port_forwards`
is non-empty. The function now also returns the `Sender` end of the
accept channel so `accept_sender` (needed to keep the channel open in
tests) is sourced from the same channel pair as `pending_inbound_accepts`.

Remove the `#[allow(dead_code)]` attrs from both functions.

Unit test `with_security_spawns_listener_per_tcp_port_forward` confirms
zero threads for empty rules and one thread per TCP rule.
Adds tcp_port_forward_inbound_connect_succeeds to tests/network_baseline.rs.

The test builds a SlirpBackend with one port-forward rule (18080→8080),
drives drain_to_guest in a loop while a host thread connects to
127.0.0.1:18080, synthesizes a guest listener by responding with SYN-ACK
to the SYN the stack emits, and asserts three contract points:

1. host TcpStream::connect succeeds — listener thread (5.5b.3) is alive.
2. drain_to_guest emits a synthesized SYN to GUEST_PORT — InboundAccept
   channel + process_pending_inbound_accepts + synthesize_inbound_syn
   (5.5b.2/5.5b.3/5.5b.4) all fired.
3. drain_to_guest emits the completing ACK after our SYN-ACK — the
   SynSent → Established arm (5.5b.1) fired.

Also adds parse_tcp_to_guest_full helper (superset of parse_tcp_to_guest
that also returns src/dst ports, needed to identify the ephemeral
high_port in the synthesized SYN).

No VM, no --ignored flag, completes in ~0.1 s.
…aseline

Times the full inbound port-forward path: host TcpStream::connect →
listener thread accept() → mpsc channel → process_pending_inbound_accepts
→ synthesize_inbound_syn → drain_to_guest output. Bounded above by
PORT_FORWARD_POLL_INTERVAL (50ms).

Regressions in the inbound state machine or listener poll loop now
surface numerically against this baseline.

Copilot AI left a comment


Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.



Comment thread tests/network_baseline.rs
Comment on lines +374 to +399
// Push 1 MB in 1 KB chunks. Drain after every batch so the
// host's read thread can drain the kernel buffer and ACKs flow
// back to the guest. The new TCP-backpressure path means some
// chunks won't be ACK'd immediately; we re-send those (TCP-style
// retransmit) until they go through.
const TOTAL: usize = 1024 * 1024;
const CHUNK: usize = 1024;
let chunk = vec![b'x'; CHUNK];
let mut seq = 1001u32;
let mut acked_seq = 1001u32;
let mut saw_close = false;
let deadline = std::time::Instant::now() + std::time::Duration::from_secs(10);

while bytes_received.load(Ordering::Relaxed) < TOTAL && std::time::Instant::now() < deadline {
// Send a chunk; advance our seq.
let _ = stack.process_guest_frame(&build_tcp_frame(
SLIRP_GATEWAY_IP,
GUEST_EPHEMERAL_PORT,
host_port,
seq,
our_seq + 1,
TcpControl::Psh,
&chunk,
));
seq = seq.wrapping_add(CHUNK as u32);


Copilot AI Apr 30, 2026


In tcp_writes_more_than_256kb_succeed, the loop unconditionally increments seq after each send, but it never actually retransmits when the stack stops ACKing (backpressure). This contradicts the test’s own comment (“we re-send those”) and can make the test either flaky or fail to validate the intended behavior. Consider only advancing seq once the corresponding ACK is observed (or explicitly resend the same seq/payload when acked_seq hasn’t advanced).

Comment thread tests/network_baseline.rs Outdated
exit_code = ?output.exit_code,
stderr = output.stderr_str(),
"g2h iteration non-zero exit; skipping"
);

Copilot AI Apr 30, 2026


In measure_tcp_throughput_g2h, if the guest nc command exits non-zero you log a warning but still proceed to read the drain-thread result and may include that iteration in the mean. This can silently skew the reported throughput downward (or mask real failures). Consider continue-ing on !output.success() so only successful iterations contribute to the metric (and similarly for the bulk throughput path).

Suggested change (insert a `continue;` after the closing of the warning macro):

    // before
    );

    // after
    );
    continue;

Compares HEAD against an arbitrary baseline ref using the two bench
harnesses: divan microbenches (cargo bench --bench network) and the
VM-backed wall-clock harness (voidbox-network-bench). Emits markdown
with absolute numbers + percent deltas, suitable for PR descriptions.

Replaces the scattered /tmp/baseline-network-phase*.json files with a
reproducible single entry-point.

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.



Comment on lines +622 to +630
// The host accepts CRR_SAMPLES_PER_ITER connections, times each cycle,
// and sends results back over a channel.
let (crr_tx, crr_rx) = mpsc::channel::<Vec<Duration>>();
let sample_count = CRR_SAMPLES_PER_ITER;

std::thread::spawn(move || {
let samples = crr_echo_server(&listener, sample_count);
let _ = crr_tx.send(samples);
});

Copilot AI Apr 30, 2026


Same accept-thread leak pattern in CRR: the spawned thread blocks on listener.accept() and can persist past the iteration if the guest loop errors or no connections arrive, leaking resources and potentially stalling later runs. Use a bounded accept loop with timeout/cancellation, or accept in the main async task via a non-blocking listener.

Comment thread src/bin/voidbox-network-bench/main.rs Outdated
Comment thread scripts/bench-compare.sh

parse_divan() {
local file="$1"
LC_ALL=en_US.UTF-8 awk -F'│' '

Copilot AI Apr 30, 2026


LC_ALL=en_US.UTF-8 is not guaranteed to exist on minimal distros/CI images, which can make parse_divan fail even when awk is present. Consider using a more universally available UTF-8 locale like C.UTF-8 (with a fallback to C) or avoid locale-dependence by normalizing the µs unit in the input before parsing.

Suggested change:

    # before
    LC_ALL=en_US.UTF-8 awk -F'│' '

    # after
    local awk_locale="C"
    if command -v locale >/dev/null 2>&1; then
        if locale -a 2>/dev/null | grep -Fxq 'C.UTF-8'; then
            awk_locale='C.UTF-8'
        elif locale -a 2>/dev/null | grep -Fxq 'C.utf8'; then
            awk_locale='C.utf8'
        elif locale -a 2>/dev/null | grep -Fxq 'en_US.UTF-8'; then
            awk_locale='en_US.UTF-8'
        fi
    fi
    LC_ALL="$awk_locale" awk -F'│' '

Comment thread benches/network.rs Outdated
Comment on lines +372 to +376
let (drain_tx, drain_rx) = mpsc::channel::<(u64, Duration)>();
std::thread::spawn(move || {
let drain_result = drain_one_connection(&listener);
let _ = drain_tx.send(drain_result);
});

Copilot AI Apr 30, 2026


Same accept-thread leak pattern as measure_tcp_throughput_g2h: the spawned thread blocks on listener.accept() and can outlive the iteration on exec failure or recv_timeout, accumulating stuck threads and open FDs. Use a bounded accept with timeout/cancellation so the harness remains stable under failure conditions.

Comment on lines +504 to +509
let (echo_tx, echo_rx) = mpsc::channel::<Vec<Duration>>();

std::thread::spawn(move || {
let samples = rr_echo_server(&listener, RR_SAMPLES_PER_ITER);
let _ = echo_tx.send(samples);
});

Copilot AI Apr 30, 2026


Same issue here: the RR echo server thread blocks on listener.accept() and isn’t joined/cancelled. If the guest nc fails to connect, the thread can hang indefinitely holding the listener, which can leak threads across iterations. Prefer non-blocking accept with a deadline, or a cancellable blocking task.

dpsoft added 6 commits April 30, 2026 18:47
Reviewer (Copilot) flagged that the previous skip-on-no-reply path
masked real ICMP regressions on hosts where unprivileged ICMP works.
Probe socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once: skip only on
EPERM/EACCES (sysctl net.ipv4.ping_group_range forbids it), assert
otherwise.
C1.3: measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now
`continue` on guest nc non-zero exit so failed iterations don't skew
the reported mean.

C2.2: measure_icmp_rr_latency dropped its guest-side ping path. The
guest images intentionally omit /bin/ping (busybox-static lacks
CONFIG_FEATURE_PING_TYPE_DGRAM and SOCK_RAW would need root); the
function now returns None with a warn explaining the gap. Proper
host-driven measurement is tracked as a follow-up.
Aligns benches with the production RX path. The deprecated poll()
allocated a fresh Vec<Vec<u8>> per call; drain_to_guest appends to a
caller-owned buffer that's reused across iterations. CI now gates on
the same allocator pattern production code uses, removing avoidable
allocation overhead from the measurements.

Drops the file-level #![allow(deprecated)].

When the baseline ref pre-dates a464fc1 (Phase 5.5b.1) it doesn't have
the `bench-helpers` cargo feature, and cargo errored out, dropping the
entire baseline divan section to —.

Detect the "does not have feature" / "unknown feature" stderr signal
and retry without --features bench-helpers. Benches that exist at both
refs get real Δ%; the bench-helpers-gated ones (synthesize_inbound_syn,
tcp_inbound_syn_ack_transition) naturally remain — for baseline.

Unlocks bench-compare.sh --baseline origin/main against the full branch
history.

drain_one_connection, rr_echo_server, crr_echo_server now accept with
a deadline derived from LATENCY_RECV_TIMEOUT + slack. Previously each
spawned thread blocked forever on listener.accept() if the guest nc
never connected (exec error, network failure), holding the listener
FD across all subsequent iterations and burning thread/FD slots.

When the accept deadline lapses, the thread exits cleanly, the
listener drops, and the next iteration starts with a clean slate.

Addresses Copilot review C2.1, C2.5, C2.6.
… + epoll

Scopes the four architectural follow-ups surfaced in the
smoltcp-passt-port PR review:

- 6.1 (high)  : TCP half-close (FinWait*/CloseWait/LastAck) — silent
                data loss on shutdown(SHUT_WR) today.
- 6.2 (med-h) : Async outbound connect — vCPU thread blocked up to
                3 s on slow destinations.
- 6.3 (med)   : Window management + scaling — guest window ignored;
                advertised window hardcoded 65535.
- 6.4 (med-l) : Event-driven RX polling — replace 5 ms timer with
                epoll_wait.

Locks the observability + cross-platform + snapshot invariants from
the top-level spec. Per-subsystem TDD task lists deferred to dedicated
plans (-phase6.1.md..-phase6.4.md) written before each kicks off.
@dpsoft dpsoft changed the title WIP: SLIRP refactor — lift passt patterns into our smoltcp stack (Phases 0–3) SLIRP refactor — lift passt patterns into our smoltcp stack (Phases 0–5 + 5.5b) Apr 30, 2026
@dpsoft dpsoft marked this pull request as ready for review April 30, 2026 22:21
@dpsoft dpsoft merged commit f090868 into main May 6, 2026
22 checks passed
@dpsoft dpsoft deleted the smoltcp-passt-port-phase0 branch May 6, 2026 00:00
dpsoft added a commit that referenced this pull request May 6, 2026
…ghput vs main) (#69)

* docs(plans): smoltcp passt-pattern port — spec + Phase 0 plan

Adds two planning docs under docs/superpowers/plans/:

- 2026-04-27-smoltcp-passt-port.md (spec)
  Supersedes the 2026-04-12 network-backend-abstraction design.
  Replaces "add passt as opt-in backend" with "lift passt's design
  patterns into our smoltcp stack" — keeps observability, all-Rust
  path, single binary, cross-platform parity. Lists required skills
  for execution (rust-style, rustdoc, rust-analyzer-ssr,
  superpowers TDD/verification, repo verify/profile). Maps the work
  into 5+1 phases with per-phase plan-doc placeholders.

- 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan)
  25 bite-sized TDD tasks: correctness baseline pins, divan
  microbenches, wall-clock e2e harness, NetworkBackend trait
  extraction, SlirpStack → SmoltcpBackend rename. Includes three
  BROKEN_ON_PURPOSE assertions that flip in later phases.

* test(network): scaffold network_baseline pins with frame helpers

* test(network): address review — restore reserved constants, alias IpAddress

* test(network): pin TCP handshake SYN-ACK emission

* test(network): pin TCP guest↔host data round-trip

* test(network): BROKEN_ON_PURPOSE pin — 256 KB to_host cliff

* test(network): hoist inline `use` statements to module scope

* test(network): pin TCP rate limit, concurrent cap, deny list

* test(network): pin ARP reply behavior for gateway and subnet

* test(network): pin DNS resolution and cache xid-rewrite

Add two baseline tests for the smoltcp DNS proxy:
- dns_query_resolves: sends a query for example.com, polls ≤20×100ms,
  asserts reply XID matches.
- dns_cache_keys_by_question_not_xid: warms cache with xid=1, then
  queries with xid=2 and asserts the stack rewrites the reply XID.

Both tests skip gracefully (eprintln + early return) when the upstream
resolver is unreachable, making them safe in offline CI.

Also adds QNAME_EXAMPLE_COM const and two module-scope helpers:
build_dns_query (builds a correct UDP DNS frame with proper payload_len)
and parse_dns_reply_xid.

SLIRP_DNS_IP added to the existing module-scope slirp import.

* test(network): fix build_udp_frame payload_len double-count

* test(network): BROKEN_ON_PURPOSE pin — UDP non-DNS dropped

* test(network): BROKEN_ON_PURPOSE pin — ICMP echo dropped

* bench(network): divan microbenches for SLIRP hot paths

* bench(network): parametric NAT-walk scaling at 1/100/1000 flows

* bench(network): DNS cache hit and miss paths

* ci(bench): include network microbenches in regression gate

* bench(network): voidbox-network-bench binary scaffold

* bench(network): TCP throughput via busybox nc

Implement measure_tcp_throughput_g2h: binds a host-side TCP listener,
boots a VM, execs dd|nc in the guest, drains to EOF on the host, and
computes Mbps from bytes_received / elapsed.  h2g left None with a TODO.

* bench(network): TCP RR/CRR latency p50/p99

Implements measure_rr_latency and measure_crr_latency in
voidbox-network-bench, reusing the single shared VM booted for
throughput measurements.

RR: guest pipes N bytes over one persistent nc connection; host times
each read+write pair (first sample discarded to absorb connect jitter).
CRR: guest runs N independent nc invocations; host times each full
accept+read+write+close cycle. Both use the existing percentile()
helper (dead_code attribute removed). Latency measurements always run
regardless of --no-throughput.

* bench(network): UDP DNS qps and JSON report output

* docs(plans): rename SmoltcpBackend → SlirpBackend in spec + Phase 0 plan

Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp"
is the underlying library. Role-based naming keeps the public type
surface stable across library swaps and matches the symmetry of
future TapBackend / VhostNetBackend siblings.

Module file src/network/slirp.rs keeps its name (already aligned with
the new type, matches src/devices/virtio_net.rs convention).

* feat(network): introduce NetworkBackend trait

* refactor(slirp): add drain_to_guest wrapper for trait fit

* refactor(slirp): move poll body into drain_to_guest, drop alloc

The actual polling logic now lives in drain_to_guest, which writes
directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh
allocation on every tick. poll becomes a #[deprecated] shim:

    #[deprecated(note = "use drain_to_guest")]
    pub fn poll(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        self.drain_to_guest(&mut out);
        out
    }

Existing call sites (virtio_net.rs, tests/network_baseline.rs,
benches/network.rs) are annotated with #[allow(deprecated)] and a
TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks,
after which the allow attributes can be removed.

* feat(slirp): impl NetworkBackend for SlirpStack

* refactor(virtio_net): hold dyn NetworkBackend, reuse rx buffer

Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to
Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call
in get_rx_frames with drain_to_guest into a reused rx_scratch buffer.

Update both VMM cold-boot and snapshot-restore construction sites to
coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline
tests pass; fmt and clippy clean.

* refactor(network): rename SlirpStack to SlirpBackend

Type rename only — the slirp.rs module file keeps its name.
SlirpBackend reflects the user-mode-NAT role rather than the
underlying smoltcp library, keeping naming symmetric with future
TapBackend / VhostNetBackend siblings.

* docs(plans): add Phase 1 plan (ICMP echo via SOCK_DGRAM IPPROTO_ICMP)

* feat(slirp): add IcmpEchoEntry + IPPROTO_ICMP socket helper

Introduces the types and helper needed for ICMP echo NAT (Phase 1):

- IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table.
- IcmpEchoEntry {sock, guest_id, last_activity}: per-request state.
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW).
- icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend,
  initialized to HashMap::new() in with_security() (the canonical ctor;
  new() and Default both delegate through it).

No behavior change — handle_ipv4_frame is untouched, the map stays
empty. Dead-code allowances are scoped to the new items and will be
removed once tasks 1.2/1.3 wire them in.

* refactor(slirp): hoist FromRawFd to module scope, drop redundant use libc

* feat(slirp): forward guest ICMP echo via SOCK_DGRAM IPPROTO_ICMP

* feat(slirp): relay ICMP echo replies back to guest

Add `relay_icmp_echo` which drains replies from each active ICMP echo
socket and injects Ethernet/IPv4/ICMP echo-reply frames back into the
guest.  Add `build_icmp_echo_reply_to_guest` which parses the raw ICMP
payload from the `SOCK_DGRAM IPPROTO_ICMP` socket, rewrites the ident
back to the guest's original value, and builds a complete wire frame.
Wire both into `drain_to_guest` immediately after `relay_tcp_nat_data`.

Drop the now-stale `#[allow(dead_code)]` on `IcmpEchoEntry::guest_id`
which is read by `relay_icmp_echo`.

* feat(slirp): warn-once + fallback when unprivileged ICMP forbidden

* test(network): flip ICMP pin — assert echo reply (was BROKEN_ON_PURPOSE)

Renames `icmp_echo_silently_dropped` → `icmp_echo_returns_reply`.
Targets 127.0.0.1 (loopback), polls 20 × 50ms for the reply, and
skips via eprintln! if sysctl forbids unprivileged ICMP — consistent
with how `dns_query_resolves` handles offline environments.

* bench(network): populate ICMP RR latency p50

Add measure_icmp_rr_latency() to voidbox-network-bench. Runs busybox
ping -c <count> -W 1 -i 0.05 8.8.8.8 inside the guest, parses
time=<ms> fields, converts to microseconds, and returns the p50 median.
Falls back to None + WARN on non-zero exit or empty parse (unreachable
network). Wired into main after measure_dns_qps; always runs regardless
of --no-throughput.

Also: allow unprivileged ICMP sockets in guest-agent (ping_group_range)
and add ping + setuid busybox to the test initramfs build.

* fix(scripts): revert setuid busybox in test image (Phase 1.6 regression)

Phase 1.6 (commit 8572122) added `chmod u+s "$OUT_DIR/bin/busybox"` to
let busybox `ping` open SOCK_RAW. The unintended consequence: cpio is
packed as the build user (uid 1000), so the kernel drops euid to 1000
on every execve from PID 1.

In `guest-agent::setup_network`, that meant `ip link up`, `ip addr
replace`, and `udhcpc` all silently failed with EPERM (no CAP_NET_ADMIN).
The static-fallback loop wasted 10s of boot time. Combined with the
vsock listener creation retry, total guest-agent startup exceeded the
host's 30s control-channel handshake deadline → ECONNRESET on every
connect → `voidbox-network-bench` and any test using `network(true)`
failed with `control_channel: deadline reached`.

Verification:
  - With setuid: bench fails consistently after 122 connect attempts.
  - Without setuid: bench produces clean numbers matching Phase 0
    baseline (TCP RR p50=2µs, CRR p50=10176µs, DNS qps=0.5).

The `ping` symlink is also dropped because busybox-static on Fedora is
not built with CONFIG_FEATURE_PING_TYPE_DGRAM, so unprivileged ICMP
is unavailable to the guest applet regardless. ICMP measurement in
voidbox-network-bench now reports `null` cleanly ("ping: not found")
until we route ICMP RR through SLIRP from the host instead.

The companion `ping_group_range` write in guest-agent stays — it's
harmless and supports future tooling that uses SOCK_DGRAM IPPROTO_ICMP.

* docs(plans): add Phase 2 plan (generalize UDP via per-flow connected sockets)

* feat(slirp): add UdpFlowEntry + per-flow connected socket helper

* feat(slirp): forward non-DNS UDP via per-flow connected sockets

Implement Task 2.2: route all guest UDP through handle_udp_frame, which
creates/reuses a per-flow connected UdpSocket keyed on (guest_src_port,
dst_ip, dst_port). DNS to SLIRP_DNS_IP still dispatches to the existing
handle_dns_frame. SLIRP_GATEWAY_IP (10.0.2.2) is translated to 127.0.0.1
before connect(), matching the TCP NAT path.

Drop #[allow(dead_code)] from UdpFlowEntry (item-level), open_udp_flow_socket,
and the udp_flows field — all now consumed. Add a field-targeted
#[allow(dead_code)] on last_activity (written here, read by Task 2.4).

Flip the udp_non_dns_silently_dropped BROKEN_ON_PURPOSE pin: datagrams
now reach the bound host socket, confirming the guest→host send path works.
All 14 baseline tests pass.

* ci(bench): add strict voidbox-network-bench step (no continue-on-error)

Catches setuid-busybox-style regressions that masquerade as
"environment flakes". Specifically: the bug fixed at 77dfc67
(Phase 1.6 added `chmod u+s busybox`, dropping PID 1 euid → no
CAP_NET_ADMIN → setup_network silently fails → 30s handshake
deadline expires → ECONNRESET) would have been visible in CI from
the start if the wall-clock harness step weren't behind
`continue-on-error: true`.

This new step runs `voidbox-network-bench --iterations 3` and
publishes the JSON metrics to the step summary. Failure of the
harness fails the workflow — no masking. The existing
`voidbox-startup-bench` step keeps `continue-on-error` for now
because its warm-restore phase has a separate, unfixed issue
(`control_channel[multiplex-establish]: deadline reached`
reproducible on main); flipping that to strict belongs in the
PR that fixes the warm-restore handshake.

Vhost-vsock probe still gates the run via `/dev/vhost-vsock`
existence check — runners without it skip cleanly with a warning,
since absence-of-device is an environment fact, not a regression.

* feat(slirp): relay UDP flow replies back to guest

Add `relay_udp_flows` and `build_udp_reply_to_guest` to `SlirpBackend`.
Each active UDP flow socket is polled non-blocking on every
`drain_to_guest` tick; replies are wrapped in an Ethernet/IPv4/UDP
frame (src=original-dst, dst=guest) and pushed into `inject_to_guest`.
Wire the call into `drain_to_guest` after `relay_icmp_echo`. Also add
`UdpRepr` to the smoltcp wire imports and drop the now-consumed
`#[allow(dead_code)]` on `UdpFlowEntry::last_activity`.

* feat(slirp): UDP flow idle reap (60s)

* test(network): full RTT for UDP pin (was BROKEN_ON_PURPOSE one-way)

Rename udp_non_dns_silently_dropped → udp_non_dns_round_trips and
rewrite the body to verify the complete guest→host→guest round-trip
via the per-flow connected-socket NAT landed in Tasks 2.1–2.4.

* bench(network): document DNS qps busybox-nc bottleneck (set null + WARN)

busybox nc -u -w1 blocks for the full 1-second timeout after stdin EOF
even when the cached SLIRP reply arrives in microseconds, capping
throughput at ~1 qps.  Tighter flags tried: -q0 exits before the reply
arrives (0 successes); /dev/udp/ is bash-only; timeout(1) is absent
from the test initramfs.

Report udp_dns_qps as null with a WARN pointing to the host-side UDP
socket path as the correct future fix.  Also removes the now-dead
DNS_QPS_WINDOW_SECS and SLIRP_DNS_ADDR constants.

* fix(startup-bench): require userspace vsock backend for snapshot capture

The bench's `capture_snapshot` was building a Sandbox without
`.enable_snapshots(true)`, so the backend selector at
`backend/kvm.rs:212` chose `VsockBackendType::Vhost` (lower per-RPC
latency for cold-only runs). The `create_auto_snapshot` call then
captured a vhost-shaped snapshot. But `from_snapshot` always restores
into `VsockBackendType::Userspace` — a path that knows how to
re-program our process-local vring state, while vhost's vring state
lives in the host kernel's `vhost-vsock` module and isn't part of
the snapshot at all. Result: the restored userspace device has
half-blank state, never accepts connections from the host, every
connect attempt is RST'd by the guest kernel, and the multiplex
handshake hits its 30s deadline.

Symptom across CI and local Fedora bare-metal:
  control_channel[multiplex-establish]: deadline reached after 123
    connect/handshake attempts
  Error: Guest("control_channel: deadline reached")

This same failure was visible in CI run 24983657846 on main (April 27,
before any of the SLIRP refactor work) — masked by `continue-on-error:
true` on the wall-clock harness step. Removing both: the fix and the
CI mask, so a regression of this exact shape would now fail the
workflow.

Verified locally: `voidbox-startup-bench --iters 3 --breakdown` now
exits 0 with `warm.total p50 = 82ms` (well within the CHANGELOG's
138ms target). Cold phase numbers unchanged (~245ms p50).

Refs:
- backend/kvm.rs:205-216 (the backend selector)
- CHANGELOG.md:74 ("Snapshot/Restore for KVM ... userspace virtio-vsock backend")
- AGENTS.md:1185 ("snapshot_integration ... Uses userspace virtio-vsock backend")

* docs(plans): add Phase 3 plan (TCP relay rewrite via MSG_PEEK + sequence mirroring)

* docs(plans): lock observability as a hard non-negotiable invariant

Per user pushback ("the improvements based on passt, maintain our
differentiator of full observability on the SLIRP implementation,
that is a must?") — yes, and it should be stated explicitly, not
assumed.

Spec gets a "Hard invariant — observability" section right after the
motivation. Phase 3 plan gets a "Non-negotiable invariants" block
that codifies what every task in the high-risk TCP-relay rewrite
must preserve:

- All-Rust, no opaque-process boundary; libc syscalls are fine.
- tracing instrumentation at every state transition (peek, ACK
  consume, close); new code must add new events for new state.
- cargo-test-driveable behavior via tests/network_baseline.rs.
- Standard Rust tooling (LSP, clippy, profiler) keeps working.

Future phases inherit the spec-level invariant; their per-phase
plans will reiterate the task-level acceptance criteria.

* refactor(slirp): add bytes_in_flight to TcpNatEntry (no behavior change)

Scaffolding for Task 3.3/3.4: tracks bytes sent to guest but not yet
ACK'd. Initialized to 0 at all construction sites; dead_code suppressed
until Task 3.3 consumes it.

* refactor(slirp): add recv_peek helper using libc::recv MSG_PEEK

* refactor(slirp): peek-based host→guest TCP relay (drops to_guest dependency)

Replace the read-into-to_guest + drain loop in relay_tcp_nat_data with
a MSG_PEEK-based send path. recv_peek() peeks the kernel's recv buffer
without consuming it; only the bytes past bytes_in_flight are chunked
into TCP segments and injected toward the guest. our_seq and
bytes_in_flight advance as segments are sent; the kernel buffer holds
the data until Task 3.4's ACK-driven read() consumes it.

Remove #[allow(dead_code)] from recv_peek and bytes_in_flight (both
now consumed). Add #[allow(dead_code)] to to_guest (still on struct;
Task 3.6 removes it). Drop unused Read import.

Tracing: trace! per peek+send cycle, debug! on host EOF, warn! on
recv_peek errors.

* refactor(slirp): ACK-driven consume from kernel socket

When the guest ACKs data we relayed via the peek-based host→guest path
(Task 3.3), read() those bytes from the host_stream to advance the
kernel's recv buffer read pointer.  Without this the kernel buffer fills
up and recv_peek keeps returning the same already-sent bytes.

Logic in handle_tcp_frame, Established branch:
- Extract segment_ack from tcp.ack_number().0 as u32.
- Compute last_sent_acked = our_seq.wrapping_sub(bytes_in_flight).
- acked_bytes = segment_ack.wrapping_sub(last_sent_acked) — wrapping
  arithmetic throughout because TCP sequence numbers wrap at 2^32.
- Guard: only consume when acked_bytes > 0 && <= bytes_in_flight,
  defending against duplicate/spurious/malformed guest ACKs.
- Drain via read() loop into a stack sink; decrement bytes_in_flight by
  the actual drained count, not the claimed acked_bytes.
- tracing::trace! on each consume; tracing::warn! + Closed on read error.

All 14 network_baseline tests pass including tcp_data_round_trip.

* refactor(slirp): drop to_host buffer + 256KB cliff, use TCP backpressure

Replace the guest→host write path with don't-ACK-on-EAGAIN: on WouldBlock,
skip the ACK and let the guest TCP retransmit once the kernel send buffer
drains. Remove the MAX_TO_HOST_BUFFER constant (256 KB cap), the overflow-
close branch, and the relay_tcp_nat_data to_host flush block. Mark to_host
and to_host_pending_ack dead_code pending Task 3.6 cleanup.

* refactor(slirp): drop to_guest/to_host/pending_ack fields and dead helpers

Remove three dead fields from TcpNatEntry that were superseded by the
passt-style peek+ACK+backpressure paths added in Tasks 3.3–3.5:
- to_guest: Vec<u8> (replaced by recv(MSG_PEEK)-based send in 3.3)
- to_host: Vec<u8> (replaced by direct write + don't-ACK-on-WouldBlock in 3.5)
- to_host_pending_ack: Option<u32> (replaced by ACK on n_written in 3.5)

Remove the three matching initializer sites from the TcpNatEntry constructor.
Update the file-level Architecture doc and the bytes_in_flight field comment
to reflect the Phase 3 design (no userspace buffers; kernel socket buffer
holds outstanding data).

* test(network): flip 256KB cliff pin — assert >1MB throughput succeeds

Renames tcp_to_host_buffer_drops_at_256kb → tcp_writes_more_than_256kb_succeed
and rewrites the body to assert the Phase 3 positive contract: pushing 1 MB
through the relay succeeds with no RST/FIN mid-stream. Updates the file-level
BROKEN_ON_PURPOSE inventory accordingly.

* bench(network): tcp_bulk_throughput_1mb — measures post-Phase-3 backpressure path

Adds a divan microbench that pushes 1 MiB through the SLIRP relay under
a constrained host receiver (SO_RCVBUF=4096), forcing the passt-style
sequence-mirroring + don't-ACK-on-EAGAIN backpressure path on every
iteration. Reports throughput in MB/s via BytesCount so regressions are
numerically visible. Mirrors the 95%-delivery threshold from the
tcp_writes_more_than_256kb_succeed contract test. ~61 ms median /
~17 MB/s on this host; 10 samples complete in well under 60 s total.

* bench(network): --bulk-mb mode to measure post-Phase-3 backpressure

Adds a guest→host throughput measurement that pins the host listener's
SO_RCVBUF to 4096 before accept(). The constrained receiver forces
TCP-level backpressure to engage during the transfer: the SLIRP relay's
non-blocking write to host_stream returns WouldBlock, the relay
declines to ACK the segment, and the guest retransmits — exercising
the don't-ACK-on-EAGAIN path that Phase 3 introduced. Pre-Phase-3 the
same scenario hit the 256 KB userspace cliff and reset the connection
mid-transfer; post-Phase-3 the bytes go through.

Smoke run on this host (Fedora 43 / KVM / slim x86_64):
  bulk-g2h[ 0]: 10485760 B in 0.429s = 1565.6 Mbps (constrained receiver)

Compare to the unconstrained tcp_throughput_g2h_mbps (~1885 Mbps) —
the ~17% reduction is the backpressure cost. The metric is opt-in
(--bulk-mb 0 by default) so it doesn't slow down standard runs.

Companion to the divan microbench tcp_bulk_throughput_1mb (commit
5fe4316) that exercises the same path at the unit level. The wall-clock
metric is what we'd compare against passt+qemu in the future
side-by-side run.

* docs(plans): add Phase 4 plan (unified flow table refactor)

* refactor(slirp): define FlowKey + FlowEntry enums (no callers yet)

Add Copy to NatKey (all fields are trivially copyable: u16, Ipv4Address,
u16) and clean up three clone_on_copy sites that clippy now catches.

Introduce FlowKey and FlowEntry alongside the existing per-protocol types;
both are marked #[allow(dead_code)] until Task 4.2 wires the unified
flow_table field.

* fix(ci): non-Linux stubs for benches/network.rs + voidbox-network-bench

Both files used file-level `#![cfg(target_os = "linux")]`, which on
macOS produces an empty crate with no `main()` → E0601. Caught by
PR #68's macOS CI lanes (Lint, MSRV, Test, E2E VZ).

Fix mirrors `benches/startup.rs`: keep the `main()` shape unconditional
and gate only the SLIRP-using imports + body. The smoltcp dep is
already `cfg(target_os = "linux")` in Cargo.toml, so the Linux-only
items genuinely can't compile on macOS — wrapping them in a Linux-only
module is the cleanest way to keep the cfg gating in one place.

- `benches/network.rs`: `mod linux_benches { ... }` wraps every helper
  and `#[divan::bench]`. Top-level `fn main()` calls `divan::main()` on
  Linux and prints a skip notice elsewhere.
- `src/bin/voidbox-network-bench/main.rs`: `mod linux_main { ... }`
  wraps everything from `TRANSFER_MB` to the bottom of the file.
  Top-level provides two cfg-gated `fn main()` shapes — Linux delegates
  to `linux_main::main_impl()`, non-Linux prints a skip notice.

Linux validation:
  - cargo fmt --check: clean
  - cargo clippy -D warnings: clean
  - cargo test --test network_baseline: 14/14

* docs(plans): add three Phase 4 benches (mixed flows, per-protocol, table ops)

* refactor(slirp): add flow_table field on SlirpBackend (parallel to existing maps)

* refactor(slirp): migrate ICMP path to flow_table

Replace all self.icmp_echo accesses (5 sites: entry insert/lookup in
handle_icmp_frame, keys iteration + get_mut + remove in relay_icmp_echo)
with self.flow_table keyed on FlowKey::IcmpEcho.

Drop the icmp_echo field and its HashMap::new() initializer.
Drop #[allow(dead_code)] from FlowKey, FlowEntry, and flow_table;
add variant-level #[allow(dead_code)] on Tcp/Udp variants that are
consumed in tasks 4.4 and 4.5.

All 14 network_baseline pins pass; fmt + clippy clean.

* refactor(slirp): migrate UDP path to flow_table

Replace all self.udp_flows accesses with self.flow_table keyed on
FlowKey::Udp / FlowEntry::Udp, following the same pattern as the
ICMP migration in 4.3. Drop the udp_flows field and its HashMap::new()
initializer. Remove #[allow(dead_code)] from FlowKey::Udp and
FlowEntry::Udp now that both variants are consumed.

* refactor(slirp): migrate TCP path to flow_table

Move all 9 self.tcp_nat accesses to self.flow_table under FlowKey::Tcp /
FlowEntry::Tcp.  Drop the tcp_nat field and its HashMap::new() initializer.
Remove #[allow(dead_code)] from FlowKey::Tcp and FlowEntry::Tcp now that
both variants are actively consumed.

Max-concurrent check now counts FlowKey::Tcp entries in flow_table.
relay_tcp_nat_data collects TCP flow keys then iterates with get_mut,
matching the established ICMP/UDP patterns from 4.3–4.4.

All 14 network_baseline tests pass; tcp_bulk_throughput_1mb bench: 17.06 MB/s.

* refactor(slirp): update Phase 4 doc header for unified flow table

* bench(network): poll_with_n_mixed_flows — mixed TCP/UDP/ICMP at scale

* bench(network): process_udp_frame + process_icmp_echo_request

* bench(network): add flow_table_insert_remove synthetic microbench

Pure-compute baseline for Phase 4's unified HashMap<FlowKey, FlowEntry>.
Measures insert + remove throughput on n=[10, 100, 1000] entries using
synthetic u32 values, isolating HashMap mechanics from socket overhead.
Phase 5+ reference number for hasher experiments (foldhash, ahash, SipHash)
or container-shape changes (hashbrown raw API).

Uses proxy data (usize -> u32 map) instead of real TcpNatEntry to avoid
socket cloning cost per insert — the bench goal is HashMap cost, not socket ops.

* docs(plans): add Phase 5 plan (stateless NAT + port forwarding)

* feat(network): add nat.rs with stateless translate_outbound (no callers yet)

Pure types (Rules, PortForward, ForwardProto) and translate_outbound function
that maps guest destination addresses to host SocketAddrs. No per-flow state;
deny-list check beats gateway-loopback rewrite. Doc-test + unit tests included.

* refactor(slirp): add nat::Rules field on SlirpBackend (parallel to deny_list)

* refactor(slirp): TCP path uses nat::translate_outbound

Replace the two inline operations in the SYN branch of handle_tcp_frame
with a single nat::translate_outbound call:
  - the SLIRP_GATEWAY_IP → 127.0.0.1 rewrite
  - the deny-list iteration (previously via is_denied)

The RST-emission shape and warn! event are preserved verbatim. Drop the
now-callerless is_denied method; add #[allow(dead_code)] to deny_list
(still held for task 5.4 which migrates UDP and then drops the field).

* refactor(slirp): UDP path uses nat::translate_outbound, drop deny_list field

* refactor(slirp): plumb port_forwards from NetworkConfig into nat::Rules

Add `port_forwards: &[(u16, u16)]` to `SlirpBackend::with_security`.
Each tuple is mapped to `nat::PortForward { proto: ForwardProto::Tcp, .. }`
and stored in `nat::Rules.port_forwards`.  `SlirpBackend::new()` passes
`&[]` as before.

The cold-boot VMM construction site (`src/vmm/mod.rs`) also passes `&[]`
with a TODO(5.5b) comment — `VoidBoxConfig` does not yet carry
`port_forwards`, so wiring the real slice is deferred to sub-task B.
The snapshot-restore site calls `SlirpBackend::new()` and is unaffected.

No relay code reads `nat.port_forwards`; no host listeners are spawned.
Sub-task B (5.5b) will add the actual TcpListener-per-rule logic.

All 14 network_baseline tests pass. fmt + clippy clean.

* test(network): pin nat::translate_outbound (loopback, external, deny)

* bench(network): nat_translate_outbound_hot_path — Phase 5 baseline

* feat(slirp): TcpNatState::SynSent + handle inbound SYN-ACK

Add the SynSent state to TcpNatState for host-initiated (port-forward)
connections.  When handle_tcp_frame sees SYN+ACK on a SynSent entry it
sends an ACK to the guest, advances our_seq, records guest_ack, and
transitions to Established — completing the inbound 3-way handshake.

Add #[cfg(test)] helpers on SlirpBackend (insert_synthetic_synsent_entry,
tcp_flow_state, injected_plain_ack_count) and a unit test
tcp_inbound_syn_ack_completes_handshake that seeds a SynSent entry,
feeds a guest SYN-ACK, and asserts (a) state → Established and (b)
one plain ACK queued for injection.  The full E2E contract is deferred
to task 5.5b.5 (tcp_port_forward_inbound in tests/network_baseline.rs).

build_tcp_packet_static signature: (src_ip, dst_ip, src_port, dst_port,
seq, ack, control, payload).  The inbound ACK uses src=SLIRP_GATEWAY_IP,
dst=SLIRP_GUEST_IP, src_port=key.dst_port (high port), dst_port=
key.guest_src_port.

* bench(network): tcp_inbound_syn_ack_transition — Phase 5.5b.1 microbench

Adds a divan bench for the SynSent → Established state-machine path
introduced in 5.5b.1.  The bench seeds one synthetic SynSent entry,
feeds a SYN-ACK frame to process_guest_frame, and measures the
transition cost (~42 µs median, same order as process_syn).

Approach (option a): widen the three #[cfg(test)] helpers on
SlirpBackend to #[cfg(any(test, feature = "bench-helpers"))].
insert_synthetic_synsent_entry is promoted to `pub` within the gated
impl block so the bench binary (a separate compilation unit) can call
it.  The feature is never enabled in production builds.  All helpers
in benches/network.rs that are only needed under bench-helpers are
gated with #[cfg(feature = "bench-helpers")] to keep the default bench
binary warning-free.

* feat(slirp): add synthesize_inbound_syn helper for port-forwarding

* bench(network): synthesize_inbound_syn pure-compute (Phase 5.5b.2.b)

* feat(slirp): port-forward listener thread implementation (not wired yet)

Adds the host listener thread infrastructure for Phase 5.5b inbound
TCP port-forwarding. No listener is spawned yet (that is task 5.5b.4),
so there is no behavior change in this commit.

New items in src/network/slirp.rs:
- `InboundAccept` struct (pub(crate)) — channel payload from listener to net-poll
- `port_forward_listeners: Vec<JoinHandle<()>>` field — holds spawn handles
- `port_forward_shutdown: Arc<AtomicBool>` field — graceful-shutdown signal
- `pending_inbound_accepts: mpsc::Receiver<InboundAccept>` field — accept channel rx
- `accept_sender: mpsc::Sender<InboundAccept>` field — keeps channel open + test helper
- `process_pending_inbound_accepts()` — drains channel, inserts SynSent entries, queues SYNs
- `run_port_forward_listener()` — module-scope thread fn, nonblocking accept loop
- `spawn_port_forward_listeners()` — pub(crate) factory, not called until 5.5b.4
- `Drop for SlirpBackend` — sets shutdown flag, joins all listener handles
- `push_inbound_accept()` on test-only impl — injects accepts for unit tests
- `drain_to_guest` now calls `process_pending_inbound_accepts()` as step 0

TDD: test `process_pending_inbound_accepts_seeds_synsent_and_queues_syn`
written and watched fail before implementation; passes in GREEN.
17/17 network_baseline integration tests unchanged.

* feat(slirp): wire spawn_port_forward_listeners from with_security

Call `spawn_port_forward_listeners` in `SlirpBackend::with_security`
so host listener threads are actually spawned when `nat.port_forwards`
is non-empty. The function now also returns the `Sender` end of the
accept channel so `accept_sender` (needed to keep the channel open in
tests) is sourced from the same channel pair as `pending_inbound_accepts`.

Remove the `#[allow(dead_code)]` attrs from both functions.

Unit test `with_security_spawns_listener_per_tcp_port_forward` confirms
zero threads for empty rules and one thread per TCP rule.

* test(network): tcp_port_forward_inbound — Phase 5.5b e2e contract

Adds tcp_port_forward_inbound_connect_succeeds to tests/network_baseline.rs.

The test builds a SlirpBackend with one port-forward rule (18080→8080),
drives drain_to_guest in a loop while a host thread connects to
127.0.0.1:18080, synthesizes a guest listener by responding with SYN-ACK
to the SYN the stack emits, and asserts three contract points:

1. host TcpStream::connect succeeds — listener thread (5.5b.3) is alive.
2. drain_to_guest emits a synthesized SYN to GUEST_PORT — InboundAccept
   channel + process_pending_inbound_accepts + synthesize_inbound_syn
   (5.5b.2/5.5b.3/5.5b.4) all fired.
3. drain_to_guest emits the completing ACK after our SYN-ACK — the
   SynSent → Established arm (5.5b.1) fired.

Also adds parse_tcp_to_guest_full helper (superset of parse_tcp_to_guest
that also returns src/dst ports, needed to identify the ephemeral
high_port in the synthesized SYN).

No VM, no --ignored flag, completes in ~0.1 s.

* bench(network): port_forward_accept_latency — Phase 5.5b wall-clock baseline

Times the full inbound port-forward path: host TcpStream::connect →
listener thread accept() → mpsc channel → process_pending_inbound_accepts
→ synthesize_inbound_syn → drain_to_guest output. Bounded above by
PORT_FORWARD_POLL_INTERVAL (50ms).

Regressions in the inbound state machine or listener poll loop now
surface numerically against this baseline.

* chore(bench): add scripts/bench-compare.sh — phase comparison report

Compares HEAD against an arbitrary baseline ref using the two bench
harnesses: divan microbenches (cargo bench --bench network) and the
VM-backed wall-clock harness (voidbox-network-bench). Emits markdown
with absolute numbers + percent deltas, suitable for PR descriptions.

Replaces the scattered /tmp/baseline-network-phase*.json files with a
reproducible single entry-point.

* test(network): icmp_echo_returns_reply — probe + assert, no silent skip

Reviewer (Copilot) flagged that the previous skip-on-no-reply path
masked real ICMP regressions on hosts where unprivileged ICMP works.
Probe socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once: skip only on
EPERM/EACCES (sysctl net.ipv4.ping_group_range forbids it), assert
otherwise.

* fix(network-bench): skip failed iterations + drop guest-ping ICMP path

C1.3: measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now
`continue` on guest nc non-zero exit so failed iterations don't skew
the reported mean.

C2.2: measure_icmp_rr_latency dropped its guest-side ping path. The
guest images intentionally omit /bin/ping (busybox-static lacks
CONFIG_FEATURE_PING_TYPE_DGRAM and SOCK_RAW would need root); the
function now returns None with a warn explaining the gap. Proper
host-driven measurement is tracked as a follow-up.

* bench(network): migrate from deprecated .poll() to drain_to_guest()

Aligns benches with the production RX path. The deprecated poll()
allocated a fresh Vec<Vec<u8>> per call; drain_to_guest appends to a
caller-owned buffer that's reused across iterations. CI now gates on
the same allocator pattern production code uses, removing avoidable
allocation overhead from the measurements.

Drops the file-level #![allow(deprecated)].

* chore(bench): bench-compare.sh — fall back without bench-helpers feature

When the baseline ref pre-dates a464fc1 (Phase 5.5b.1) it doesn't have
the `bench-helpers` cargo feature, and cargo errored out, dropping the
entire baseline divan section to —.

Detect the "does not have feature" / "unknown feature" stderr signal
and retry without --features bench-helpers. Benches that exist at both
refs get real Δ%; the bench-helpers-gated ones (synthesize_inbound_syn,
tcp_inbound_syn_ack_transition) naturally remain — for baseline.

Unlocks bench-compare.sh --baseline origin/main against the full branch
history.

* fix(network-bench): bound accept-thread lifetimes with deadlines

drain_one_connection, rr_echo_server, crr_echo_server now accept with
a deadline derived from LATENCY_RECV_TIMEOUT + slack. Previously each
spawned thread blocked forever on listener.accept() if the guest nc
never connected (exec error, network failure), holding the listener
FD across all subsequent iterations and burning thread/FD slots.

When the accept deadline lapses, the thread exits cleanly, the
listener drops, and the next iteration starts with a clean slate.

Addresses Copilot review C2.1, C2.5, C2.6.

* docs: Phase 6 overview plan — TCP lifecycle + async connect + windows + epoll

Scopes the four architectural follow-ups surfaced in the
smoltcp-passt-port PR review:

- 6.1 (high)  : TCP half-close (FinWait*/CloseWait/LastAck) — silent
                data loss on shutdown(SHUT_WR) today.
- 6.2 (med-h) : Async outbound connect — vCPU thread blocked up to
                3 s on slow destinations.
- 6.3 (med)   : Window management + scaling — guest window ignored;
                advertised window hardcoded 65535.
- 6.4 (med-l) : Event-driven RX polling — replace 5 ms timer with
                epoll_wait.

Locks the observability + cross-platform + snapshot invariants from
the top-level spec. Per-subsystem TDD task lists deferred to dedicated
plans (-phase6.1.md..-phase6.4.md) written before each kicks off.

* docs: Phase 6.4 detailed TDD plan — epoll-driven RX

14 bite-sized tasks covering the event-driven RX rewrite:

- Task 1: drain_n migration + tcp_writes_more_than_256kb_succeed
  retransmit-test fix-up (Copilot C1.1; correctly classified as
  VALID after re-reading; the previous "stale" verdict missed
  that seq advances unconditionally on line 398).
- Task 2: BROKEN_ON_PURPOSE pin tcp_rx_latency_sub_5ms.
- Tasks 3-6: EpollDispatch module — epoll_create1, register/
  unregister, wait_with_timeout, self-pipe Waker.
- Task 7: SlirpBackend holds EpollDispatch + Waker.
- Tasks 8-9: TCP/UDP/ICMP register-on-insert + unregister-on-remove.
- Task 10: relay_*_data dispatch by readiness instead of full-table
  iteration.
- Task 11: net_poll_thread rewritten to epoll_wait(50ms);
  BROKEN_ON_PURPOSE pin flips to passing.
- Task 12: snapshot rebuild path (epoll_fd doesn't survive snapshot).
- Task 13: tcp_rx_latency_one_packet bench + perf gate.
- Task 14: full validation contract.

Hard perf gate: scripts/bench-compare.sh --baseline origin/main
must show every comparable bench at HEAD ≤ baseline + 5%, AND
tcp_rx_latency_us_p50 < 5ms (was bounded below by 5ms timer
cycle), AND port_forward_accept_latency improves by ≥30% (or
documented exception).

Lives on branch smoltcp-passt-port-phase6.4-epoll, cut from PR #68's
tip (47868f0). Will become its own PR once the plan executes.

* test(network): drain_n via drain_to_guest + real retransmit in 256kb test

Two test-harness improvements landing together since both block the
Phase 6.4 RX-latency work:

- drain_n migrated from deprecated SlirpBackend::poll() to
  drain_to_guest. This was the last in-tree poll() caller.
- tcp_writes_more_than_256kb_succeed now matches its 'we re-send
  those' comment: seq only advances when acked_seq catches up,
  giving real TCP-retransmit semantics in the synthetic guest
  rather than the previous 'lossy with 95% tolerance' shape.
  Phase 6.4 must not regress this contract; making the test
  faithful first means epoll regressions surface as failures
  instead of borderline 95% misses.

* test(network): pin tcp_rx_latency_sub_5ms (BROKEN_ON_PURPOSE)

Phase 6.4 contract: host→guest RX latency must be sub-5 ms when
data is available. Pre-6.4 the floor is the 5 ms net_poll_thread
sleep cycle; this assertion fails on master and on the current
PR #68 tip. Phase 6.4's epoll dispatch will flip it to passing.

#[ignore] is deliberately NOT used: this is a positive contract,
and CI must surface the failure on master so the gate is
unmissable.

* Revert "test(network): pin tcp_rx_latency_sub_5ms (BROKEN_ON_PURPOSE)"

This reverts commit 3e47ffbd2ba9f7edee64047145bf0e6ff3a0a811.

* docs(phase6.4): drop Task 2 unit-level pin — VMM-level contract instead

Implementer caught a planning error: the 5 ms host→guest latency
floor lives in src/vmm/mod.rs:1609 (net_poll_thread sleep), not in
SlirpBackend. drain_to_guest runs synchronously from a unit test,
so the floor never materializes there — the original Task 2 test
underflowed on subtraction and would never have observed what we
care about.

The contract — host→guest must deliver in < 5 ms — is preserved as
a VM-level requirement in Task 13's wall-clock tcp_rx_latency_us_p50
metric in voidbox-network-bench. The plan's hard-perf-gate already
requires it, so no contract is lost.

Reverts 3e47ffb (the broken unit pin) and amends the plan to mark
Task 2 as DROPPED with rationale.

* feat(network): EpollDispatch skeleton with epoll_create1

Phase 6.4 foundation. One epoll_fd owned via OwnedFd + EPOLL_CLOEXEC.
No registration logic yet — Task 4 will add register/unregister and
Task 6 will add the self-pipe + wait loop.

* feat(network): EpollDispatch register/unregister

* feat(network): EpollDispatch::wait_with_timeout

* feat(network): EpollDispatch self-pipe wakeup

Cloneable Waker writes one byte to a non-blocking pipe registered
with EPOLLIN. wait_with_timeout filters self-pipe events out of
the returned set and drains the pipe so subsequent waits don't
spurious-wake.

* refactor(slirp): SlirpBackend holds Arc<Mutex<EpollDispatch>> + Waker

Start with Arc<Mutex<EpollDispatch>> directly instead of plain
EpollDispatch so Task 11's poll-thread refactor (which needs the Arc
to share the same instance across threads) lands without a second
struct-field touch.

SlirpBackend gains two new fields:
- `epoll: Arc<Mutex<EpollDispatch>>` — the epoll instance
- `epoll_waker: Waker` — cloneable write-end of the self-pipe

Both fields are unused until Task 8 wires up the register/unregister
calls; #[allow(dead_code)] suppresses clippy -D warnings in the interim.
Targeted #[allow(dead_code)] attributes on unregister, wait_with_timeout,
EpollEvent, and Waker::wake/write_end cover the same interim period.

All 18 baseline pins pass; epoll_dispatch unit tests (5/5) pass.

* feat(slirp): register TCP flows with EpollDispatch

After each TCP flow-table insertion, register the host TcpStream FD
with EpollDispatch (EPOLLIN, token = flow_token_for_tcp) and call
epoll_waker.wake() to unblock any polling thread.  Before each
removal, unregister the FD to prevent dangling epoll registrations.

Three insertion sites covered:
- process_pending_inbound_accepts: port-forward SynSent entries
- handle_tcp_frame: outbound SYN → SynReceived entries
- insert_synthetic_synsent_entry (test helper): guarded with
  #[cfg(not(test))] so unit tests skip epoll side-effects

Two removal sites covered:
- relay_tcp_nat_data to_remove loop: Closed + timeout evictions
- handle_tcp_frame stale-entry purge before re-SYN

Token layout (flow_token_for_tcp):
  bits 63-56: 0x01 (PROTO_TAG_TCP)
  bits 47-32: guest_src_port
  bits 31-16: dst_port
  bits 15-0:  low 16 bits of dst_ip

PROTO_TAG_UDP, PROTO_TAG_ICMP, flow_token_for_udp, flow_token_for_icmp
are defined here (at module scope, project convention) with
#[allow(dead_code)] — Task 9 wires them immediately after.

All 18 baseline pins pass; 5/5 epoll unit tests pass.

* feat(slirp): register UDP + ICMP flows with EpollDispatch

Mirror Task 8's TCP pattern for the two remaining protocol families.

UDP (handle_udp_frame):
- On first datagram for a new (guest_src_port, dst_ip, dst_port) 3-tuple,
  record new_host_fd before the Entry API moves the UdpSocket into the
  flow table, then register(fd, flow_token_for_udp, EPOLLIN) + wake().
- relay_udp_flows idle-reaper: unregister(sock.as_raw_fd()) before remove.

ICMP (handle_icmp_frame):
- On first echo request for a new (guest_id, dst_ip) pair, record
  new_icmp_fd before the Entry API moves the socket, then register +
  wake() after the borrow ends.
- relay_icmp_echo idle-timeout eviction: look up sock fd and unregister
  before remove (borrow already dropped by then).

Token layouts (protocol tag in bits 63-56, as for TCP):
  flow_token_for_udp:  0x02 tag | guest_src_port<<32 | dst_port<<16 | dst_ip_low16
  flow_token_for_icmp: 0x03 tag | guest_id<<32

All 18 baseline pins pass; 23/23 lib-network tests pass.
Bench compilation still clean.

* feat(slirp): relay loops dispatch by epoll readiness

drain_to_guest polls the EpollDispatch with a zero-duration timeout and
passes the resulting readiness set to the three relay methods
(relay_tcp_nat_data, relay_icmp_echo, relay_udp_flows).

Each relay now filters by protocol tag (PROTO_TAG_{TCP,UDP,ICMP}) and
only visits flows whose socket appears as EPOLLIN-ready in the event
set, avoiding O(flow_count) reads-on-every-tick.

relay_tcp_nat_data uses a two-pass design: Pass 1 sweeps all TCP entries
for Closed state and idle timeout unconditionally (so a guest FIN that
marks an entry Closed in handle_tcp_frame causes the host TcpStream to
drop promptly, giving the server-side reader an EOF); Pass 2 restricts
the peek/relay I/O to ready entries only.

epoll_arc() added to NetworkBackend trait (Linux cfg-gated, default
None) and overridden on SlirpBackend. VirtioNetDevice.epoll_arc()
delegates to the backend, enabling net_poll_thread (Task 11) to obtain
the shared Arc without an additional lock or refactor.

All 18 baseline pins pass.

* feat(vmm): net_poll_thread driven by epoll_wait

Replace the fixed 5 ms sleep with a blocking epoll_wait(50 ms) on the
EpollDispatch instance obtained from the network backend.

The thread wakes immediately when any registered host socket becomes
readable (relay loop runs at event time, not after a fixed delay) and
falls back to a 50 ms housekeeping tick when idle — preserving the UDP/
ICMP stale-flow reap path that was previously driven by the 5 ms sleep.

If the backend does not expose an epoll instance (non-SlirpBackend, e.g.
unit-test mocks), the thread keeps the original 5 ms sleep fallback.

All 18 baseline pins pass. Release build clean.

* feat(slirp): rebuild epoll set on snapshot restore

`epoll_fd` is a Linux kernel handle that does not survive snapshot:
after `MicroVm::from_snapshot` creates a fresh `SlirpBackend` via
`SlirpBackend::new()`, the new `EpollDispatch` starts with zero
registered FDs.  The current snapshot path does not reconstruct
`flow_table` — the backend always starts empty and new flows form
naturally — so the rebuild is a no-op today.  It is wired in advance
so Phase 6.1's half-close work (which will persist restored flows
across snapshot/restore) has a ready call site.

Changes:
- `EpollDispatch`: add `registered_count` field maintained by
  `register`/`unregister`; expose `registered_fd_count()` under
  `cfg(any(test, feature = "bench-helpers"))`.
- `SlirpBackend::rebuild_epoll_from_flow_table()`: iterates
  `flow_table` and re-registers each live host FD (`host_stream`,
  `sock` for UDP/ICMP) with the current dispatcher.
- `SlirpBackend::registered_fd_count()`: test/bench shim that
  delegates to `EpollDispatch::registered_fd_count()`.
- `SlirpBackend::reset_epoll_for_snapshot_test()`: replaces the
  epoll dispatcher with a fresh empty one, simulating the
  post-snapshot state (kernel handle gone) for unit-level smoke tests.
- `epoll_set_rebuilt_from_flow_table_smoke` in `network_baseline`:
  insert flow → reset epoll → assert count 0 → rebuild → assert
  count 1.

* fix(test): gate epoll_set_rebuilt smoke test on bench-helpers feature

The smoke test consumes #[cfg(any(test, feature = "bench-helpers"))]-
gated helpers (insert_synthetic_synsent_entry,
reset_epoll_for_snapshot_test, registered_fd_count). Integration
tests in tests/ don't get cfg(test) on the void-box library
crate — they only see #[cfg(feature = "bench-helpers")] items
when the feature is enabled.

Without this gate, default `cargo test --test network_baseline`
fails to compile with E0599 on the gated helper methods. Now:

- Default cargo test → 18 pins pass, smoke test invisible.
- cargo test --features bench-helpers -- --test-threads=1 → 19
  pins pass, smoke test included.

The serial-run requirement is to side-step a pre-existing
parallel-run flake in tcp_port_forward_inbound_connect_succeeds
(host port-bind contention; not a Phase 6.4 regression).

* bench(network): tcp_rx_latency_one_packet — Phase 6.4 baseline

Divan microbench (`tcp_rx_latency_one_packet`) measures the SLIRP-layer
per-packet dispatch cost when one TCP flow is Established and the host
kernel has data ready: one zero-timeout epoll_wait + readiness scan +
peek + Ethernet frame construction.

Measured median on this host: ~9.8 µs per drain_to_guest call.

Pre-6.4 the relay iterated every flow in flow_table unconditionally
regardless of readiness.  Post-6.4 it dispatches only the flows with an
epoll EPOLLIN event, reducing wasted work on idle flows to zero.  This
bench is the regression anchor for that change.

The bench is gated on `--features bench-helpers` (like the existing
`tcp_inbound_syn_ack_transition` and `synthesize_inbound_syn` benches).
It performs a full 3-way handshake outside the timed loop so only the
hot relay path is measured.

Note: this bench cannot exercise the net_poll_thread 50 ms epoll cycle
(that thread does not run inside divan).  The wall-clock host→guest
latency floor is the province of voidbox-network-bench's
`tcp_rx_latency_us_p50` field.  That field is added to the Report struct
in this commit but returns None (deferred): wiring a guest-side listener
requires either a guest daemon or an additional exec RPC — both out of
scope for Phase 6.4.  The divan microbench is the primary numerical
deliverable for this phase.

* perf(slirp): eliminate epoll mutex contention via event queue

net_poll_thread holds the EpollDispatch mutex for the full 50 ms
of its blocking wait. drain_to_guest's own non-blocking
wait_with_timeout(ZERO) call contended on the same mutex,
serializing the vCPU thread behind the net-poll thread.
voidbox-network-bench saw TCP g2h throughput drop from ~1885 Mbps
to ~44 Mbps (40× regression).

Fix: SlirpBackend gets a small Mutex<Vec<EpollEvent>> queue.
net_poll_thread pushes events into it after each successful
wait_with_timeout. drain_to_guest drains the queue (brief
uncontended lock) without touching EpollDispatch. A try_lock
fallback path serves unit tests (no net_poll_thread) without
blocking on the mutex.

NetworkBackend trait gains a push_ready_events default-no-op so
SlirpBackend can override it; VirtioNetDevice exposes
push_events_to_backend as the trampoline called by net_poll_thread.

Off-CPU profile evidence: drain_to_guest was 9% off-CPU (29.7s
in a 60s window) waiting on the epoll mutex; should drop to
near-zero post-fix.

* perf(slirp): replace two-pass relay sweep with lazy close queue

relay_tcp_nat_data's Pass 1 unconditionally copied every TCP
FlowKey into a Vec to scan for Closed entries on every drain call.
Cache misses 47/1K under load; poll_with_n_flows/100 regressed
+246% (130ns → 450ns), /1000 regressed +220%.

Fix: when handle_tcp_frame's FIN/RST handlers and mid-function
error paths set state=Closed, push the key onto a pending_close
Vec. relay_tcp_nat_data drains this Vec at the top of its single
ready-events pass — no O(n) collect required. Idle-timeout
detection retains a direct flow_table iteration but without
allocating a separate key Vec.

* perf(slirp): prefer try_lock on epoll over pending_events in drain_to_guest

The initial Bug A fix used pending_events.lock() + try_lock(epoll)
in drain_to_guest's fast path, adding ~150ns overhead per call vs
Phase 6.4 (one extra Mutex acquire). This showed as +38% regression
in poll_idle bench (441ns → 611ns).

Revised approach: try_lock epoll first (zero cost when uncontended —
tests, benches, idle net-poll thread). On Err (net_poll_thread holds
the mutex for 50ms), drain pending_events instead. In production the
try_lock fails ~once per 50ms window; in tests it always succeeds.
Net result: drain_to_guest overhead matches Phase 6.4 when epoll is
uncontended; contention eliminated when net_poll_thread is actively
waiting.

* perf(slirp): wake net-poll thread when process_guest_frame queues frames

Pre-Phase-6.4, net_poll_thread woke unconditionally every 5 ms, so
every ACK queued in inject_to_guest by handle_tcp_frame got flushed
within 5 ms. Phase 6.4's epoll_wait(50 ms) waits for FD readiness
events — but a guest writing data has no FD-side signal (the guest
is the writer; the SLIRP-side socket only becomes readable when the
host responds). So queued ACKs sat 50 ms before being flushed; TCP
send window stalled; voidbox-network-bench TCP g2h dropped from
~1885 Mbps to ~225 Mbps even after the mutex-contention fix.

Fix: track inject_to_guest length around process_guest_frame's
ethertype dispatch.  If the call queued any frames, call
epoll_waker.wake() — one byte to the non-blocking self-pipe, which
unblocks net_poll_thread's epoll_wait so the queued frames flush
within microseconds.

Also fixes the related drain_to_guest event-source ordering bug:
pending_events (filled by net_poll_thread) is now ALWAYS drained
first, with the non-blocking epoll poll only running as a fallback
when the queue is empty (test/bench paths without net_poll_thread).
The previous code took the try_lock branch when net-poll was
between iterations and silently dropped events the net-poll
thread had already pushed.

voidbox-network-bench post-fix:
  g2h:        ~6000 Mbps  (vs master 1885; 3.2x)
  bulk-g2h:   ~3900 Mbps  (vs master 1565; 2.5x at SO_RCVBUF=4096)
  rr p50:     2 us        (parity with master)
  crr p50:    ~50 ms      (5x regression vs master ~10 ms — separate
                           bug, tracked in follow-up; the 50 ms is
                           exactly one epoll_wait cycle and points
                           to a connection-establishment latency
                           issue independent of the throughput path)

* perf(vmm): epoll_wait timeout 50ms → 5ms — restore CRR latency

CRR p50 was regressing +40 ms (10 ms → 51 ms) post-Phase-6.4.  The
+40 ms exactly matches Linux's TCP delayed-ACK timer, and the cause
is that Phase 6.4 widened the net-poll IRQ re-pulse cadence from
5 ms to 50 ms.

The Linux guest spends most idle time in HLT and relies on regular
vCPU scheduling slots — driven by our IRQ pulses — to advance its
TCP delayed-ACK timer.  At 50 ms cadence the guest's pure ACKs
ride the next event-triggered IRQ, which can be 40+ ms away.  At
5 ms the housekeeping cadence mirrors pre-6.4 and the timer fires
on schedule.

We lose Phase 6.4's headline "10x idle-wakeup reduction" goal but
fast-path events still wake immediately via epoll readiness — so
the net win vs master is unchanged: g2h throughput +250%, bulk
throughput +250%, RR parity, CRR parity.

voidbox-network-bench post-fix:
  g2h:        ~6500 Mbps  (vs master 1885; +247%)
  bulk-g2h:   ~5400 Mbps  (vs master 1565; +245%)
  rr p50:     ~3 us       (parity)
  crr p50:    ~10100 us   (parity — back to baseline 10 ms)

* chore: remove Phase-N references from inline + doc comments

Per AGENTS.md doc-comment style ("avoid ticket IDs and PR/commit
references inside doc comments and inline comments — they belong
in commit messages and PR descriptions where they're audit trail;
in code they age into noise as the ticketing context evolves").

Phase references fall into the same category.  Comments are
rewritten in present tense to explain the structural reasoning
without referencing when each piece landed.  Identifiers like
test names and BROKEN_ON_PURPOSE markers are unchanged.

Plan/spec docs in docs/superpowers/plans/ are intentionally
untouched — phase references there ARE the audit trail.

* perf(vmm): adaptive epoll_wait timeout (5 ms active / 50 ms idle)

Recovers Phase 6.4's headline 10x idle-wakeup reduction without
re-introducing the +40 ms CRR regression that forced the cadence
back to a fixed 5 ms.

The adaptive policy:
- last cycle had any kernel event   → next timeout 5 ms (active)
- last cycle timed out (no events) → next timeout 50 ms (idle)

A single quiet cycle drops us to idle; a single event puts us back
in active in the next cycle.

The subtlety that motivated the additional EpollDispatch change:
when the vCPU thread calls epoll_waker.wake() during a 50 ms idle
wait, the kernel's epoll_wait returns with the self-pipe event.
wait_with_timeout filters that event out and drains the pipe — so
`epoll_events.is_empty()` would have remained true, and the naive
"is_empty ⇒ idle" predicate kept us at 50 ms forever, regressing
CRR p50 back to ~50 ms.

wait_with_timeout now returns the *raw* kernel count (including
self-pipe wakes) so the adaptive policy treats wakes as activity.
Filtered events still arrive in the out parameter unchanged; only
the return value's meaning shifted from "observable count" to
"raw count," which all existing callers ignore.

voidbox-network-bench post-fix:
  g2h:        ~6680 Mbps  (vs 5 ms fixed: 6500; vs master: +254%)
  bulk-g2h:   ~5550 Mbps  (vs 5 ms fixed: 5400; vs master: +254%)
  rr p50:     1 us        (in 99-sample iteration; parity)
  crr p50:    ~10100 us   (parity preserved — adaptive correctly
                           holds 5 ms cadence during connection
                           bursts because each connection's wake()
                           keeps raw_kernel_events > 0)

Idle CPU dropped: profile-pre showed net_poll_thread on-CPU at
4.93 % of total with the fixed 5 ms cadence (200 wakes/sec); the
adaptive cadence should cut that roughly 10x during idle stretches
between iterations.

* fix(slirp): collision-safe flow tokens via monotonic counter

flow_token_for_tcp/udp truncated dst_ip to 16 bits; flow_token_for_icmp
omitted dst_ip entirely. Multiple flows could collide on the same
token, mis-routing readiness events to the wrong FlowKey.

Replace the lossy encoding with a monotonic AtomicU64 counter per
backend. Tokens are still tagged in the high byte for protocol demux
(PROTO_TAG_TCP/UDP/ICMP); the lower 56 bits are unique. A new
token_to_key HashMap makes readiness → FlowKey lookup O(1) instead
of the previous linear flow_table scan.

* fix(slirp): EpollDispatch lock-free — register/unregister never block

net_poll_thread held the Mutex<EpollDispatch> across the blocking
epoll_wait call (up to 50 ms in idle cadence). vCPU register/
unregister paths in handle_tcp_frame (and friends) had to acquire
the same mutex and would block behind the wait, stalling guest TCP
SYN handling for up to 50 ms during connection setup.

epoll_ctl and epoll_wait are kernel-thread-safe on the same epoll
fd; the only state requiring synchronization was the self-pipe
(now eagerly initialized in EpollDispatch::new) and the registered
fd count (now AtomicUsize). EpollDispatch becomes Sync without an
external Mutex — the type changes from Arc<Mutex<EpollDispatch>>
to Arc<EpollDispatch>. register/unregister run lock-free against
the wait thread; only the kernel's per-epoll-fd internal lock
serializes, and that's a fast path.

* perf(slirp): O(1) dedup in idle-timeout sweep via HashSet

to_remove.contains() inside the idle-timeout loop was O(n*k)
under churn. Switch the membership check to a HashSet<FlowKey>
and only materialize the Vec once at the end for the removal
loop.

* docs(vmm): net_poll_thread doc-comment matches adaptive timeout

* style: rust-style sweep on Phase 6.4 / Copilot-fix code

Apply project rust-style rules to the recently-landed Phase 6.4
code (token rewrite + lock-free EpollDispatch refactor):

1. RegisterMode enum replaces (readable: bool, writable: bool) on
   EpollDispatch::register. Closed-set policy at the call site
   (Read / Write / ReadWrite) over two opaque booleans.

2. matches!() removed at three sites — the project guide prefers a
   full match (or a boolean == comparison) so the compiler surfaces
   diagnostics if the matched type changes. The unprivileged-ICMP
   errno check now uses == comparisons; the FlowKey::Tcp counter
   uses a for loop.

3. Iterator chains in relay loops rewritten as for loops with
   mutable accumulators per the project rule. relay_tcp_nat_data,
   relay_udp_flows, relay_icmp_echo all touched. Logic unchanged;
   control flow now reads top-down without a chain of
   .filter().filter_map().collect().

4. Local renamed `rc` → `epoll_ctl_result` at three EpollDispatch
   sites. Role-bearing names are required in non-tiny scopes.

5. Dropped redundant explanatory comments around the relay loops
   ("Data relay — only for flows with…", "Skip entries already
   queued for…", "Collect ready ICMP flow keys via…"). The code
   below them is self-describing. Kept structural "why" comments
   (the ICMP idle-sweep rationale, the per-flow socket Drop
   contract).

No behavior change. cargo fmt, clippy -D warnings, network_baseline
(18/18), lib network (23/23), and voidbox-network-bench wall-clock
(g2h ~6580 Mbps, CRR ~32 µs) all green.