
SLIRP refactor — lift passt patterns into our smoltcp stack (Phases 0–5 + 5.5b)#68

Merged: dpsoft merged 92 commits into main from smoltcp-passt-port-phase0 on May 6, 2026

Conversation

Contributor

@dpsoft dpsoft commented Apr 29, 2026

Status: Ready for review. Phases 0–5 + 5.5b of docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md complete on this branch. Phase 6 (TCP half-close, async connect, window mgmt, event-driven polling) is scoped as an overview plan in docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.md and shipped as separate PRs.

Supersedes the design in docs/superpowers/plans/2026-04-12-network-backend-abstraction.md — instead of adopting passt as an external backend, this branch lifts its design patterns into our existing all-Rust smoltcp-based SLIRP stack. Keeps the all-Rust differentiator (full observability via tracing, debugger works, no opaque process boundary) while picking up passt's correctness wins (ICMP, full UDP, sequence-mirroring TCP, port-forwarding, stateless NAT).

What's in this branch

Phase 0 — Baseline + NetworkBackend trait

  • tests/network_baseline.rs — unit-level pins driving SlirpBackend directly (no VM), including BROKEN_ON_PURPOSE assertions that flip in later phases.
  • benches/network.rs — divan microbenches for SLIRP hot paths.
  • voidbox-network-bench — wall-clock harness with metric names matching passt's published table.
  • NetworkBackend trait extracted; SlirpStack → SlirpBackend rename (role-based name).
  • Per-poll Vec<Vec<u8>> allocation dropped (drain_to_guest(&mut Vec) out-param).

Phase 1 — ICMP echo

  • SOCK_DGRAM IPPROTO_ICMP host socket (unprivileged). Sysctl-restricted hosts get warn-once + drop fallback. BROKEN_ON_PURPOSE pin flipped: icmp_echo_returns_reply positively asserts the round-trip.

Phase 2 — Generalized UDP

  • Per-flow connected-socket NAT for any UDP destination (was port-53-only). DNS retains its dedicated cached fast-path. 60 s idle reap. BROKEN_ON_PURPOSE pin flipped.

Phase 3 — TCP relay rewrite

  • passt-style sequence mirroring: recv(MSG_PEEK) for host→guest, no userspace buffer; kernel-buffer consume is ACK-driven. guest→host now uses TCP backpressure (don't-ACK-on-EAGAIN) instead of a 256 KB userspace cap. to_guest/to_host/to_host_pending_ack/MAX_TO_HOST_BUFFER all deleted. BROKEN_ON_PURPOSE pin flipped: tcp_writes_more_than_256kb_succeed pushes 1 MB through the relay.

Phase 4 — Unified flow table

  • FlowKey / FlowEntry enums collapse the parallel tcp_nat, udp_flows, icmp_echo tables into a single HashMap<FlowKey, FlowEntry>. Each protocol's relay loop iterates the unified table with a matches! filter. Establishes the substrate for Phase 5+ stateless NAT and Phase 6 epoll dispatch.

Phase 5 — Stateless NAT translation refactor

  • New src/network/nat.rs with translate_outbound(rules, dst, port, gateway) — pure function, no state, no allocation in the hot path.
  • Rules carries gateway_loopback, deny_cidrs, and port_forwards. The TCP and UDP guest-frame handlers consume it through the same code path.
  • deny_list field dropped from SlirpBackend (subsumed by nat::Rules.deny_cidrs).

Phase 5.5b — Inbound port-forwarding

  • SlirpBackend::with_security spawns one host listener thread per TCP port-forward rule. On accept, the listener pushes through an mpsc channel; process_pending_inbound_accepts (called at the top of drain_to_guest) seeds a SynSent flow-table entry and synthesizes the inbound SYN to the guest. New TcpNatState::SynSent handles the resulting SYN-ACK transition to Established.
  • Wall-clock latency bench port_forward_accept_latency (~50 ms ceiling, bounded by PORT_FORWARD_POLL_INTERVAL).

Bench harness + comparison

  • cargo bench --bench network --features bench-helpers — divan microbenches.
  • voidbox-network-bench --output report.json [--bulk-mb N] — wall-clock harness, JSON report.
  • scripts/bench-compare.sh --baseline <ref> [--skip-vm] [--skip-divan] — markdown comparison of HEAD vs any baseline ref. Falls back gracefully when the baseline ref pre-dates bench-helpers (Phase 5.5b).

Reviewer-driven fixes (this PR)

Bundled correctness improvements from the Copilot review on this PR:

  • C1.2 — icmp_echo_returns_reply now probes socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once and asserts the round-trip; it skips only on EPERM/EACCES (sysctl restriction).
  • C1.3 — measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now continue on guest non-zero exit so failed iterations don't skew the mean.
  • C2.1 / C2.5 / C2.6 — three accept-thread leak patterns in voidbox-network-bench bounded with accept_with_deadline so spawned echo/drain threads exit cleanly when the guest never connects.
  • C2.2 — measure_icmp_rr_latency dropped its guest-side ping invocation (the test/production images intentionally omit /bin/ping); the metric is now an explicit None with a tracing warning until a host-driven measurement path lands in Phase 6.
  • C2.4 — benches/network.rs migrated from the deprecated SlirpBackend::poll() (per-call Vec<Vec<u8>> alloc) to drain_to_guest(&mut Vec<Vec<u8>>) with a reused buffer; #![allow(deprecated)] removed.

Bonus fixes (caught while debugging)

  • setuid busybox in test image (commit 77dfc67): chmod u+s on the user-packed cpio dropped PID 1's euid to uid 1000, breaking the CAP_NET_ADMIN-requiring setup_network. Reverted.
  • vsock backend snapshot mismatch (commit 0d0ab20): voidbox-startup-bench's capture_snapshot was missing .enable_snapshots(true), so cold-boot used Vhost and warm-restore (always Userspace) failed. Fixed; continue-on-error: true removed from the bench step (was masking this on main).

Non-negotiable invariants (preserved through every phase)

  1. All-Rust — no opaque process in the data path.
  2. Full observability via tracing — every state transition emits a structured event.
  3. cargo test-driveable — every behavior change exercised by tests/network_baseline.rs without a VM.
  4. Standard Rust tooling — LSP, clippy, profiler keep working.

CI changes

  • voidbox-network-bench --iterations 3 step in startup-bench.yml, no continue-on-error.
  • continue-on-error: true removed from the voidbox-startup-bench wall-clock step.
  • Divan network microbenches added to the regression gate (Phase 0 task 0B.4).

Validation

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test --workspace --all-features
  • cargo test --test network_baseline
  • cargo bench --bench network --features bench-helpers
  • cargo test --test snapshot_integration -- --ignored
  • cargo test --test e2e_mount -- --ignored
  • voidbox-network-bench --iterations 3 — ✅ TCP g2h ~1885 Mbps, RR p50 2 µs, CRR p50 ~10 ms
  • voidbox-startup-bench --iters 3 --breakdown — ✅ cold p50 ~236 ms, warm p50 ~76 ms

Bench comparison

scripts/bench-compare.sh --baseline <Phase-0B.1-tip> --skip-vm (the earliest commit that has the bench file):

| Bench | Phase 0 baseline | Phase 5.5b HEAD | Δ |
| --- | --- | --- | --- |
| process_syn | 23.6 µs | 21.7 µs | -8% |
| process_arp_request | 5.06 µs | 4.97 µs | -1.9% |
| poll_idle | 69.8 ns | 170 ns | +100 ns (sub-µs absolute; flow_table iteration + accept-channel drain at top of drain_to_guest) |

New benches (e.g. tcp_bulk_throughput_1mb, port_forward_accept_latency, nat_translate_outbound_hot_path, process_udp_frame, process_icmp_echo_request, …) have no Phase 0 numerical baseline — new capability; see the report file.

The work's wins are primarily capability (ICMP/UDP/inbound port-forwarding work where they didn't before; the 256 KB TCP cliff is gone) and architecture (unified flow table, stateless NAT). Raw guest→host throughput is roughly flat (~1885 Mbps), which is expected — passt-style refactoring buys correctness and observability, not headline bytes/sec.

What's NOT in this branch (deferred)

  • Phase 6 — docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.md — overview of TCP half-close, async connect, window mgmt, event-driven polling. Per-subsystem TDD plans deferred to when each kicks off.
  • IPv6 dual-stack — Phase 7, separately scoped.
  • Pre-existing conformance failures — 3 tests fail (they use /tmp paths rejected by the fs_guard hardening from "openat2-based symlink-escape guard for privileged FS RPCs (R-B2.1 + R-B2.2)" #56); confirmed pre-existing on main, not regressions from this branch.
  • passt head-to-head benchmark — needs passt+qemu reference env; separate task.

Test plan for review

  • Run cargo test --test network_baseline — pins pass, the three flipped BROKEN_ON_PURPOSE pins are now positive assertions.
  • Run cargo bench --bench network --features bench-helpers tcp_bulk_throughput_1mb — should produce a measurement.
  • Run voidbox-network-bench --iterations 3 --bulk-mb 10 — both throughput numbers (constrained vs unconstrained) without errors.
  • Run voidbox-startup-bench --iters 3 --breakdown — warm phase exits 0.
  • Run scripts/bench-compare.sh --baseline 41c8382 --skip-vm — validates the comparison harness.
  • Skim docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md and 2026-04-30-smoltcp-passt-port-phase6.md for the design rationale and follow-up scope.

dpsoft added 30 commits April 27, 2026 17:52
Adds two planning docs under docs/superpowers/plans/:

- 2026-04-27-smoltcp-passt-port.md (spec)
  Supersedes the 2026-04-12 network-backend-abstraction design.
  Replaces "add passt as opt-in backend" with "lift passt's design
  patterns into our smoltcp stack" — keeps observability, all-Rust
  path, single binary, cross-platform parity. Lists required skills
  for execution (rust-style, rustdoc, rust-analyzer-ssr,
  superpowers TDD/verification, repo verify/profile). Maps the work
  into 5+1 phases with per-phase plan-doc placeholders.

- 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan)
  25 bite-sized TDD tasks: correctness baseline pins, divan
  microbenches, wall-clock e2e harness, NetworkBackend trait
  extraction, SlirpStack → SmoltcpBackend rename. Includes three
  BROKEN_ON_PURPOSE assertions that flip in later phases.
Add two baseline tests for the smoltcp DNS proxy:
- dns_query_resolves: sends a query for example.com, polls ≤20×100ms,
  asserts reply XID matches.
- dns_cache_keys_by_question_not_xid: warms cache with xid=1, then
  queries with xid=2 and asserts the stack rewrites the reply XID.

Both tests skip gracefully (eprintln + early return) when the upstream
resolver is unreachable, making them safe in offline CI.

Also adds QNAME_EXAMPLE_COM const and two module-scope helpers:
build_dns_query (builds a correct UDP DNS frame with proper payload_len)
and parse_dns_reply_xid.

SLIRP_DNS_IP added to the existing module-scope slirp import.
Implement measure_tcp_throughput_g2h: binds a host-side TCP listener,
boots a VM, execs dd|nc in the guest, drains to EOF on the host, and
computes Mbps from bytes_received / elapsed.  h2g left None with a TODO.
Implements measure_rr_latency and measure_crr_latency in
voidbox-network-bench, reusing the single shared VM booted for
throughput measurements.

RR: guest pipes N bytes over one persistent nc connection; host times
each read+write pair (first sample discarded to absorb connect jitter).
CRR: guest runs N independent nc invocations; host times each full
accept+read+write+close cycle. Both use the existing percentile()
helper (dead_code attribute removed). Latency measurements always run
regardless of --no-throughput.
Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp"
is the underlying library. Role-based naming keeps the public type
surface stable across library swaps and matches the symmetry of
future TapBackend / VhostNetBackend siblings.

Module file src/network/slirp.rs keeps its name (already aligned with
the new type, matches src/devices/virtio_net.rs convention).
The actual polling logic now lives in drain_to_guest, which writes
directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh
allocation on every tick. poll becomes a #[deprecated] shim:

    #[deprecated(note = "use drain_to_guest")]
    pub fn poll(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        self.drain_to_guest(&mut out);
        out
    }

Existing call sites (virtio_net.rs, tests/network_baseline.rs,
benches/network.rs) are annotated with #[allow(deprecated)] and a
TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks,
after which the allow attributes can be removed.
Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to
Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call
in get_rx_frames with drain_to_guest into a reused rx_scratch buffer.

Update both VMM cold-boot and snapshot-restore construction sites to
coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline
tests pass; fmt and clippy clean.
Type rename only — the slirp.rs module file keeps its name.
SlirpBackend reflects the user-mode-NAT role rather than the
underlying smoltcp library, keeping naming symmetric with future
TapBackend / VhostNetBackend siblings.
Introduces the types and helper needed for ICMP echo NAT (Phase 1):

- IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table.
- IcmpEchoEntry {sock, guest_id, last_activity}: per-request state.
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW).
- icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend,
  initialized to HashMap::new() in with_security() (the canonical ctor;
  new() and Default both delegate through it).

No behavior change — handle_ipv4_frame is untouched, the map stays
empty. Dead-code allowances are scoped to the new items and will be
removed once tasks 1.2/1.3 wire them in.
Add the SynSent state to TcpNatState for host-initiated (port-forward)
connections.  When handle_tcp_frame sees SYN+ACK on a SynSent entry it
sends an ACK to the guest, advances our_seq, records guest_ack, and
transitions to Established — completing the inbound 3-way handshake.

Add #[cfg(test)] helpers on SlirpBackend (insert_synthetic_synsent_entry,
tcp_flow_state, injected_plain_ack_count) and a unit test
tcp_inbound_syn_ack_completes_handshake that seeds a SynSent entry,
feeds a guest SYN-ACK, and asserts (a) state → Established and (b)
one plain ACK queued for injection.  The full E2E contract is deferred
to task 5.5b.5 (tcp_port_forward_inbound in tests/network_baseline.rs).

build_tcp_packet_static signature: (src_ip, dst_ip, src_port, dst_port,
seq, ack, control, payload).  The inbound ACK uses src=SLIRP_GATEWAY_IP,
dst=SLIRP_GUEST_IP, src_port=key.dst_port (high port), dst_port=
key.guest_src_port.
Adds a divan bench for the SynSent → Established state-machine path
introduced in 5.5b.1.  The bench seeds one synthetic SynSent entry,
feeds a SYN-ACK frame to process_guest_frame, and measures the
transition cost (~42 µs median, same order as process_syn).

Approach (option a): widen the three #[cfg(test)] helpers on
SlirpBackend to #[cfg(any(test, feature = "bench-helpers"))].
insert_synthetic_synsent_entry is promoted to `pub` within the gated
impl block so the bench binary (a separate compilation unit) can call
it.  The feature is never enabled in production builds.  All helpers
in benches/network.rs that are only needed under bench-helpers are
gated with #[cfg(feature = "bench-helpers")] to keep the default bench
binary warning-free.

Copilot AI left a comment


Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.



Comment thread benches/network.rs
Comment on lines +673 to +680
let mut table: HashMap<usize, u32> = HashMap::with_capacity(n);
// Insert phase
for (i, _key) in keys.iter().enumerate() {
table.insert(i, i as u32);
}
// Remove phase
for i in 0..n {
divan::black_box(table.remove(&i));

Copilot AI Apr 30, 2026


flow_table_insert_remove pre-builds a keys vector but never uses it; the bench actually measures HashMap<usize, u32> insert/remove with integer keys. That doesn’t reflect the stated intent (unified flow-table key hashing/dispatch) and makes the result hard to interpret. Either remove the unused keys construction or use a key type representative of the real flow table (e.g., the actual FlowKey or a synthetic struct with similar hashing cost).

Suggested change:

    // before
    let mut table: HashMap<usize, u32> = HashMap::with_capacity(n);
    // Insert phase
    for (i, _key) in keys.iter().enumerate() {
        table.insert(i, i as u32);
    }
    // Remove phase
    for i in 0..n {
        divan::black_box(table.remove(&i));

    // after
    let mut table: HashMap<smoltcp::wire::IpAddress, u32> = HashMap::with_capacity(n);
    // Insert phase
    for (i, key) in keys.iter().copied().enumerate() {
        table.insert(key, i as u32);
    }
    // Remove phase
    for key in keys.iter() {
        divan::black_box(table.remove(key));

Comment thread src/devices/virtio_net.rs
Comment on lines +663 to 670
// Drain backend frames into the reused scratch buffer.
self.rx_scratch.clear();
{
let mut backend = self.slirp.lock().unwrap();
backend.drain_to_guest(&mut self.rx_scratch);
}
let frames = std::mem::take(&mut self.rx_scratch);


Copilot AI Apr 30, 2026


get_rx_frames clears rx_scratch, fills it, then mem::takes it. That replaces the field with a fresh empty Vec (capacity 0), so the next poll re-allocates and the scratch buffer is not actually reused as the comment claims. Consider draining from rx_scratch into frames/result (e.g., drain(..)), or swapping while preserving capacity, so the allocation is retained across calls.

Comment thread tests/network_baseline.rs
Comment on lines +647 to +650
/// Unlike `build_udp_frame` (which carries a pre-existing off-by-one in
/// the `payload_len` argument passed to `udp_repr.emit`), this helper
/// passes only the DNS payload length so the UDP `len` field is correct
/// and the stack's smoltcp parser accepts the frame.

Copilot AI Apr 30, 2026


The comment claims build_udp_frame has an off-by-one bug in the payload_len passed to udp_repr.emit, but build_udp_frame currently passes payload.len() (which is the correct smoltcp API expectation). This mismatch is likely to confuse future refactors—either fix the comment or adjust build_udp_frame if it’s actually wrong.

Suggested change:

    // before
    /// Unlike `build_udp_frame` (which carries a pre-existing off-by-one in
    /// the `payload_len` argument passed to `udp_repr.emit`), this helper
    /// passes only the DNS payload length so the UDP `len` field is correct
    /// and the stack's smoltcp parser accepts the frame.

    // after
    /// This helper passes the DNS payload length expected by `udp_repr.emit`,
    /// so the UDP `len` field is correct and the stack's smoltcp parser
    /// accepts the frame.

Comment thread tests/network_baseline.rs
Comment on lines +315 to +322
unsafe {
libc::setsockopt(
listener.as_raw_fd(),
libc::SOL_SOCKET,
libc::SO_RCVBUF,
&val as *const libc::c_int as *const libc::c_void,
std::mem::size_of::<libc::c_int>() as libc::socklen_t,
);

Copilot AI Apr 30, 2026


The setsockopt(SO_RCVBUF) call’s return value is ignored. If this fails (e.g., unsupported or restricted), the test won’t actually exercise the intended backpressure scenario but will still pass/fail based on unrelated behavior. Please check the return code (and/or last_os_error) and either assert or skip the test when the option can’t be applied.

Suggested change:

    // before
    unsafe {
        libc::setsockopt(
            listener.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_RCVBUF,
            &val as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        );

    // after
    let rc = unsafe {
        libc::setsockopt(
            listener.as_raw_fd(),
            libc::SOL_SOCKET,
            libc::SO_RCVBUF,
            &val as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        )
    };
    if rc != 0 {
        eprintln!(
            "skipping tcp_writes_more_than_256kb_succeed: failed to set SO_RCVBUF: {}",
            std::io::Error::last_os_error()
        );
        return;
Comment thread src/bin/voidbox-network-bench/main.rs Outdated
dpsoft added 6 commits April 30, 2026 11:19
Adds the host listener thread infrastructure for Phase 5.5b inbound
TCP port-forwarding. No listener is spawned yet (that is task 5.5b.4),
so there is no behavior change in this commit.

New items in src/network/slirp.rs:
- `InboundAccept` struct (pub(crate)) — channel payload from listener to net-poll
- `port_forward_listeners: Vec<JoinHandle<()>>` field — holds spawn handles
- `port_forward_shutdown: Arc<AtomicBool>` field — graceful-shutdown signal
- `pending_inbound_accepts: mpsc::Receiver<InboundAccept>` field — accept channel rx
- `accept_sender: mpsc::Sender<InboundAccept>` field — keeps channel open + test helper
- `process_pending_inbound_accepts()` — drains channel, inserts SynSent entries, queues SYNs
- `run_port_forward_listener()` — module-scope thread fn, nonblocking accept loop
- `spawn_port_forward_listeners()` — pub(crate) factory, not called until 5.5b.4
- `Drop for SlirpBackend` — sets shutdown flag, joins all listener handles
- `push_inbound_accept()` on test-only impl — injects accepts for unit tests
- `drain_to_guest` now calls `process_pending_inbound_accepts()` as step 0

TDD: test `process_pending_inbound_accepts_seeds_synsent_and_queues_syn`
written and watched fail before implementation; passes in GREEN.
17/17 network_baseline integration tests unchanged.
Call `spawn_port_forward_listeners` in `SlirpBackend::with_security`
so host listener threads are actually spawned when `nat.port_forwards`
is non-empty. The function now also returns the `Sender` end of the
accept channel so `accept_sender` (needed to keep the channel open in
tests) is sourced from the same channel pair as `pending_inbound_accepts`.

Remove the `#[allow(dead_code)]` attrs from both functions.

Unit test `with_security_spawns_listener_per_tcp_port_forward` confirms
zero threads for empty rules and one thread per TCP rule.
Adds tcp_port_forward_inbound_connect_succeeds to tests/network_baseline.rs.

The test builds a SlirpBackend with one port-forward rule (18080→8080),
drives drain_to_guest in a loop while a host thread connects to
127.0.0.1:18080, synthesizes a guest listener by responding with SYN-ACK
to the SYN the stack emits, and asserts three contract points:

1. host TcpStream::connect succeeds — listener thread (5.5b.3) is alive.
2. drain_to_guest emits a synthesized SYN to GUEST_PORT — InboundAccept
   channel + process_pending_inbound_accepts + synthesize_inbound_syn
   (5.5b.2/5.5b.3/5.5b.4) all fired.
3. drain_to_guest emits the completing ACK after our SYN-ACK — the
   SynSent → Established arm (5.5b.1) fired.

Also adds parse_tcp_to_guest_full helper (superset of parse_tcp_to_guest
that also returns src/dst ports, needed to identify the ephemeral
high_port in the synthesized SYN).

No VM, no --ignored flag, completes in ~0.1 s.
…aseline

Times the full inbound port-forward path: host TcpStream::connect →
listener thread accept() → mpsc channel → process_pending_inbound_accepts
→ synthesize_inbound_syn → drain_to_guest output. Bounded above by
PORT_FORWARD_POLL_INTERVAL (50ms).

Regressions in the inbound state machine or listener poll loop now
surface numerically against this baseline.

Copilot AI left a comment


Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.



Comment thread tests/network_baseline.rs
Comment on lines +374 to +399
// Push 1 MB in 1 KB chunks. Drain after every batch so the
// host's read thread can drain the kernel buffer and ACKs flow
// back to the guest. The new TCP-backpressure path means some
// chunks won't be ACK'd immediately; we re-send those (TCP-style
// retransmit) until they go through.
const TOTAL: usize = 1024 * 1024;
const CHUNK: usize = 1024;
let chunk = vec![b'x'; CHUNK];
let mut seq = 1001u32;
let mut acked_seq = 1001u32;
let mut saw_close = false;
let deadline = std::time::Instant::now() + std::time::Duration::from_secs(10);

while bytes_received.load(Ordering::Relaxed) < TOTAL && std::time::Instant::now() < deadline {
// Send a chunk; advance our seq.
let _ = stack.process_guest_frame(&build_tcp_frame(
SLIRP_GATEWAY_IP,
GUEST_EPHEMERAL_PORT,
host_port,
seq,
our_seq + 1,
TcpControl::Psh,
&chunk,
));
seq = seq.wrapping_add(CHUNK as u32);


Copilot AI Apr 30, 2026


In tcp_writes_more_than_256kb_succeed, the loop unconditionally increments seq after each send, but it never actually retransmits when the stack stops ACKing (backpressure). This contradicts the test’s own comment (“we re-send those”) and can make the test either flaky or fail to validate the intended behavior. Consider only advancing seq once the corresponding ACK is observed (or explicitly resend the same seq/payload when acked_seq hasn’t advanced).

Comment thread tests/network_baseline.rs Outdated
exit_code = ?output.exit_code,
stderr = output.stderr_str(),
"g2h iteration non-zero exit; skipping"
);

Copilot AI Apr 30, 2026


In measure_tcp_throughput_g2h, if the guest nc command exits non-zero you log a warning but still proceed to read the drain-thread result and may include that iteration in the mean. This can silently skew the reported throughput downward (or mask real failures). Consider continue-ing on !output.success() so only successful iterations contribute to the metric (and similarly for the bulk throughput path).

Suggested change (insert a `continue;` after the closing of the warning macro):

    // before
    );

    // after
    );
    continue;

Compares HEAD against an arbitrary baseline ref using the two bench
harnesses: divan microbenches (cargo bench --bench network) and the
VM-backed wall-clock harness (voidbox-network-bench). Emits markdown
with absolute numbers + percent deltas, suitable for PR descriptions.

Replaces the scattered /tmp/baseline-network-phase*.json files with a
reproducible single entry-point.

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.



Comment on lines +622 to +630
// The host accepts CRR_SAMPLES_PER_ITER connections, times each cycle,
// and sends results back over a channel.
let (crr_tx, crr_rx) = mpsc::channel::<Vec<Duration>>();
let sample_count = CRR_SAMPLES_PER_ITER;

std::thread::spawn(move || {
let samples = crr_echo_server(&listener, sample_count);
let _ = crr_tx.send(samples);
});

Copilot AI Apr 30, 2026


Same accept-thread leak pattern in CRR: the spawned thread blocks on listener.accept() and can persist past the iteration if the guest loop errors or no connections arrive, leaking resources and potentially stalling later runs. Use a bounded accept loop with timeout/cancellation, or accept in the main async task via a non-blocking listener.

Comment thread src/bin/voidbox-network-bench/main.rs Outdated
Comment thread scripts/bench-compare.sh

parse_divan() {
local file="$1"
LC_ALL=en_US.UTF-8 awk -F'│' '

Copilot AI Apr 30, 2026


LC_ALL=en_US.UTF-8 is not guaranteed to exist on minimal distros/CI images, which can make parse_divan fail even when awk is present. Consider using a more universally available UTF-8 locale like C.UTF-8 (with a fallback to C) or avoid locale-dependence by normalizing the µs unit in the input before parsing.

Suggested change:

    # before
    LC_ALL=en_US.UTF-8 awk -F'│' '

    # after
    local awk_locale="C"
    if command -v locale >/dev/null 2>&1; then
        if locale -a 2>/dev/null | grep -Fxq 'C.UTF-8'; then
            awk_locale='C.UTF-8'
        elif locale -a 2>/dev/null | grep -Fxq 'C.utf8'; then
            awk_locale='C.utf8'
        elif locale -a 2>/dev/null | grep -Fxq 'en_US.UTF-8'; then
            awk_locale='en_US.UTF-8'
        fi
    fi
    LC_ALL="$awk_locale" awk -F'│' '

Comment thread benches/network.rs Outdated
Comment on lines +372 to +376
let (drain_tx, drain_rx) = mpsc::channel::<(u64, Duration)>();
std::thread::spawn(move || {
let drain_result = drain_one_connection(&listener);
let _ = drain_tx.send(drain_result);
});

Copilot AI Apr 30, 2026


Same accept-thread leak pattern as measure_tcp_throughput_g2h: the spawned thread blocks on listener.accept() and can outlive the iteration on exec failure or recv_timeout, accumulating stuck threads and open FDs. Use a bounded accept with timeout/cancellation so the harness remains stable under failure conditions.

Comment on lines +504 to +509
let (echo_tx, echo_rx) = mpsc::channel::<Vec<Duration>>();

std::thread::spawn(move || {
let samples = rr_echo_server(&listener, RR_SAMPLES_PER_ITER);
let _ = echo_tx.send(samples);
});

Copilot AI Apr 30, 2026


Same issue here: the RR echo server thread blocks on listener.accept() and isn’t joined/cancelled. If the guest nc fails to connect, the thread can hang indefinitely holding the listener, which can leak threads across iterations. Prefer non-blocking accept with a deadline, or a cancellable blocking task.

dpsoft added 6 commits April 30, 2026 18:47
Reviewer (Copilot) flagged that the previous skip-on-no-reply path
masked real ICMP regressions on hosts where unprivileged ICMP works.
Probe socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once: skip only on
EPERM/EACCES (sysctl net.ipv4.ping_group_range forbids it), assert
otherwise.
C1.3: measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now
`continue` on guest nc non-zero exit so failed iterations don't skew
the reported mean.

C2.2: measure_icmp_rr_latency dropped its guest-side ping path. The
guest images intentionally omit /bin/ping (busybox-static lacks
CONFIG_FEATURE_PING_TYPE_DGRAM and SOCK_RAW would need root); the
function now returns None with a warn explaining the gap. Proper
host-driven measurement is tracked as a follow-up.
Aligns benches with the production RX path. The deprecated poll()
allocated a fresh Vec<Vec<u8>> per call; drain_to_guest appends to a
caller-owned buffer that's reused across iterations. CI now gates on
the same allocator pattern production code uses, removing avoidable
allocation overhead from the measurements.

Drops the file-level #![allow(deprecated)].

When the baseline ref pre-dates a464fc1 (Phase 5.5b.1) it doesn't have
the `bench-helpers` cargo feature, and cargo errored out, dropping the
entire baseline divan section to —.

Detect the "does not have feature" / "unknown feature" stderr signal
and retry without --features bench-helpers. Benches that exist at both
refs get real Δ%; the bench-helpers-gated ones (synthesize_inbound_syn,
tcp_inbound_syn_ack_transition) naturally remain — for baseline.

Unlocks bench-compare.sh --baseline origin/main against the full branch
history.

drain_one_connection, rr_echo_server, crr_echo_server now accept with
a deadline derived from LATENCY_RECV_TIMEOUT + slack. Previously each
spawned thread blocked forever on listener.accept() if the guest nc
never connected (exec error, network failure), holding the listener
FD across all subsequent iterations and burning thread/FD slots.

When the accept deadline lapses, the thread exits cleanly, the
listener drops, and the next iteration starts with a clean slate.

Addresses Copilot review C2.1, C2.5, C2.6.
… + epoll

Scopes the four architectural follow-ups surfaced in the
smoltcp-passt-port PR review:

- 6.1 (high)  : TCP half-close (FinWait*/CloseWait/LastAck) — silent
                data loss on shutdown(SHUT_WR) today.
- 6.2 (med-h) : Async outbound connect — vCPU thread blocked up to
                3 s on slow destinations.
- 6.3 (med)   : Window management + scaling — guest window ignored;
                advertised window hardcoded 65535.
- 6.4 (med-l) : Event-driven RX polling — replace 5 ms timer with
                epoll_wait.

Locks the observability + cross-platform + snapshot invariants from
the top-level spec. Per-subsystem TDD task lists deferred to dedicated
plans (-phase6.1.md..-phase6.4.md) written before each kicks off.
@dpsoft dpsoft changed the title WIP: SLIRP refactor — lift passt patterns into our smoltcp stack (Phases 0–3) SLIRP refactor — lift passt patterns into our smoltcp stack (Phases 0–5 + 5.5b) Apr 30, 2026
@dpsoft dpsoft marked this pull request as ready for review April 30, 2026 22:21
@dpsoft dpsoft merged commit f090868 into main May 6, 2026
22 checks passed
@dpsoft dpsoft deleted the smoltcp-passt-port-phase0 branch May 6, 2026 00:00
dpsoft added a commit that referenced this pull request May 6, 2026
…ghput vs main) (#69)

* docs(plans): smoltcp passt-pattern port — spec + Phase 0 plan

Adds two planning docs under docs/superpowers/plans/:

- 2026-04-27-smoltcp-passt-port.md (spec)
  Supersedes the 2026-04-12 network-backend-abstraction design.
  Replaces "add passt as opt-in backend" with "lift passt's design
  patterns into our smoltcp stack" — keeps observability, all-Rust
  path, single binary, cross-platform parity. Lists required skills
  for execution (rust-style, rustdoc, rust-analyzer-ssr,
  superpowers TDD/verification, repo verify/profile). Maps the work
  into 5+1 phases with per-phase plan-doc placeholders.

- 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan)
  25 bite-sized TDD tasks: correctness baseline pins, divan
  microbenches, wall-clock e2e harness, NetworkBackend trait
  extraction, SlirpStack → SmoltcpBackend rename. Includes three
  BROKEN_ON_PURPOSE assertions that flip in later phases.

* test(network): scaffold network_baseline pins with frame helpers

* test(network): address review — restore reserved constants, alias IpAddress

* test(network): pin TCP handshake SYN-ACK emission

* test(network): pin TCP guest↔host data round-trip

* test(network): BROKEN_ON_PURPOSE pin — 256 KB to_host cliff

* test(network): hoist inline `use` statements to module scope

* test(network): pin TCP rate limit, concurrent cap, deny list

* test(network): pin ARP reply behavior for gateway and subnet

* test(network): pin DNS resolution and cache xid-rewrite

Add two baseline tests for the smoltcp DNS proxy:
- dns_query_resolves: sends a query for example.com, polls ≤20×100ms,
  asserts reply XID matches.
- dns_cache_keys_by_question_not_xid: warms cache with xid=1, then
  queries with xid=2 and asserts the stack rewrites the reply XID.

Both tests skip gracefully (eprintln + early return) when the upstream
resolver is unreachable, making them safe in offline CI.

Also adds QNAME_EXAMPLE_COM const and two module-scope helpers:
build_dns_query (builds a correct UDP DNS frame with proper payload_len)
and parse_dns_reply_xid.

SLIRP_DNS_IP added to the existing module-scope slirp import.

* test(network): fix build_udp_frame payload_len double-count

* test(network): BROKEN_ON_PURPOSE pin — UDP non-DNS dropped

* test(network): BROKEN_ON_PURPOSE pin — ICMP echo dropped

* bench(network): divan microbenches for SLIRP hot paths

* bench(network): parametric NAT-walk scaling at 1/100/1000 flows

* bench(network): DNS cache hit and miss paths

* ci(bench): include network microbenches in regression gate

* bench(network): voidbox-network-bench binary scaffold

* bench(network): TCP throughput via busybox nc

Implement measure_tcp_throughput_g2h: binds a host-side TCP listener,
boots a VM, execs dd|nc in the guest, drains to EOF on the host, and
computes Mbps from bytes_received / elapsed.  h2g left None with a TODO.

* bench(network): TCP RR/CRR latency p50/p99

Implements measure_rr_latency and measure_crr_latency in
voidbox-network-bench, reusing the single shared VM booted for
throughput measurements.

RR: guest pipes N bytes over one persistent nc connection; host times
each read+write pair (first sample discarded to absorb connect jitter).
CRR: guest runs N independent nc invocations; host times each full
accept+read+write+close cycle. Both use the existing percentile()
helper (dead_code attribute removed). Latency measurements always run
regardless of --no-throughput.

* bench(network): UDP DNS qps and JSON report output

* docs(plans): rename SmoltcpBackend → SlirpBackend in spec + Phase 0 plan

Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp"
is the underlying library. Role-based naming keeps the public type
surface stable across library swaps and matches the symmetry of
future TapBackend / VhostNetBackend siblings.

Module file src/network/slirp.rs keeps its name (already aligned with
the new type, matches src/devices/virtio_net.rs convention).

* feat(network): introduce NetworkBackend trait

* refactor(slirp): add drain_to_guest wrapper for trait fit

* refactor(slirp): move poll body into drain_to_guest, drop alloc

The actual polling logic now lives in drain_to_guest, which writes
directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh
allocation on every tick. poll becomes a #[deprecated] shim:

    #[deprecated(note = "use drain_to_guest")]
    pub fn poll(&mut self) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        self.drain_to_guest(&mut out);
        out
    }

Existing call sites (virtio_net.rs, tests/network_baseline.rs,
benches/network.rs) are annotated with #[allow(deprecated)] and a
TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks,
after which the allow attributes can be removed.

* feat(slirp): impl NetworkBackend for SlirpStack

* refactor(virtio_net): hold dyn NetworkBackend, reuse rx buffer

Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to
Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call
in get_rx_frames with drain_to_guest into a reused rx_scratch buffer.

Update both VMM cold-boot and snapshot-restore construction sites to
coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline
tests pass; fmt and clippy clean.

* refactor(network): rename SlirpStack to SlirpBackend

Type rename only — the slirp.rs module file keeps its name.
SlirpBackend reflects the user-mode-NAT role rather than the
underlying smoltcp library, keeping naming symmetric with future
TapBackend / VhostNetBackend siblings.

* docs(plans): add Phase 1 plan (ICMP echo via SOCK_DGRAM IPPROTO_ICMP)

* feat(slirp): add IcmpEchoEntry + IPPROTO_ICMP socket helper

Introduces the types and helper needed for ICMP echo NAT (Phase 1):

- IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table.
- IcmpEchoEntry {sock, guest_id, last_activity}: per-request state.
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW).
- icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend,
  initialized to HashMap::new() in with_security() (the canonical ctor;
  new() and Default both delegate through it).

No behavior change — handle_ipv4_frame is untouched, the map stays
empty. Dead-code allowances are scoped to the new items and will be
removed once tasks 1.2/1.3 wire them in.

* refactor(slirp): hoist FromRawFd to module scope, drop redundant use libc

* feat(slirp): forward guest ICMP echo via SOCK_DGRAM IPPROTO_ICMP

* feat(slirp): relay ICMP echo replies back to guest

Add `relay_icmp_echo` which drains replies from each active ICMP echo
socket and injects Ethernet/IPv4/ICMP echo-reply frames back into the
guest.  Add `build_icmp_echo_reply_to_guest` which parses the raw ICMP
payload from the `SOCK_DGRAM IPPROTO_ICMP` socket, rewrites the ident
back to the guest's original value, and builds a complete wire frame.
Wire both into `drain_to_guest` immediately after `relay_tcp_nat_data`.

Drop the now-stale `#[allow(dead_code)]` on `IcmpEchoEntry::guest_id`
which is read by `relay_icmp_echo`.

* feat(slirp): warn-once + fallback when unprivileged ICMP forbidden

* test(network): flip ICMP pin — assert echo reply (was BROKEN_ON_PURPOSE)

Renames `icmp_echo_silently_dropped` → `icmp_echo_returns_reply`.
Targets 127.0.0.1 (loopback), polls 20 × 50ms for the reply, and
skips via eprintln! if sysctl forbids unprivileged ICMP — consistent
with how `dns_query_resolves` handles offline environments.

* bench(network): populate ICMP RR latency p50

Add measure_icmp_rr_latency() to voidbox-network-bench. Runs busybox
ping -c <count> -W 1 -i 0.05 8.8.8.8 inside the guest, parses
time=<ms> fields, converts to microseconds, and returns the p50 median.
Falls back to None + WARN on non-zero exit or empty parse (unreachable
network). Wired into main after measure_dns_qps; always runs regardless
of --no-throughput.

Also: allow unprivileged ICMP sockets in guest-agent (ping_group_range)
and add ping + setuid busybox to the test initramfs build.

* fix(scripts): revert setuid busybox in test image (Phase 1.6 regression)

Phase 1.6 (commit 8572122) added `chmod u+s "$OUT_DIR/bin/busybox"` to
let busybox `ping` open SOCK_RAW. The unintended consequence: cpio is
packed as the build user (uid 1000), so the kernel drops euid to 1000
on every execve from PID 1.

In `guest-agent::setup_network`, that meant `ip link up`, `ip addr
replace`, and `udhcpc` all silently failed with EPERM (no CAP_NET_ADMIN).
The static-fallback loop wasted 10s of boot time. Combined with the
vsock listener creation retry, total guest-agent startup exceeded the
host's 30s control-channel handshake deadline → ECONNRESET on every
connect → `voidbox-network-bench` and any test using `network(true)`
failed with `control_channel: deadline reached`.

Verification:
  - With setuid: bench fails consistently after 122 connect attempts.
  - Without setuid: bench produces clean numbers matching Phase 0
    baseline (TCP RR p50=2µs, CRR p50=10176µs, DNS qps=0.5).

The `ping` symlink is also dropped because busybox-static on Fedora is
not built with CONFIG_FEATURE_PING_TYPE_DGRAM, so unprivileged ICMP
is unavailable to the guest applet regardless. ICMP measurement in
voidbox-network-bench now reports `null` cleanly ("ping: not found")
until we route ICMP RR through SLIRP from the host instead.

The companion `ping_group_range` write in guest-agent stays — it's
harmless and supports future tooling that uses SOCK_DGRAM IPPROTO_ICMP.

* docs(plans): add Phase 2 plan (generalize UDP via per-flow connected sockets)

* feat(slirp): add UdpFlowEntry + per-flow connected socket helper

* feat(slirp): forward non-DNS UDP via per-flow connected sockets

Implement Task 2.2: route all guest UDP through handle_udp_frame, which
creates/reuses a per-flow connected UdpSocket keyed on (guest_src_port,
dst_ip, dst_port). DNS to SLIRP_DNS_IP still dispatches to the existing
handle_dns_frame. SLIRP_GATEWAY_IP (10.0.2.2) is translated to 127.0.0.1
before connect(), matching the TCP NAT path.

Drop #[allow(dead_code)] from UdpFlowEntry (item-level), open_udp_flow_socket,
and the udp_flows field — all now consumed. Add a field-targeted
#[allow(dead_code)] on last_activity (written here, read by Task 2.4).

Flip the udp_non_dns_silently_dropped BROKEN_ON_PURPOSE pin: datagrams
now reach the bound host socket, confirming the guest→host send path works.
All 14 baseline tests pass.

* ci(bench): add strict voidbox-network-bench step (no continue-on-error)

Catches setuid-busybox-style regressions that masquerade as
"environment flakes". Specifically: the bug fixed at 77dfc67
(Phase 1.6 added `chmod u+s busybox`, dropping PID 1 euid → no
CAP_NET_ADMIN → setup_network silently fails → 30s handshake
deadline expires → ECONNRESET) would have been visible in CI from
the start if the wall-clock harness step weren't behind
`continue-on-error: true`.

This new step runs `voidbox-network-bench --iterations 3` and
publishes the JSON metrics to the step summary. Failure of the
harness fails the workflow — no masking. The existing
`voidbox-startup-bench` step keeps `continue-on-error` for now
because its warm-restore phase has a separate, unfixed issue
(`control_channel[multiplex-establish]: deadline reached`
reproducible on main); flipping that to strict belongs in the
PR that fixes the warm-restore handshake.

Vhost-vsock probe still gates the run via `/dev/vhost-vsock`
existence check — runners without it skip cleanly with a warning,
since absence-of-device is an environment fact, not a regression.

* feat(slirp): relay UDP flow replies back to guest

Add `relay_udp_flows` and `build_udp_reply_to_guest` to `SlirpBackend`.
Each active UDP flow socket is polled non-blocking on every
`drain_to_guest` tick; replies are wrapped in an Ethernet/IPv4/UDP
frame (src=original-dst, dst=guest) and pushed into `inject_to_guest`.
Wire the call into `drain_to_guest` after `relay_icmp_echo`. Also add
`UdpRepr` to the smoltcp wire imports and drop the now-consumed
`#[allow(dead_code)]` on `UdpFlowEntry::last_activity`.

* feat(slirp): UDP flow idle reap (60s)

* test(network): full RTT for UDP pin (was BROKEN_ON_PURPOSE one-way)

Rename udp_non_dns_silently_dropped → udp_non_dns_round_trips and
rewrite the body to verify the complete guest→host→guest round-trip
via the per-flow connected-socket NAT landed in Tasks 2.1–2.4.

* bench(network): document DNS qps busybox-nc bottleneck (set null + WARN)

busybox nc -u -w1 blocks for the full 1-second timeout after stdin EOF
even when the cached SLIRP reply arrives in microseconds, capping
throughput at ~1 qps.  Tighter flags tried: -q0 exits before the reply
arrives (0 successes); /dev/udp/ is bash-only; timeout(1) is absent
from the test initramfs.

Report udp_dns_qps as null with a WARN pointing to the host-side UDP
socket path as the correct future fix.  Also removes the now-dead
DNS_QPS_WINDOW_SECS and SLIRP_DNS_ADDR constants.

* fix(startup-bench): require userspace vsock backend for snapshot capture

The bench's `capture_snapshot` was building a Sandbox without
`.enable_snapshots(true)`, so the backend selector at
`backend/kvm.rs:212` chose `VsockBackendType::Vhost` (lower per-RPC
latency for cold-only runs). The `create_auto_snapshot` call then
captured a vhost-shaped snapshot. But `from_snapshot` always restores
into `VsockBackendType::Userspace` — a path that knows how to
re-program our process-local vring state, while vhost's vring state
lives in the host kernel's `vhost-vsock` module and isn't part of
the snapshot at all. Result: the restored userspace device has
half-blank state, never accepts connections from the host, every
connect attempt is RST'd by the guest kernel, and the multiplex
handshake hits its 30s deadline.

Symptom across CI and local Fedora bare-metal:
  control_channel[multiplex-establish]: deadline reached after 123
    connect/handshake attempts
  Error: Guest("control_channel: deadline reached")

This same failure was visible in CI run 24983657846 on main (April 27,
before any of the SLIRP refactor work) — masked by `continue-on-error:
true` on the wall-clock harness step. Removing both: the fix and the
CI mask, so a regression of this exact shape would now fail the
workflow.

Verified locally: `voidbox-startup-bench --iters 3 --breakdown` now
exits 0 with `warm.total p50 = 82ms` (well within the CHANGELOG's
138ms target). Cold phase numbers unchanged (~245ms p50).

Refs:
- backend/kvm.rs:205-216 (the backend selector)
- CHANGELOG.md:74 ("Snapshot/Restore for KVM ... userspace virtio-vsock backend")
- AGENTS.md:1185 ("snapshot_integration ... Uses userspace virtio-vsock backend")

* docs(plans): add Phase 3 plan (TCP relay rewrite via MSG_PEEK + sequence mirroring)

* docs(plans): lock observability as a hard non-negotiable invariant

Per user pushback ("the improvements based on passt, maintain our
differentiator of full observability on the SLIRP implementation,
that is a must?") — yes, and it should be stated explicitly, not
assumed.

Spec gets a "Hard invariant — observability" section right after the
motivation. Phase 3 plan gets a "Non-negotiable invariants" block
that codifies what every task in the high-risk TCP-relay rewrite
must preserve:

- All-Rust, no opaque-process boundary; libc syscalls are fine.
- tracing instrumentation at every state transition (peek, ACK
  consume, close); new code must add new events for new state.
- cargo-test-driveable behavior via tests/network_baseline.rs.
- Standard Rust tooling (LSP, clippy, profiler) keeps working.

Future phases inherit the spec-level invariant; their per-phase
plans will reiterate the task-level acceptance criteria.

* refactor(slirp): add bytes_in_flight to TcpNatEntry (no behavior change)

Scaffolding for Task 3.3/3.4: tracks bytes sent to guest but not yet
ACK'd. Initialized to 0 at all construction sites; dead_code suppressed
until Task 3.3 consumes it.

* refactor(slirp): add recv_peek helper using libc::recv MSG_PEEK

* refactor(slirp): peek-based host→guest TCP relay (drops to_guest dependency)

Replace the read-into-to_guest + drain loop in relay_tcp_nat_data with
a MSG_PEEK-based send path. recv_peek() peeks the kernel's recv buffer
without consuming it; only the bytes past bytes_in_flight are chunked
into TCP segments and injected toward the guest. our_seq and
bytes_in_flight advance as segments are sent; the kernel buffer holds
the data until Task 3.4's ACK-driven read() consumes it.

Remove #[allow(dead_code)] from recv_peek and bytes_in_flight (both
now consumed). Add #[allow(dead_code)] to to_guest (still on struct;
Task 3.6 removes it). Drop unused Read import.

Tracing: trace! per peek+send cycle, debug! on host EOF, warn! on
recv_peek errors.

* refactor(slirp): ACK-driven consume from kernel socket

When the guest ACKs data we relayed via the peek-based host→guest path
(Task 3.3), read() those bytes from the host_stream to advance the
kernel's recv buffer read pointer.  Without this the kernel buffer fills
up and recv_peek keeps returning the same already-sent bytes.

Logic in handle_tcp_frame, Established branch:
- Extract segment_ack from tcp.ack_number().0 as u32.
- Compute last_sent_acked = our_seq.wrapping_sub(bytes_in_flight).
- acked_bytes = segment_ack.wrapping_sub(last_sent_acked) — wrapping
  arithmetic throughout because TCP sequence numbers wrap at 2^32.
- Guard: only consume when acked_bytes > 0 && <= bytes_in_flight,
  defending against duplicate/spurious/malformed guest ACKs.
- Drain via read() loop into a stack sink; decrement bytes_in_flight by
  the actual drained count, not the claimed acked_bytes.
- tracing::trace! on each consume; tracing::warn! + Closed on read error.

All 14 network_baseline tests pass including tcp_data_round_trip.

* refactor(slirp): drop to_host buffer + 256KB cliff, use TCP backpressure

Replace the guest→host write path with don't-ACK-on-EAGAIN: on WouldBlock,
skip the ACK and let the guest TCP retransmit once the kernel send buffer
drains. Remove the MAX_TO_HOST_BUFFER constant (256 KB cap), the overflow-
close branch, and the relay_tcp_nat_data to_host flush block. Mark to_host
and to_host_pending_ack dead_code pending Task 3.6 cleanup.

* refactor(slirp): drop to_guest/to_host/pending_ack fields and dead helpers

Remove three dead fields from TcpNatEntry that were superseded by the
passt-style peek+ACK+backpressure paths added in Tasks 3.3–3.5:
- to_guest: Vec<u8> (replaced by recv(MSG_PEEK)-based send in 3.3)
- to_host: Vec<u8> (replaced by direct write + don't-ACK-on-WouldBlock in 3.5)
- to_host_pending_ack: Option<u32> (replaced by ACK on n_written in 3.5)

Remove the three matching initializer sites from the TcpNatEntry constructor.
Update the file-level Architecture doc and the bytes_in_flight field comment
to reflect the Phase 3 design (no userspace buffers; kernel socket buffer
holds outstanding data).

* test(network): flip 256KB cliff pin — assert >1MB throughput succeeds

Renames tcp_to_host_buffer_drops_at_256kb → tcp_writes_more_than_256kb_succeed
and rewrites the body to assert the Phase 3 positive contract: pushing 1 MB
through the relay succeeds with no RST/FIN mid-stream. Updates the file-level
BROKEN_ON_PURPOSE inventory accordingly.

* bench(network): tcp_bulk_throughput_1mb — measures post-Phase-3 backpressure path

Adds a divan microbench that pushes 1 MiB through the SLIRP relay under
a constrained host receiver (SO_RCVBUF=4096), forcing the passt-style
sequence-mirroring + don't-ACK-on-EAGAIN backpressure path on every
iteration. Reports throughput in MB/s via BytesCount so regressions are
numerically visible. Mirrors the 95%-delivery threshold from the
tcp_writes_more_than_256kb_succeed contract test. ~61 ms median /
~17 MB/s on this host; 10 samples complete in well under 60 s total.

* bench(network): --bulk-mb mode to measure post-Phase-3 backpressure

Adds a guest→host throughput measurement that pins the host listener's
SO_RCVBUF to 4096 before accept(). The constrained receiver forces
TCP-level backpressure to engage during the transfer: the SLIRP relay's
non-blocking write to host_stream returns WouldBlock, the relay
declines to ACK the segment, and the guest retransmits — exercising
the don't-ACK-on-EAGAIN path that Phase 3 introduced. Pre-Phase-3 the
same scenario hit the 256 KB userspace cliff and reset the connection
mid-transfer; post-Phase-3 the bytes go through.

Smoke run on this host (Fedora 43 / KVM / slim x86_64):
  bulk-g2h[ 0]: 10485760 B in 0.429s = 1565.6 Mbps (constrained receiver)

Compare to the unconstrained tcp_throughput_g2h_mbps (~1885 Mbps) —
the ~17% reduction is the backpressure cost. The metric is opt-in
(--bulk-mb 0 by default) so it doesn't slow down standard runs.

Companion to the divan microbench tcp_bulk_throughput_1mb (commit
5fe4316) that exercises the same path at the unit level. The wall-clock
metric is what we'd compare against passt+qemu in the future
side-by-side run.

* docs(plans): add Phase 4 plan (unified flow table refactor)

* refactor(slirp): define FlowKey + FlowEntry enums (no callers yet)

Add Copy to NatKey (all fields are trivially copyable: u16, Ipv4Address,
u16) and clean up three clone_on_copy sites that clippy now catches.

Introduce FlowKey and FlowEntry alongside the existing per-protocol types;
both are marked #[allow(dead_code)] until Task 4.2 wires the unified
flow_table field.

* fix(ci): non-Linux stubs for benches/network.rs + voidbox-network-bench

Both files used file-level `#![cfg(target_os = "linux")]`, which on
macOS produces an empty crate with no `main()` → E0601. Caught by
PR #68's macOS CI lanes (Lint, MSRV, Test, E2E VZ).

Fix mirrors `benches/startup.rs`: keep the `main()` shape unconditional
and gate only the SLIRP-using imports + body. The smoltcp dep is
already `cfg(target_os = "linux")` in Cargo.toml, so the Linux-only
items genuinely can't compile on macOS — wrapping them in a Linux-only
module is the cleanest way to keep the cfg gating in one place.

- `benches/network.rs`: `mod linux_benches { ... }` wraps every helper
  and `#[divan::bench]`. Top-level `fn main()` calls `divan::main()` on
  Linux and prints a skip notice elsewhere.
- `src/bin/voidbox-network-bench/main.rs`: `mod linux_main { ... }`
  wraps everything from `TRANSFER_MB` to the bottom of the file.
  Top-level provides two cfg-gated `fn main()` shapes — Linux delegates
  to `linux_main::main_impl()`, non-Linux prints a skip notice.

Linux validation:
  - cargo fmt --check: clean
  - cargo clippy -D warnings: clean
  - cargo test --test network_baseline: 14/14

* docs(plans): add three Phase 4 benches (mixed flows, per-protocol, table ops)

* refactor(slirp): add flow_table field on SlirpBackend (parallel to existing maps)

* refactor(slirp): migrate ICMP path to flow_table

Replace all self.icmp_echo accesses (5 sites: entry insert/lookup in
handle_icmp_frame, keys iteration + get_mut + remove in relay_icmp_echo)
with self.flow_table keyed on FlowKey::IcmpEcho.

Drop the icmp_echo field and its HashMap::new() initializer.
Drop #[allow(dead_code)] from FlowKey, FlowEntry, and flow_table;
add variant-level #[allow(dead_code)] on Tcp/Udp variants that are
consumed in tasks 4.4 and 4.5.

All 14 network_baseline pins pass; fmt + clippy clean.

* refactor(slirp): migrate UDP path to flow_table

Replace all self.udp_flows accesses with self.flow_table keyed on
FlowKey::Udp / FlowEntry::Udp, following the same pattern as the
ICMP migration in 4.3. Drop the udp_flows field and its HashMap::new()
initializer. Remove #[allow(dead_code)] from FlowKey::Udp and
FlowEntry::Udp now that both variants are consumed.

* refactor(slirp): migrate TCP path to flow_table

Move all 9 self.tcp_nat accesses to self.flow_table under FlowKey::Tcp /
FlowEntry::Tcp.  Drop the tcp_nat field and its HashMap::new() initializer.
Remove #[allow(dead_code)] from FlowKey::Tcp and FlowEntry::Tcp now that
both variants are actively consumed.

Max-concurrent check now counts FlowKey::Tcp entries in flow_table.
relay_tcp_nat_data collects TCP flow keys then iterates with get_mut,
matching the established ICMP/UDP patterns from 4.3–4.4.

All 14 network_baseline tests pass; tcp_bulk_throughput_1mb bench: 17.06 MB/s.

* refactor(slirp): update Phase 4 doc header for unified flow table

* bench(network): poll_with_n_mixed_flows — mixed TCP/UDP/ICMP at scale

* bench(network): process_udp_frame + process_icmp_echo_request

* bench(network): add flow_table_insert_remove synthetic microbench

Pure-compute baseline for Phase 4's unified HashMap<FlowKey, FlowEntry>.
Measures insert + remove throughput on n=[10, 100, 1000] entries using
synthetic u32 values, isolating HashMap mechanics from socket overhead.
Phase 5+ reference number for hasher experiments (foldhash, ahash, SipHash)
or container-shape changes (hashbrown raw API).

Uses proxy data (usize -> u32 map) instead of real TcpNatEntry to avoid
socket cloning cost per insert — the bench goal is HashMap cost, not socket ops.

* docs(plans): add Phase 5 plan (stateless NAT + port forwarding)

* feat(network): add nat.rs with stateless translate_outbound (no callers yet)

Pure types (Rules, PortForward, ForwardProto) and translate_outbound function
that maps guest destination addresses to host SocketAddrs. No per-flow state;
deny-list check beats gateway-loopback rewrite. Doc-test + unit tests included.

* refactor(slirp): add nat::Rules field on SlirpBackend (parallel to deny_list)

* refactor(slirp): TCP path uses nat::translate_outbound

Replace the two inline operations in the SYN branch of handle_tcp_frame
with a single nat::translate_outbound call:
  - the SLIRP_GATEWAY_IP → 127.0.0.1 rewrite
  - the deny-list iteration (previously via is_denied)

The RST-emission shape and warn! event are preserved verbatim. Drop the
now-callerless is_denied method; add #[allow(dead_code)] to deny_list
(still held for task 5.4 which migrates UDP and then drops the field).

* refactor(slirp): UDP path uses nat::translate_outbound, drop deny_list field

* refactor(slirp): plumb port_forwards from NetworkConfig into nat::Rules

Add `port_forwards: &[(u16, u16)]` to `SlirpBackend::with_security`.
Each tuple is mapped to `nat::PortForward { proto: ForwardProto::Tcp, .. }`
and stored in `nat::Rules.port_forwards`.  `SlirpBackend::new()` passes
`&[]` as before.

The cold-boot VMM construction site (`src/vmm/mod.rs`) also passes `&[]`
with a TODO(5.5b) comment — `VoidBoxConfig` does not yet carry
`port_forwards`, so wiring the real slice is deferred to sub-task B.
The snapshot-restore site calls `SlirpBackend::new()` and is unaffected.

No relay code reads `nat.port_forwards`; no host listeners are spawned.
Sub-task B (5.5b) will add the actual TcpListener-per-rule logic.

All 14 network_baseline tests pass. fmt + clippy clean.

* test(network): pin nat::translate_outbound (loopback, external, deny)

* bench(network): nat_translate_outbound_hot_path — Phase 5 baseline

* feat(slirp): TcpNatState::SynSent + handle inbound SYN-ACK

Add the SynSent state to TcpNatState for host-initiated (port-forward)
connections.  When handle_tcp_frame sees SYN+ACK on a SynSent entry it
sends an ACK to the guest, advances our_seq, records guest_ack, and
transitions to Established — completing the inbound 3-way handshake.

Add #[cfg(test)] helpers on SlirpBackend (insert_synthetic_synsent_entry,
tcp_flow_state, injected_plain_ack_count) and a unit test
tcp_inbound_syn_ack_completes_handshake that seeds a SynSent entry,
feeds a guest SYN-ACK, and asserts (a) state → Established and (b)
one plain ACK queued for injection.  The full E2E contract is deferred
to task 5.5b.5 (tcp_port_forward_inbound in tests/network_baseline.rs).

build_tcp_packet_static signature: (src_ip, dst_ip, src_port, dst_port,
seq, ack, control, payload).  The inbound ACK uses src=SLIRP_GATEWAY_IP,
dst=SLIRP_GUEST_IP, src_port=key.dst_port (high port), dst_port=
key.guest_src_port.

* bench(network): tcp_inbound_syn_ack_transition — Phase 5.5b.1 microbench

Adds a divan bench for the SynSent → Established state-machine path
introduced in 5.5b.1.  The bench seeds one synthetic SynSent entry,
feeds a SYN-ACK frame to process_guest_frame, and measures the
transition cost (~42 µs median, same order as process_syn).

Approach (option a): widen the three #[cfg(test)] helpers on
SlirpBackend to #[cfg(any(test, feature = "bench-helpers"))].
insert_synthetic_synsent_entry is promoted to `pub` within the gated
impl block so the bench binary (a separate compilation unit) can call
it.  The feature is never enabled in production builds.  All helpers
in benches/network.rs that are only needed under bench-helpers are
gated with #[cfg(feature = "bench-helpers")] to keep the default bench
binary warning-free.

* feat(slirp): add synthesize_inbound_syn helper for port-forwarding

* bench(network): synthesize_inbound_syn pure-compute (Phase 5.5b.2.b)

* feat(slirp): port-forward listener thread implementation (not wired yet)

Adds the host listener thread infrastructure for Phase 5.5b inbound
TCP port-forwarding. No listener is spawned yet (that is task 5.5b.4),
so there is no behavior change in this commit.

New items in src/network/slirp.rs:
- `InboundAccept` struct (pub(crate)) — channel payload from listener to net-poll
- `port_forward_listeners: Vec<JoinHandle<()>>` field — holds spawn handles
- `port_forward_shutdown: Arc<AtomicBool>` field — graceful-shutdown signal
- `pending_inbound_accepts: mpsc::Receiver<InboundAccept>` field — accept channel rx
- `accept_sender: mpsc::Sender<InboundAccept>` field — keeps channel open + test helper
- `process_pending_inbound_accepts()` — drains channel, inserts SynSent entries, queues SYNs
- `run_port_forward_listener()` — module-scope thread fn, nonblocking accept loop
- `spawn_port_forward_listeners()` — pub(crate) factory, not called until 5.5b.4
- `Drop for SlirpBackend` — sets shutdown flag, joins all listener handles
- `push_inbound_accept()` on test-only impl — injects accepts for unit tests
- `drain_to_guest` now calls `process_pending_inbound_accepts()` as step 0

TDD: test `process_pending_inbound_accepts_seeds_synsent_and_queues_syn`
written and watched fail before implementation; passes in GREEN.
17/17 network_baseline integration tests unchanged.

* feat(slirp): wire spawn_port_forward_listeners from with_security

Call `spawn_port_forward_listeners` in `SlirpBackend::with_security`
so host listener threads are actually spawned when `nat.port_forwards`
is non-empty. The function now also returns the `Sender` end of the
accept channel so `accept_sender` (needed to keep the channel open in
tests) is sourced from the same channel pair as `pending_inbound_accepts`.

Remove the `#[allow(dead_code)]` attrs from both functions.

Unit test `with_security_spawns_listener_per_tcp_port_forward` confirms
zero threads for empty rules and one thread per TCP rule.

* test(network): tcp_port_forward_inbound — Phase 5.5b e2e contract

Adds tcp_port_forward_inbound_connect_succeeds to tests/network_baseline.rs.

The test builds a SlirpBackend with one port-forward rule (18080→8080),
drives drain_to_guest in a loop while a host thread connects to
127.0.0.1:18080, synthesizes a guest listener by responding with SYN-ACK
to the SYN the stack emits, and asserts three contract points:

1. host TcpStream::connect succeeds — listener thread (5.5b.3) is alive.
2. drain_to_guest emits a synthesized SYN to GUEST_PORT — InboundAccept
   channel + process_pending_inbound_accepts + synthesize_inbound_syn
   (5.5b.2/5.5b.3/5.5b.4) all fired.
3. drain_to_guest emits the completing ACK after our SYN-ACK — the
   SynSent → Established arm (5.5b.1) fired.

Also adds parse_tcp_to_guest_full helper (superset of parse_tcp_to_guest
that also returns src/dst ports, needed to identify the ephemeral
high_port in the synthesized SYN).

No VM, no --ignored flag, completes in ~0.1 s.

* bench(network): port_forward_accept_latency — Phase 5.5b wall-clock baseline

Times the full inbound port-forward path: host TcpStream::connect →
listener thread accept() → mpsc channel → process_pending_inbound_accepts
→ synthesize_inbound_syn → drain_to_guest output. Bounded above by
PORT_FORWARD_POLL_INTERVAL (50ms).

Regressions in the inbound state machine or listener poll loop now
surface numerically against this baseline.

* chore(bench): add scripts/bench-compare.sh — phase comparison report

Compares HEAD against an arbitrary baseline ref using the two bench
harnesses: divan microbenches (cargo bench --bench network) and the
VM-backed wall-clock harness (voidbox-network-bench). Emits markdown
with absolute numbers + percent deltas, suitable for PR descriptions.

Replaces the scattered /tmp/baseline-network-phase*.json files with a
reproducible single entry-point.

* test(network): icmp_echo_returns_reply — probe + assert, no silent skip

Reviewer (Copilot) flagged that the previous skip-on-no-reply path
masked real ICMP regressions on hosts where unprivileged ICMP works.
Probe socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) once: skip only on
EPERM/EACCES (sysctl net.ipv4.ping_group_range forbids it), assert
otherwise.

* fix(network-bench): skip failed iterations + drop guest-ping ICMP path

C1.3: measure_tcp_throughput_g2h and measure_bulk_throughput_g2h now
`continue` on guest nc non-zero exit so failed iterations don't skew
the reported mean.

C2.2: measure_icmp_rr_latency dropped its guest-side ping path. The
guest images intentionally omit /bin/ping (busybox-static lacks
CONFIG_FEATURE_PING_TYPE_DGRAM and SOCK_RAW would need root); the
function now returns None with a warn explaining the gap. Proper
host-driven measurement is tracked as a follow-up.

* bench(network): migrate from deprecated .poll() to drain_to_guest()

Aligns benches with the production RX path. The deprecated poll()
allocated a fresh Vec<Vec<u8>> per call; drain_to_guest appends to a
caller-owned buffer that's reused across iterations. CI now gates on
the same allocator pattern production code uses, removing avoidable
allocation overhead from the measurements.

Drops the file-level #![allow(deprecated)].

* chore(bench): bench-compare.sh — fall back without bench-helpers feature

When the baseline ref pre-dates a464fc1 (Phase 5.5b.1) it doesn't have
the `bench-helpers` cargo feature, and cargo errored out, dropping the
entire baseline divan section to —.

Detect the "does not have feature" / "unknown feature" stderr signal
and retry without --features bench-helpers. Benches that exist at both
refs get real Δ%; the bench-helpers-gated ones (synthesize_inbound_syn,
tcp_inbound_syn_ack_transition) naturally remain — for baseline.

Unlocks bench-compare.sh --baseline origin/main against the full branch
history.

* fix(network-bench): bound accept-thread lifetimes with deadlines

drain_one_connection, rr_echo_server, crr_echo_server now accept with
a deadline derived from LATENCY_RECV_TIMEOUT + slack. Previously each
spawned thread blocked forever on listener.accept() if the guest nc
never connected (exec error, network failure), holding the listener
FD across all subsequent iterations and burning thread/FD slots.

When the accept deadline lapses, the thread exits cleanly, the
listener drops, and the next iteration starts with a clean slate.

Addresses Copilot review C2.1, C2.5, C2.6.

* docs: Phase 6 overview plan — TCP lifecycle + async connect + windows + epoll

Scopes the four architectural follow-ups surfaced in the
smoltcp-passt-port PR review:

- 6.1 (high)  : TCP half-close (FinWait*/CloseWait/LastAck) — silent
                data loss on shutdown(SHUT_WR) today.
- 6.2 (med-h) : Async outbound connect — vCPU thread blocked up to
                3 s on slow destinations.
- 6.3 (med)   : Window management + scaling — guest window ignored;
                advertised window hardcoded 65535.
- 6.4 (med-l) : Event-driven RX polling — replace 5 ms timer with
                epoll_wait.

Locks the observability + cross-platform + snapshot invariants from
the top-level spec. Per-subsystem TDD task lists deferred to dedicated
plans (-phase6.1.md..-phase6.4.md) written before each kicks off.

* docs: Phase 6.4 detailed TDD plan — epoll-driven RX

14 bite-sized tasks covering the event-driven RX rewrite:

- Task 1: drain_n migration + tcp_writes_more_than_256kb_succeed
  retransmit-test fix-up (Copilot C1.1; correctly classified as
  VALID after re-reading; the previous "stale" verdict missed
  that seq advances unconditionally on line 398).
- Task 2: BROKEN_ON_PURPOSE pin tcp_rx_latency_sub_5ms.
- Tasks 3-6: EpollDispatch module — epoll_create1, register/
  unregister, wait_with_timeout, self-pipe Waker.
- Task 7: SlirpBackend holds EpollDispatch + Waker.
- Tasks 8-9: TCP/UDP/ICMP register-on-insert + unregister-on-remove.
- Task 10: relay_*_data dispatch by readiness instead of full-table
  iteration.
- Task 11: net_poll_thread rewritten to epoll_wait(50ms);
  BROKEN_ON_PURPOSE pin flips to passing.
- Task 12: snapshot rebuild path (epoll_fd doesn't survive snapshot).
- Task 13: tcp_rx_latency_one_packet bench + perf gate.
- Task 14: full validation contract.

Hard perf gate: scripts/bench-compare.sh --baseline origin/main
must show every comparable bench at HEAD ≤ baseline + 5%, AND
tcp_rx_latency_us_p50 < 5ms (was bounded below by 5ms timer
cycle), AND port_forward_accept_latency improves by ≥30% (or
documented exception).

Lives on branch smoltcp-passt-port-phase6.4-epoll, cut from PR #68's
tip (47868f0). Will become its own PR once the plan executes.

* test(network): drain_n via drain_to_guest + real retransmit in 256kb test

Two test-harness improvements landing together since both block the
Phase 6.4 RX-latency work:

- drain_n migrated from deprecated SlirpBackend::poll() to
  drain_to_guest. This was the last in-tree poll() caller.
- tcp_writes_more_than_256kb_succeed now matches its 'we re-send
  those' comment: seq only advances when acked_seq catches up,
  giving real TCP-retransmit semantics in the synthetic guest
  rather than the previous 'lossy with 95% tolerance' shape.
  Phase 6.4 must not regress this contract; making the test
  faithful first means epoll regressions surface as failures
  instead of borderline 95% misses.

* test(network): pin tcp_rx_latency_sub_5ms (BROKEN_ON_PURPOSE)

Phase 6.4 contract: host→guest RX latency must be sub-5 ms when
data is available. Pre-6.4 the floor is the 5 ms net_poll_thread
sleep cycle; this assertion fails on master and on the current
PR #68 tip. Phase 6.4's epoll dispatch will flip it to passing.

#[ignore] is deliberately NOT used: this is a positive contract,
and CI must surface the failure on master so the gate is
unmissable.

* Revert "test(network): pin tcp_rx_latency_sub_5ms (BROKEN_ON_PURPOSE)"

This reverts commit 3e47ffbd2ba9f7edee64047145bf0e6ff3a0a811.

* docs(phase6.4): drop Task 2 unit-level pin — VMM-level contract instead

Implementer caught a planning error: the 5 ms host→guest latency
floor lives in src/vmm/mod.rs:1609 (net_poll_thread sleep), not in
SlirpBackend. drain_to_guest runs synchronously from a unit test,
so the floor never materializes there — the original Task 2 test
underflowed on subtraction and would never have observed what we
care about.

The contract — host→guest must deliver in < 5 ms — is preserved as
a VM-level requirement in Task 13's wall-clock tcp_rx_latency_us_p50
metric in voidbox-network-bench. The plan's hard-perf-gate already
requires it, so no contract is lost.

Reverts 3e47ffb (the broken unit pin) and amends the plan to mark
Task 2 as DROPPED with rationale.

* feat(network): EpollDispatch skeleton with epoll_create1

Phase 6.4 foundation. One epoll_fd owned via OwnedFd + EPOLL_CLOEXEC.
No registration logic yet — Task 4 will add register/unregister and
Task 6 will add the self-pipe + wait loop.

* feat(network): EpollDispatch register/unregister

* feat(network): EpollDispatch::wait_with_timeout

* feat(network): EpollDispatch self-pipe wakeup

Cloneable Waker writes one byte to a non-blocking pipe registered
with EPOLLIN. wait_with_timeout filters self-pipe events out of
the returned set and drains the pipe so subsequent waits don't
spurious-wake.

* refactor(slirp): SlirpBackend holds Arc<Mutex<EpollDispatch>> + Waker

Start with Arc<Mutex<EpollDispatch>> directly instead of plain
EpollDispatch so Task 11's poll-thread refactor (which needs the Arc
to share the same instance across threads) lands without a second
struct-field touch.

SlirpBackend gains two new fields:
- `epoll: Arc<Mutex<EpollDispatch>>` — the epoll instance
- `epoll_waker: Waker` — cloneable write-end of the self-pipe

Both fields are unused until Task 8 wires up the register/unregister
calls; #[allow(dead_code)] suppresses clippy -D warnings in the interim.
Targeted #[allow(dead_code)] attributes on unregister, wait_with_timeout,
EpollEvent, and Waker::wake/write_end cover the same interim period.

All 18 baseline pins pass; epoll_dispatch unit tests (5/5) pass.

* feat(slirp): register TCP flows with EpollDispatch

After each TCP flow-table insertion, register the host TcpStream FD
with EpollDispatch (EPOLLIN, token = flow_token_for_tcp) and call
epoll_waker.wake() to unblock any polling thread.  Before each
removal, unregister the FD to prevent dangling epoll registrations.

Three insertion sites covered:
- process_pending_inbound_accepts: port-forward SynSent entries
- handle_tcp_frame: outbound SYN → SynReceived entries
- insert_synthetic_synsent_entry (test helper): guarded with
  #[cfg(not(test))] so unit tests skip epoll side-effects

Two removal sites covered:
- relay_tcp_nat_data to_remove loop: Closed + timeout evictions
- handle_tcp_frame stale-entry purge before re-SYN

Token layout (flow_token_for_tcp):
  bits 63-56: 0x01 (PROTO_TAG_TCP)
  bits 47-32: guest_src_port
  bits 31-16: dst_port
  bits 15-0:  low 16 bits of dst_ip

PROTO_TAG_UDP, PROTO_TAG_ICMP, flow_token_for_udp, flow_token_for_icmp
are defined here (at module scope, project convention) with
#[allow(dead_code)] — Task 9 wires them immediately after.

All 18 baseline pins pass; 5/5 epoll unit tests pass.

* feat(slirp): register UDP + ICMP flows with EpollDispatch

Mirror Task 8's TCP pattern for the two remaining protocol families.

UDP (handle_udp_frame):
- On first datagram for a new (guest_src_port, dst_ip, dst_port) 3-tuple,
  record new_host_fd before the Entry API moves the UdpSocket into the
  flow table, then register(fd, flow_token_for_udp, EPOLLIN) + wake().
- relay_udp_flows idle-reaper: unregister(sock.as_raw_fd()) before remove.

ICMP (handle_icmp_frame):
- On first echo request for a new (guest_id, dst_ip) pair, record
  new_icmp_fd before the Entry API moves the socket, then register +
  wake() after the borrow ends.
- relay_icmp_echo idle-timeout eviction: look up sock fd and unregister
  before remove (borrow already dropped by then).

Token layouts (protocol tag in bits 63-56, as for TCP):
  flow_token_for_udp:  0x02 tag | guest_src_port<<32 | dst_port<<16 | dst_ip_low16
  flow_token_for_icmp: 0x03 tag | guest_id<<32

All 18 baseline pins pass; 23/23 lib-network tests pass.
Bench compilation still clean.

* feat(slirp): relay loops dispatch by epoll readiness

drain_to_guest polls the EpollDispatch with a zero-duration timeout and
passes the resulting readiness set to the three relay methods
(relay_tcp_nat_data, relay_icmp_echo, relay_udp_flows).

Each relay now filters by protocol tag (PROTO_TAG_{TCP,UDP,ICMP}) and
only visits flows whose socket appears as EPOLLIN-ready in the event
set, avoiding O(flow_count) reads-on-every-tick.

relay_tcp_nat_data uses a two-pass design: Pass 1 sweeps all TCP entries
for Closed state and idle timeout unconditionally (so a guest FIN that
marks an entry Closed in handle_tcp_frame causes the host TcpStream to
drop promptly, giving the server-side reader an EOF); Pass 2 restricts
the peek/relay I/O to ready entries only.

epoll_arc() added to NetworkBackend trait (Linux cfg-gated, default
None) and overridden on SlirpBackend. VirtioNetDevice.epoll_arc()
delegates to the backend, enabling net_poll_thread (Task 11) to obtain
the shared Arc without an additional lock or refactor.

All 18 baseline pins pass.

* feat(vmm): net_poll_thread driven by epoll_wait

Replace the fixed 5 ms sleep with a blocking epoll_wait(50 ms) on the
EpollDispatch instance obtained from the network backend.

The thread wakes immediately when any registered host socket becomes
readable (relay loop runs at event time, not after a fixed delay) and
falls back to a 50 ms housekeeping tick when idle — preserving the UDP/
ICMP stale-flow reap path that was previously driven by the 5 ms sleep.

If the backend does not expose an epoll instance (non-SlirpBackend, e.g.
unit-test mocks), the thread keeps the original 5 ms sleep fallback.

All 18 baseline pins pass. Release build clean.

* feat(slirp): rebuild epoll set on snapshot restore

`epoll_fd` is a Linux kernel handle that does not survive snapshot:
after `MicroVm::from_snapshot` creates a fresh `SlirpBackend` via
`SlirpBackend::new()`, the new `EpollDispatch` starts with zero
registered FDs.  The current snapshot path does not reconstruct
`flow_table` — the backend always starts empty and new flows form
naturally — so the rebuild is a no-op today.  It is wired in advance
so Phase 6.1's half-close work (which will persist restored flows
across snapshot/restore) has a ready call site.

Changes:
- `EpollDispatch`: add `registered_count` field maintained by
  `register`/`unregister`; expose `registered_fd_count()` under
  `cfg(any(test, feature = "bench-helpers"))`.
- `SlirpBackend::rebuild_epoll_from_flow_table()`: iterates
  `flow_table` and re-registers each live host FD (`host_stream`,
  `sock` for UDP/ICMP) with the current dispatcher.
- `SlirpBackend::registered_fd_count()`: test/bench shim that
  delegates to `EpollDispatch::registered_fd_count()`.
- `SlirpBackend::reset_epoll_for_snapshot_test()`: replaces the
  epoll dispatcher with a fresh empty one, simulating the
  post-snapshot state (kernel handle gone) for unit-level smoke tests.
- `epoll_set_rebuilt_from_flow_table_smoke` in `network_baseline`:
  insert flow → reset epoll → assert count 0 → rebuild → assert
  count 1.

* fix(test): gate epoll_set_rebuilt smoke test on bench-helpers feature

The smoke test consumes #[cfg(any(test, feature = "bench-helpers"))]-
gated helpers (insert_synthetic_synsent_entry,
reset_epoll_for_snapshot_test, registered_fd_count). Integration
tests in tests/ don't get cfg(test) on the void-box library
crate — they only see #[cfg(feature = "bench-helpers")] items
when the feature is enabled.

Without this gate, default `cargo test --test network_baseline`
fails to compile with E0599 on the gated helper methods. Now:

- Default cargo test → 18 pins pass, smoke test invisible.
- cargo test --features bench-helpers -- --test-threads=1 → 19
  pins pass, smoke test included.

The serial-run requirement is to side-step a pre-existing
parallel-run flake in tcp_port_forward_inbound_connect_succeeds
(host port-bind contention; not a Phase 6.4 regression).

* bench(network): tcp_rx_latency_one_packet — Phase 6.4 baseline

Divan microbench (`tcp_rx_latency_one_packet`) measures the SLIRP-layer
per-packet dispatch cost when one TCP flow is Established and the host
kernel has data ready: one zero-timeout epoll_wait + readiness scan +
peek + Ethernet frame construction.

Measured median on this host: ~9.8 µs per drain_to_guest call.

Pre-6.4 the relay iterated every flow in flow_table unconditionally
regardless of readiness.  Post-6.4 it dispatches only the flows with an
epoll EPOLLIN event, reducing wasted work on idle flows to zero.  This
bench is the regression anchor for that change.

The bench is gated on `--features bench-helpers` (like the existing
`tcp_inbound_syn_ack_transition` and `synthesize_inbound_syn` benches).
It performs a full 3-way handshake outside the timed loop so only the
hot relay path is measured.

Note: this bench cannot exercise the net_poll_thread 50 ms epoll cycle
(that thread does not run inside divan).  The wall-clock host→guest
latency floor is the province of voidbox-network-bench's
`tcp_rx_latency_us_p50` field.  That field is added to the Report struct
in this commit but returns None (deferred): wiring a guest-side listener
requires either a guest daemon or an additional exec RPC — both out of
scope for Phase 6.4.  The divan microbench is the primary numerical
deliverable for this phase.

* perf(slirp): eliminate epoll mutex contention via event queue

net_poll_thread holds the EpollDispatch mutex for the full 50 ms
of its blocking wait. drain_to_guest's own non-blocking
wait_with_timeout(ZERO) call contended on the same mutex,
serializing the vCPU thread behind the net-poll thread.
voidbox-network-bench saw TCP g2h throughput drop from ~1885 Mbps
to ~44 Mbps (40× regression).

Fix: SlirpBackend gets a small Mutex<Vec<EpollEvent>> queue.
net_poll_thread pushes events into it after each successful
wait_with_timeout. drain_to_guest drains the queue (brief
uncontended lock) without touching EpollDispatch. A try_lock
fallback path serves unit tests (no net_poll_thread) without
blocking on the mutex.

NetworkBackend trait gains a push_ready_events default-no-op so
SlirpBackend can override it; VirtioNetDevice exposes
push_events_to_backend as the trampoline called by net_poll_thread.

Off-CPU profile evidence: drain_to_guest was 9% off-CPU (29.7s
in a 60s window) waiting on the epoll mutex; should drop to
near-zero post-fix.

* perf(slirp): replace two-pass relay sweep with lazy close queue

relay_tcp_nat_data's Pass 1 unconditionally copied every TCP
FlowKey into a Vec to scan for Closed entries on every drain call.
Cache misses 47/1K under load; poll_with_n_flows/100 regressed
+246% (130ns → 450ns), /1000 regressed +220%.

Fix: when handle_tcp_frame's FIN/RST handlers and mid-function
error paths set state=Closed, push the key onto a pending_close
Vec. relay_tcp_nat_data drains this Vec at the top of its single
ready-events pass — no O(n) collect required. Idle-timeout
detection retains a direct flow_table iteration but without
allocating a separate key Vec.

* perf(slirp): prefer try_lock on epoll over pending_events in drain_to_guest

The initial Bug A fix used pending_events.lock() + try_lock(epoll)
in drain_to_guest's fast path, adding ~150ns overhead per call vs
Phase 6.4 (one extra Mutex acquire). This showed as +38% regression
in poll_idle bench (441ns → 611ns).

Revised approach: try_lock epoll first (zero cost when uncontended —
tests, benches, idle net-poll thread). On Err (net_poll_thread holds
the mutex for 50ms), drain pending_events instead. In production the
try_lock fails ~once per 50ms window; in tests it always succeeds.
Net result: drain_to_guest overhead matches Phase 6.4 when epoll is
uncontended; contention eliminated when net_poll_thread is actively
waiting.

* perf(slirp): wake net-poll thread when process_guest_frame queues frames

Pre-Phase-6.4, net_poll_thread woke unconditionally every 5 ms, so
every ACK queued in inject_to_guest by handle_tcp_frame got flushed
within 5 ms. Phase 6.4's epoll_wait(50 ms) waits for FD readiness
events — but a guest writing data has no FD-side signal (the guest
is the writer; the SLIRP-side socket only becomes readable when the
host responds). So queued ACKs sat 50 ms before being flushed; TCP
send window stalled; voidbox-network-bench TCP g2h dropped from
~1885 Mbps to ~225 Mbps even after the mutex-contention fix.

Fix: track inject_to_guest length around process_guest_frame's
ethertype dispatch.  If the call queued any frames, call
epoll_waker.wake() — one byte to the non-blocking self-pipe, which
unblocks net_poll_thread's epoll_wait so the queued frames flush
within microseconds.

Also fixes the related drain_to_guest event-source ordering bug:
pending_events (filled by net_poll_thread) is now ALWAYS drained
first, with the non-blocking epoll poll only running as a fallback
when the queue is empty (test/bench paths without net_poll_thread).
The previous code took the try_lock branch when net-poll was
between iterations and silently dropped events the net-poll
thread had already pushed.

voidbox-network-bench post-fix:
  g2h:        ~6000 Mbps  (vs master 1885; 3.2x)
  bulk-g2h:   ~3900 Mbps  (vs master 1565; 2.5x at SO_RCVBUF=4096)
  rr p50:     2 us        (parity with master)
  crr p50:    ~50 ms      (5x regression vs master ~10 ms — separate
                           bug, tracked in follow-up; the 50 ms is
                           exactly one epoll_wait cycle and points
                           to a connection-establishment latency
                           issue independent of the throughput path)

* perf(vmm): epoll_wait timeout 50ms → 5ms — restore CRR latency

CRR p50 was regressing +40 ms (10 ms → 51 ms) post-Phase-6.4.  The
+40 ms exactly matches Linux's TCP delayed-ACK timer, and the cause
is that Phase 6.4 widened the net-poll IRQ re-pulse cadence from
5 ms to 50 ms.

The Linux guest spends most idle time in HLT and relies on regular
vCPU scheduling slots — driven by our IRQ pulses — to advance its
TCP delayed-ACK timer.  At 50 ms cadence the guest's pure ACKs
ride the next event-triggered IRQ, which can be 40+ ms away.  At
5 ms the housekeeping cadence mirrors pre-6.4 and the timer fires
on schedule.

We lose Phase 6.4's headline "10x idle-wakeup reduction" goal but
fast-path events still wake immediately via epoll readiness — so
the net win vs master is unchanged: g2h throughput +250%, bulk
throughput +250%, RR parity, CRR parity.

voidbox-network-bench post-fix:
  g2h:        ~6500 Mbps  (vs master 1885; +247%)
  bulk-g2h:   ~5400 Mbps  (vs master 1565; +245%)
  rr p50:     ~3 us       (parity)
  crr p50:    ~10100 us   (parity — back to baseline 10 ms)

* chore: remove Phase-N references from inline + doc comments

Per AGENTS.md doc-comment style ("avoid ticket IDs and PR/commit
references inside doc comments and inline comments — they belong
in commit messages and PR descriptions where they're audit trail;
in code they age into noise as the ticketing context evolves").

Phase references fall into the same category.  Comments are
rewritten in present tense to explain the structural reasoning
without referencing when each piece landed.  Identifiers like
test names and BROKEN_ON_PURPOSE markers are unchanged.

Plan/spec docs in docs/superpowers/plans/ are intentionally
untouched — phase references there ARE the audit trail.

* perf(vmm): adaptive epoll_wait timeout (5 ms active / 50 ms idle)

Recovers Phase 6.4's headline 10x idle-wakeup reduction without
re-introducing the +40 ms CRR regression that forced the cadence
back to a fixed 5 ms.

The adaptive policy:
- last cycle had any kernel event   → next timeout 5 ms (active)
- last cycle timed out (no events) → next timeout 50 ms (idle)

A single quiet cycle drops us to idle; a single event puts us back
in active in the next cycle.

The subtlety that motivated the additional EpollDispatch change:
when the vCPU thread calls epoll_waker.wake() during a 50 ms idle
wait, the kernel's epoll_wait returns with the self-pipe event.
wait_with_timeout filters that event out and drains the pipe — so
`epoll_events.is_empty()` would have remained true, and the naive
"is_empty ⇒ idle" predicate kept us at 50 ms forever, regressing
CRR p50 back to ~50 ms.

wait_with_timeout now returns the *raw* kernel count (including
self-pipe wakes) so the adaptive policy treats wakes as activity.
Filtered events still arrive in the out parameter unchanged; only
the return value's meaning shifted from "observable count" to
"raw count," which all existing callers ignore.

voidbox-network-bench post-fix:
  g2h:        ~6680 Mbps  (vs 5 ms fixed: 6500; vs master: +254%)
  bulk-g2h:   ~5550 Mbps  (vs 5 ms fixed: 5400; vs master: +254%)
  rr p50:     1 us        (in 99-sample iteration; parity)
  crr p50:    ~10100 us   (parity preserved — adaptive correctly
                           holds 5 ms cadence during connection
                           bursts because each connection's wake()
                           keeps raw_kernel_events > 0)

Idle CPU dropped: profile-pre showed net_poll_thread on-CPU at
4.93 % of total with the fixed 5 ms cadence (200 wakes/sec); the
adaptive cadence should cut that roughly 10x during idle stretches
between iterations.

* fix(slirp): collision-safe flow tokens via monotonic counter

flow_token_for_tcp/udp truncated dst_ip to 16 bits; flow_token_for_icmp
omitted dst_ip entirely. Multiple flows could collide on the same
token, mis-routing readiness events to the wrong FlowKey.

Replace the lossy encoding with a monotonic AtomicU64 counter per
backend. Tokens are still tagged in the high byte for protocol demux
(PROTO_TAG_TCP/UDP/ICMP); the lower 56 bits are unique. A new
token_to_key HashMap makes readiness → FlowKey lookup O(1) instead
of the previous linear flow_table scan.

* fix(slirp): EpollDispatch lock-free — register/unregister never block

net_poll_thread held the Mutex<EpollDispatch> across the blocking
epoll_wait call (up to 50 ms in idle cadence). vCPU register/
unregister paths in handle_tcp_frame (and friends) had to acquire
the same mutex and would block behind the wait, stalling guest TCP
SYN handling for up to 50 ms during connection setup.

epoll_ctl and epoll_wait are kernel-thread-safe on the same epoll
fd; the only state requiring synchronization was the self-pipe
(now eagerly initialized in EpollDispatch::new) and the registered
fd count (now AtomicUsize). EpollDispatch becomes Sync without an
external Mutex — the type changes from Arc<Mutex<EpollDispatch>>
to Arc<EpollDispatch>. register/unregister run lock-free against
the wait thread; only the kernel's per-epoll-fd internal lock
serializes, and that's a fast path.

* perf(slirp): O(1) dedup in idle-timeout sweep via HashSet

to_remove.contains() inside the idle-timeout loop was O(n*k)
under churn. Switch the membership check to a HashSet<FlowKey>
and only materialize the Vec once at the end for the removal
loop.

* docs(vmm): net_poll_thread doc-comment matches adaptive timeout

* style: rust-style sweep on Phase 6.4 / Copilot-fix code

Apply project rust-style rules to the recently-landed Phase 6.4
code (token rewrite + lock-free EpollDispatch refactor):

1. RegisterMode enum replaces (readable: bool, writable: bool) on
   EpollDispatch::register. Closed-set policy at the call site
   (Read / Write / ReadWrite) over two opaque booleans.

2. matches!() removed at three sites — the project guide prefers a
   full match (or a boolean == comparison) so the compiler surfaces
   diagnostics if the matched type changes. The unprivileged-ICMP
   errno check now uses == comparisons; the FlowKey::Tcp counter
   uses a for loop.

3. Iterator chains in relay loops rewritten as for loops with
   mutable accumulators per the project rule. relay_tcp_nat_data,
   relay_udp_flows, relay_icmp_echo all touched. Logic unchanged;
   control flow now reads top-down without a chain of
   .filter().filter_map().collect().

4. Local renamed `rc` → `epoll_ctl_result` at three EpollDispatch
   sites. Role-bearing names are required in non-tiny scopes.

5. Dropped redundant explanatory comments around the relay loops
   ("Data relay — only for flows with…", "Skip entries already
   queued for…", "Collect ready ICMP flow keys via…"). The code
   below them is self-describing. Kept structural "why" comments
   (the ICMP idle-sweep rationale, the per-flow socket Drop
   contract).

No behavior change. cargo fmt, clippy -D warnings, network_baseline
(18/18), lib network (23/23), and voidbox-network-bench wall-clock
(g2h ~6580 Mbps, CRR ~32 µs) all green.