Draft

19 commits
f2ba71a  tools: passt/pasta head-to-head comparison harness (dpsoft, May 6, 2026)
b633233  tools: crr-client + voidbox-side single-process CRR diagnostic (dpsoft, May 6, 2026)
f073ab9  tools: bench-qemu-slirp.sh — qemu+libslirp / qemu+passt CRR harness (dpsoft, May 6, 2026)
4ec59f9  perf(virtio-net): hot-path cleanups + suppress redundant IRQ pulses (dpsoft, May 6, 2026)
eef4aeb  perf(vmm): IRQ delivery via KVM_IRQFD instead of KVM_IRQ_LINE pair (dpsoft, May 6, 2026)
e08224d  perf(vmm): KVM_IOEVENTFD for virtio-net TX queue notify (dpsoft, May 6, 2026)
255eb74  perf(virtio-net): lock-free RX hand-off via SegQueue (Option B) (dpsoft, May 6, 2026)
e26a6bc  perf(virtio-net): interrupt_status as Arc<AtomicU32> (dpsoft, May 6, 2026)
c3b7f0a  tools: move perf-harness scripts under tools/perf-harness/ (dpsoft, May 6, 2026)
3c5da08  fix(perf-harness): address Copilot AI review feedback (dpsoft, May 6, 2026)
8c0f49b  perf(slirp): hoist ready-event scratch Vec out of drain_to_guest (dpsoft, May 6, 2026)
08af859  fix(sandbox,bench): expose SLIRP rate-limit knobs for benches (dpsoft, May 6, 2026)
be3c37d  perf(virtio-net): reuse outer Vec across flush_pending_rx calls (dpsoft, May 7, 2026)
58abc71  perf(slirp): hoist relay_tcp_nat_data's frames_to_inject scratch (dpsoft, May 7, 2026)
aba8f85  perf(slirp): reuse flow-key scratch across TCP/ICMP/UDP relays (dpsoft, May 7, 2026)
a7e6296  fix(voidbox-network-bench): lift SLIRP rate limit for the CRR phase (dpsoft, May 7, 2026)
73059cc  docs: scope architectural perf experiments stacked on #81 (dpsoft, May 7, 2026)
e4ff692  perf(slirp): scaffold io_uring batching primitive (feature-gated) (dpsoft, May 7, 2026)
b65080d  bench: add multi-flow concurrent CRR microbench (dpsoft, May 7, 2026)

22 changes: 22 additions & 0 deletions Cargo.lock

(Generated lockfile; diff not rendered.)

16 changes: 16 additions & 0 deletions Cargo.toml
@@ -113,6 +113,17 @@ socket2 = { version = "0.5", features = ["all"] }
# path of a NAT keyed by guest-side ports the guest itself chooses.
rustc-hash = "2"

# Lock-free MPMC queue used to hand virtio-net RX frames from the
# net-poll thread to the vCPU thread without taking the
# `Arc<Mutex<VirtioNetDevice>>` device lock on the hot path.
crossbeam-queue = "0.3"

# Linux io_uring bindings. Gated behind the `io-uring` Cargo
# feature so the baseline epoll+read/write path remains the
# default; the experiment branch toggles this on to A/B against
# the user-space alloc reductions from PR #81.
io-uring = { version = "0.7", optional = true }

# --- macOS-only dependencies ---
[target.'cfg(target_os = "macos")'.dependencies]
# Objective-C 2.0 bindings (auto-generated from Apple frameworks)
@@ -144,6 +155,11 @@ opentelemetry = ["dep:opentelemetry", "dep:opentelemetry_sdk", "dep:opentelemetr
# Expose internal SlirpBackend helpers (insert_synthetic_synsent_entry, etc.)
# for use in benches/. Never enable in production builds.
bench-helpers = []
# Use io_uring for SLIRP host-socket recv/send batching instead of
# per-syscall read/write. Linux-only; falls back to the standard
# path on macOS or when the running kernel lacks io_uring support.
# Off by default while the experiment is being measured.
io-uring = ["dep:io-uring"]

[[bin]]
name = "voidbox"
96 changes: 96 additions & 0 deletions docs/passt-comparison.md
@@ -0,0 +1,96 @@
# passt head-to-head comparison harness

Tools under `tools/perf-harness/` produce a side-by-side comparison of voidbox
(real KVM VM + SLIRP) against passt's [`pasta`](https://passt.top/passt/about/)
running in a network namespace.

This is the deferred deliverable from
[`docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md`](superpowers/plans/2026-04-27-smoltcp-passt-port.md)
§ "passt head-to-head methodology".

## What the harness measures

Both sides run the same workload shape — the same fields the
`voidbox-network-bench` `Report` already emits:

| Field | Workload |
|---|---|
| `tcp_throughput_g2h_mbps` | `dd if=/dev/zero bs=1M count=N \| nc HOST PORT` from inside the guest / netns; host TCP server times the drain |
| `tcp_rr_latency_us_p50/p99` | Persistent connection, host-side echo loop bouncing one byte per round trip |
| `tcp_crr_latency_us_p50` | Independent `nc` invocations in a tight loop; host-side timing of the full accept→read→write→close cycle |

The pasta side uses `pasta -- COMMAND` to run the client inside a fresh
network namespace. Traffic the client sends to pasta's `--map-host-loopback`
address (by default the namespace's gateway IP) is translated to the host's
loopback, so the client connects to `<host-gateway>:PORT` and reaches the
host server bound on `127.0.0.1:PORT`.
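
For concreteness, a minimal Rust sketch of the host-side CRR timing shape from the table above — time the full accept→read→write→close cycle per connection and take a percentile over the samples. This is illustrative only: the port, sample count, and one-byte echo are arbitrary choices, and the real timing lives in the harness scripts, not in this snippet.

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Arbitrary port for the sketch; the client (e.g. `nc` in a tight loop)
    // must be connecting repeatedly so the accept wait *is* the connect latency.
    let listener = TcpListener::bind("127.0.0.1:9000")?;
    let mut samples_us: Vec<u128> = Vec::new();

    for _ in 0..30 {
        let start = Instant::now();             // clock covers accept→read→write→close
        let (mut conn, _) = listener.accept()?; // accept
        let mut byte = [0u8; 1];
        conn.read_exact(&mut byte)?;            // read one byte from the client
        conn.write_all(&byte)?;                 // write it back
        drop(conn);                             // close
        samples_us.push(start.elapsed().as_micros());
    }

    samples_us.sort_unstable();
    println!("CRR p50 = {} µs", samples_us[samples_us.len() / 2]);
    Ok(())
}
```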

## What it's good for

**CRR latency is the most apples-to-apples metric** — it's dominated by
NAT-table operations and the round-trip path through the user-mode
networking stack, which is work both stacks do in user space. Per the spec:

> Connect rate (CRR latency) is the most apples-to-apples metric —
> dominated by NAT-table operations, not MMIO. If passt does CRR in 135 µs
> and we do 600 µs, that's a meaningful "we have 4× more overhead per
> connect" signal that this refactor should narrow.

## What it's not

**Throughput numbers are not directly comparable.**

- voidbox runs a real KVM VM; every packet incurs `virtio-mmio`
exits, vCPU IPI overhead, and per-packet copy across the device
boundary.
- pasta runs in a network namespace; the data path is just user-mode
socket forwarding, no VM, no MMIO.

The throughput gap therefore reflects *the user-mode overhead difference
between the two stacks* plus *the VM transit cost only voidbox pays*.
Use the throughput numbers as a sanity bound, not a parity target.

A proper VM-vs-VM comparison would run passt under
`qemu-system-x86_64` with a guest image carrying `nc` / `iperf3`.
That is documented as a separate follow-up; the harness here is the
quick, low-friction sibling that exercises the apples-to-apples
metric (CRR) without requiring an extra guest image.

## Usage

```bash
# Generate voidbox numbers (requires VOID_BOX_KERNEL/VOID_BOX_INITRAMFS).
cargo run --release --bin voidbox-network-bench -- \
--iterations 3 --output /tmp/voidbox-bench.json

# Generate pasta numbers (requires pasta on PATH or via $PASTA).
tools/perf-harness/bench-pasta.py --output /tmp/pasta-bench.json

# Side-by-side markdown.
tools/perf-harness/bench-compare-pasta.py /tmp/voidbox-bench.json /tmp/pasta-bench.json \
--output /tmp/voidbox-vs-pasta.md

# qemu+libslirp / qemu+passt CRR (apples-to-apples SLIRP-vs-SLIRP).
gcc -O2 -static -o /tmp/crr-client tools/perf-harness/crr-client.c
tools/perf-harness/bench-qemu-slirp.sh --backend libslirp --iterations 30
tools/perf-harness/bench-qemu-slirp.sh --backend passt --iterations 30

# Voidbox single-process CRR (no per-iteration nc fork).
cargo run --release --example crr_singleproc_bench -- --iterations 30
```

`tools/perf-harness/bench-pasta.py --help` lists tunables (iterations,
transfer size, sample counts).

## Reading the report

| Δ column | Meaning |
|---|---|
| `voidbox N× faster` (throughput) | voidbox has the higher Mbps number |
| `voidbox N× slower` (throughput) | pasta has the higher Mbps number — expected, since pasta has no VM |
| `voidbox N× faster` (latency) | voidbox has the lower µs number |
| `voidbox N× slower` (latency) | pasta has the lower µs number — large multiples here mean voidbox spends much of its CRR time outside the NAT path (poll-thread cadence, vCPU exits, virtio handling) |
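
Expressed as a tiny sketch of the Δ semantics (illustrative only; the real comparison logic lives in `bench-compare-pasta.py`): the ratio is always the larger number over the smaller, and the direction depends on whether lower or higher is better for that metric.

```rust
// Illustrative helper, not part of the harness.
fn delta_label(voidbox: f64, pasta: f64, lower_is_better: bool) -> String {
    let voidbox_wins = if lower_is_better { voidbox < pasta } else { voidbox > pasta };
    let ratio = voidbox.max(pasta) / voidbox.min(pasta); // larger over smaller
    format!(
        "voidbox {:.1}× {}",
        ratio,
        if voidbox_wins { "faster" } else { "slower" }
    )
}

fn main() {
    // Latency example uses the numbers quoted in this PR (~275 µs vs ~135 µs);
    // the throughput figures are hypothetical, just to show the other direction.
    println!("{}", delta_label(275.0, 135.0, true));   // voidbox 2.0× slower (CRR p50)
    println!("{}", delta_label(900.0, 4500.0, false)); // voidbox 5.0× slower (Mbps)
}
```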

A useful CRR signal: if `voidbox N× slower on CRR p50` is much larger
than `voidbox N× slower on RR p50`, the per-connection overhead is the
bottleneck, not the data path. RR p50 captures the data path; CRR
captures the connect path.
115 changes: 115 additions & 0 deletions docs/perf-architectural-experiments.md
@@ -0,0 +1,115 @@
# SLIRP perf — architectural experiments

Stacked on top of #81. After the heaptrack-driven user-space alloc
reductions were exhausted (-90% allocs/iter, p50 unchanged at ~275 µs), the
remaining wall-clock floor is dominated by:

1. **Kernel ↔ userspace transitions** — per-packet `read()`/`write()` on
host sockets, one syscall per packet, serial in `net_poll_thread`.
2. **Per-vCPU MMIO exits** for virtio doorbell writes (already partially
addressed by `KVM_IOEVENTFD` for TX-notify; RX-notify and other
queues still exit).
3. **Single-queue serialization** through `net_poll_thread`'s single
epoll loop, even with multi-vCPU guests.

This document tracks the architectural experiments that target those
floors, ranked by risk × payoff. Each experiment lands as its own
commit with a measurement vs the #81 baseline attached.

## Non-goal: TAP / passt-style host bypass

Dropping SLIRP and routing through TAP + an external passt instance
would close the latency gap to passt itself, but it would move the
DNS interception, port-forwarding, deny-list, and rate-limiting
feature surface out of voidbox into a separate process — and we lose
the in-process observability we currently get from instrumenting
SLIRP directly. **Full SLIRP-path observability is a hard
requirement**, so passt-style bypass is out of scope.

## Experiments

### 1. `io_uring` for SLIRP host-socket I/O — start here

**Current path:** per-flow `recv()` + `sendto()` on host sockets,
one syscall per packet, called serially from `net_poll_thread`.
On CRR that is ~5 syscalls/iter; on bulk transfers it's the dominant cost.

**Proposal:** add an `io_uring` instance to the SLIRP backend,
side-by-side with the existing `EpollDispatch`:

- After each `epoll_wait`, submit a batched `IORING_OP_RECV` SQE
for every readable host socket — one SQE per flow with new
data, all submitted in a single syscall.
- Submit `IORING_OP_SEND` SQEs for the outbound frames the SLIRP
stack builds, again batched into a single submission.
- Drain CQEs in the relay loop instead of calling `recv` /
`sendto` directly.

**Expected:** ~10–30 µs CRR p50 reduction (~5 syscalls per CRR
× ~3–5 µs/syscall, scaled by how many of them batching collapses).
Measurable via `examples/crr_singleproc_bench`.

**Risk:** lowest — the change is localized to the relay layer's
read/write helpers. Falls back to the existing path behind a
build feature so we can A/B.
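
A minimal sketch of the batched-recv shape using the `io-uring` crate added in Cargo.toml (built with `--features io-uring`): one Recv SQE per readable flow, a single submit syscall, then drain CQEs. Names like `drain_readable_flows`, `flow_fds`, and the buffer sizing are illustrative, not existing voidbox code; real code would also keep one long-lived ring rather than creating one per call.

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::RawFd;

const BUF_LEN: usize = 2048; // illustrative per-flow scratch size

fn drain_readable_flows(flow_fds: &[RawFd]) -> std::io::Result<()> {
    let mut ring = IoUring::new(64)?;
    let mut bufs = vec![[0u8; BUF_LEN]; flow_fds.len()];

    // One IORING_OP_RECV SQE per readable flow, tagged with its index.
    for (i, (&fd, buf)) in flow_fds.iter().zip(bufs.iter_mut()).enumerate() {
        let sqe = opcode::Recv::new(types::Fd(fd), buf.as_mut_ptr(), BUF_LEN as u32)
            .build()
            .user_data(i as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // A single syscall submits every SQE and waits for all completions.
    ring.submit_and_wait(flow_fds.len())?;

    for cqe in ring.completion() {
        let flow = cqe.user_data() as usize;
        let n = cqe.result(); // bytes received, or -errno
        if n > 0 {
            // Hand bufs[flow][..n] to the SLIRP relay loop here.
            let _frame = &bufs[flow][..n as usize];
        }
    }
    Ok(())
}

fn main() {
    // Usage would pass the readable host-socket fds collected from epoll_wait.
    let _ = drain_readable_flows(&[]);
}
```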

### 2. `splice()` / `sendfile()` zero-copy on bulk paths

**Current path:** guest virtio TX ring → vmm copies into Rust
`Vec<u8>` → SLIRP/smoltcp → kernel send buffer of host socket.
The middle copy is avoidable for direct-pipe flows where guest
payload is destined to a host TCP socket without header rewrites.

**Proposal:** `splice()` between the host-socket fd and a pipe (and from
the pipe on to the next stage) eliminates one userspace copy. It only
works fd-to-fd, so SLIRP NAT rewriting defeats it for the header path;
it applies to the **payload bytes only** if we route header building
through smoltcp metadata and pipe just the bulk payload.

**Expected:** +10–20% throughput on `tcp_throughput_g2h_mbps`.
**Risk:** medium. Plumbing pipe fds through the relay state
machine is non-trivial; needs care around partial writes and
backpressure.
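
A hedged sketch of the fd → pipe → fd shape using `libc::splice` (assuming the `libc` crate is available; `splice_payload` is an illustrative helper name, and error/partial-transfer handling is pared down to the minimum):

```rust
use std::io;
use std::os::unix::io::RawFd;

/// Move up to `len` bytes from `src_fd` to `dst_fd` through a transient pipe
/// without copying them through userspace. Sketch only: production code would
/// reuse the pipe, loop on partial transfers, and handle EAGAIN/backpressure.
fn splice_payload(src_fd: RawFd, dst_fd: RawFd, len: usize) -> io::Result<usize> {
    let mut pipe_fds = [0 as libc::c_int; 2];
    if unsafe { libc::pipe2(pipe_fds.as_mut_ptr(), libc::O_CLOEXEC) } < 0 {
        return Err(io::Error::last_os_error());
    }
    let (pipe_r, pipe_w) = (pipe_fds[0], pipe_fds[1]);

    // Host socket -> pipe: the payload stays in kernel buffers.
    let n_in = unsafe {
        libc::splice(src_fd, std::ptr::null_mut(), pipe_w, std::ptr::null_mut(),
                     len, libc::SPLICE_F_MOVE)
    };
    // Pipe -> destination socket: still no userspace copy.
    let n_out = if n_in > 0 {
        unsafe {
            libc::splice(pipe_r, std::ptr::null_mut(), dst_fd, std::ptr::null_mut(),
                         n_in as usize, libc::SPLICE_F_MOVE)
        }
    } else {
        n_in
    };

    unsafe { libc::close(pipe_r); libc::close(pipe_w); }
    if n_in < 0 || n_out < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(n_out as usize)
}
```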

### 3. MSI-X virtio + multi-queue for vCPU scaling

**Current path:** virtio-net uses a single RX queue + single TX
queue, both serviced by `net_poll_thread`. With multi-vCPU
guests, the contention is on `net_poll_thread`'s single epoll
loop.

**Proposal:** add MSI-X support to `src/vmm/arch/x86_64/` (currently
INTx only) and expose `VIRTIO_NET_F_MQ` so the guest can spin up
per-CPU queue pairs. Host side fans out queues to multiple poll
threads, each on its own epoll instance.

**Expected:** +50–100% throughput on multi-vCPU sandboxes. No
impact on single-vCPU CRR microbenches.
**Risk:** highest of the three. It touches IRQ delivery and the
`KVM_IRQFD` wiring, the IRQ path becomes hardware-feature-gated, and
CI workers without MSI-X support need a fallback.
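
For reference, the virtio-net pieces experiment 3 would need to expose — the `VIRTIO_NET_F_MQ` feature bit and the `max_virtqueue_pairs` config field — as defined by the virtio 1.1 spec. This is spec layout, not existing voidbox code, and the vCPU cap in `main` is an arbitrary example.

```rust
/// Feature bit the device offers when it supports multiple queue pairs.
const VIRTIO_NET_F_MQ: u64 = 1 << 22;

/// Leading fields of the virtio-net config space the guest reads; with
/// VIRTIO_NET_F_MQ negotiated it uses `max_virtqueue_pairs` to decide how
/// many RX/TX pairs to enable via the control queue.
#[repr(C, packed)]
struct VirtioNetConfig {
    mac: [u8; 6],
    status: u16,
    max_virtqueue_pairs: u16,
    mtu: u16,
}

fn main() {
    // e.g. offer one queue pair per vCPU, capped by what the device supports.
    let vcpus: u16 = 4;
    let offered_pairs = vcpus.min(8);
    println!("offer MQ bit {:#x}, max_virtqueue_pairs = {}", VIRTIO_NET_F_MQ, offered_pairs);
}
```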

## Tooling

All experiments measured with the perf-harness from #81:

| Tool | Signal |
|---|---|
| `examples/crr_singleproc_bench` | CRR p50/p99 (real NAT path) |
| `voidbox-network-bench` | g2h throughput, RR p50/p99 |
| `heaptrack` | allocation regression check |
| `tools/perf-harness/bench-pasta.py` | pasta reference number |
| `tools/perf-harness/bench-qemu-slirp.sh` | qemu+libslirp / qemu+passt cross-check |

## Methodology

1. Each experiment is a single commit gated behind a Cargo feature
(`io-uring`, `splice-zerocopy`, `multi-queue`) so the baseline
can A/B against it without a revert.
2. Commit message includes the before/after numbers from
`crr_singleproc_bench --iterations 100` and
`voidbox-network-bench --iterations 3`.
3. heaptrack run after each commit confirms no alloc regression
vs the round-2 number from #81 (~41 allocs/iter on CRR).
4. If a commit doesn't move the needle, it's reverted before the
next experiment lands so the diff stays minimal.
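
A minimal sketch of the gating shape from step 1, so the same tree builds both the baseline and the experiment (the function name is illustrative, not an existing voidbox symbol):

```rust
// Build the experiment with `cargo build --features io-uring`; the default
// build keeps the existing epoll + per-syscall path.
#[cfg(feature = "io-uring")]
fn drain_host_sockets() {
    // io_uring batched recv/send path (experiment 1)
}

#[cfg(not(feature = "io-uring"))]
fn drain_host_sockets() {
    // existing per-flow recv()/sendto() path
}

fn main() {
    drain_host_sockets();
}
```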