Draft

19 commits
f2ba71a  tools: passt/pasta head-to-head comparison harness (dpsoft, May 6, 2026)
b633233  tools: crr-client + voidbox-side single-process CRR diagnostic (dpsoft, May 6, 2026)
f073ab9  tools: bench-qemu-slirp.sh — qemu+libslirp / qemu+passt CRR harness (dpsoft, May 6, 2026)
4ec59f9  perf(virtio-net): hot-path cleanups + suppress redundant IRQ pulses (dpsoft, May 6, 2026)
eef4aeb  perf(vmm): IRQ delivery via KVM_IRQFD instead of KVM_IRQ_LINE pair (dpsoft, May 6, 2026)
e08224d  perf(vmm): KVM_IOEVENTFD for virtio-net TX queue notify (dpsoft, May 6, 2026)
255eb74  perf(virtio-net): lock-free RX hand-off via SegQueue (Option B) (dpsoft, May 6, 2026)
e26a6bc  perf(virtio-net): interrupt_status as Arc<AtomicU32> (dpsoft, May 6, 2026)
c3b7f0a  tools: move perf-harness scripts under tools/perf-harness/ (dpsoft, May 6, 2026)
3c5da08  fix(perf-harness): address Copilot AI review feedback (dpsoft, May 6, 2026)
8c0f49b  perf(slirp): hoist ready-event scratch Vec out of drain_to_guest (dpsoft, May 6, 2026)
08af859  fix(sandbox,bench): expose SLIRP rate-limit knobs for benches (dpsoft, May 6, 2026)
be3c37d  perf(virtio-net): reuse outer Vec across flush_pending_rx calls (dpsoft, May 7, 2026)
58abc71  perf(slirp): hoist relay_tcp_nat_data's frames_to_inject scratch (dpsoft, May 7, 2026)
aba8f85  perf(slirp): reuse flow-key scratch across TCP/ICMP/UDP relays (dpsoft, May 7, 2026)
a7e6296  fix(voidbox-network-bench): lift SLIRP rate limit for the CRR phase (dpsoft, May 7, 2026)
73059cc  docs: scope architectural perf experiments stacked on #81 (dpsoft, May 7, 2026)
e4ff692  perf(slirp): scaffold io_uring batching primitive (feature-gated) (dpsoft, May 7, 2026)
b65080d  bench: add multi-flow concurrent CRR microbench (dpsoft, May 7, 2026)

22 changes: 22 additions & 0 deletions Cargo.lock

(Generated lockfile; diff not rendered.)

16 changes: 16 additions & 0 deletions Cargo.toml
@@ -113,6 +113,17 @@ socket2 = { version = "0.5", features = ["all"] }
# path of a NAT keyed by guest-side ports the guest itself chooses.
rustc-hash = "2"

# Lock-free MPMC queue used to hand virtio-net RX frames from the
# net-poll thread to the vCPU thread without taking the
# `Arc<Mutex<VirtioNetDevice>>` device lock on the hot path.
crossbeam-queue = "0.3"

# Linux io_uring bindings. Gated behind the `io-uring` Cargo
# feature so the baseline epoll+read/write path remains the
# default; the experiment branch toggles this on to A/B against
# the user-space alloc reductions from PR #81.
io-uring = { version = "0.7", optional = true }

# --- macOS-only dependencies ---
[target.'cfg(target_os = "macos")'.dependencies]
# Objective-C 2.0 bindings (auto-generated from Apple frameworks)
@@ -144,6 +155,11 @@ opentelemetry = ["dep:opentelemetry", "dep:opentelemetry_sdk", "dep:opentelemetr
# Expose internal SlirpBackend helpers (insert_synthetic_synsent_entry, etc.)
# for use in benches/. Never enable in production builds.
bench-helpers = []
# Use io_uring for SLIRP host-socket recv/send batching instead of
# per-syscall read/write. Linux-only; falls back to the standard
# path on macOS or when the running kernel lacks io_uring support.
# Off by default while the experiment is being measured.
io-uring = ["dep:io-uring"]

[[bin]]
name = "voidbox"
96 changes: 96 additions & 0 deletions docs/passt-comparison.md
@@ -0,0 +1,96 @@
# passt head-to-head comparison harness

Tools under `tools/perf-harness/` produce a side-by-side comparison of voidbox
(real KVM VM + SLIRP) against passt's [`pasta`](https://passt.top/passt/about/)
running in a network namespace.

This is the deferred deliverable from
[`docs/superpowers/plans/2026-04-27-smoltcp-passt-port.md`](superpowers/plans/2026-04-27-smoltcp-passt-port.md)
§ "passt head-to-head methodology".

## What the harness measures

Both sides run the same workload shape — the same fields the
`voidbox-network-bench` `Report` already emits:

| Field | Workload |
|---|---|
| `tcp_throughput_g2h_mbps` | `dd if=/dev/zero bs=1M count=N \| nc HOST PORT` from inside the guest / netns; host TCP server times the drain |
| `tcp_rr_latency_us_p50/p99` | Persistent connection, host-side echo loop bouncing one byte per round trip |
| `tcp_crr_latency_us_p50` | Independent `nc` invocations in a tight loop; host-side timing of the full accept→read→write→close cycle |

The pasta side uses `pasta -- COMMAND` to run the client inside a fresh
network namespace. Traffic the client sends to pasta's `--map-host-loopback`
address (by default the namespace's gateway IP) is translated to the host's
loopback, so the client connects to `<host-gateway>:PORT` and reaches the
host server bound on `127.0.0.1:PORT`.
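
For concreteness, a minimal Rust sketch of the host-side CRR timing shape from the table above — time the full accept→read→write→close cycle per connection and take a percentile over the samples. This is illustrative only: the port, sample count, and one-byte echo are arbitrary choices, and the real timing lives in the harness scripts, not in this snippet.

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Arbitrary port for the sketch; the client (e.g. `nc` in a tight loop)
    // must be connecting repeatedly so the accept wait *is* the connect latency.
    let listener = TcpListener::bind("127.0.0.1:9000")?;
    let mut samples_us: Vec<u128> = Vec::new();

    for _ in 0..30 {
        let start = Instant::now();             // clock covers accept→read→write→close
        let (mut conn, _) = listener.accept()?; // accept
        let mut byte = [0u8; 1];
        conn.read_exact(&mut byte)?;            // read one byte from the client
        conn.write_all(&byte)?;                 // write it back
        drop(conn);                             // close
        samples_us.push(start.elapsed().as_micros());
    }

    samples_us.sort_unstable();
    println!("CRR p50 = {} µs", samples_us[samples_us.len() / 2]);
    Ok(())
}
```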

## What it's good for

**CRR latency is the most apples-to-apples metric** — it's dominated by
NAT-table operations and the round-trip path through the user-mode
networking stack, which is work both stacks do in user space. Per the spec:

> Connect rate (CRR latency) is the most apples-to-apples metric —
> dominated by NAT-table operations, not MMIO. If passt does CRR in 135 µs
> and we do 600 µs, that's a meaningful "we have 4× more overhead per
> connect" signal that this refactor should narrow.

## What it's not

**Throughput numbers are not directly comparable.**

- voidbox runs a real KVM VM; every packet incurs `virtio-mmio`
exits, vCPU IPI overhead, and per-packet copy across the device
boundary.
- pasta runs in a network namespace; the data path is just user-mode
socket forwarding, no VM, no MMIO.

The throughput gap therefore reflects *the user-mode overhead difference
between the two stacks* plus *the VM transit cost only voidbox pays*.
Use the throughput numbers as a sanity bound, not a parity target.

A proper VM-vs-VM comparison would run passt under
`qemu-system-x86_64` with a guest image carrying `nc` / `iperf3`.
That is documented as a separate follow-up; the harness here is the
quick, low-friction sibling that exercises the apples-to-apples
metric (CRR) without requiring an extra guest image.

## Usage

```bash
# Generate voidbox numbers (requires VOID_BOX_KERNEL/VOID_BOX_INITRAMFS).
cargo run --release --bin voidbox-network-bench -- \
--iterations 3 --output /tmp/voidbox-bench.json

# Generate pasta numbers (requires pasta on PATH or via $PASTA).
tools/perf-harness/bench-pasta.py --output /tmp/pasta-bench.json

# Side-by-side markdown.
tools/perf-harness/bench-compare-pasta.py /tmp/voidbox-bench.json /tmp/pasta-bench.json \
--output /tmp/voidbox-vs-pasta.md

# qemu+libslirp / qemu+passt CRR (apples-to-apples SLIRP-vs-SLIRP).
gcc -O2 -static -o /tmp/crr-client tools/perf-harness/crr-client.c
tools/perf-harness/bench-qemu-slirp.sh --backend libslirp --iterations 30
tools/perf-harness/bench-qemu-slirp.sh --backend passt --iterations 30

# Voidbox single-process CRR (no per-iteration nc fork).
cargo run --release --example crr_singleproc_bench -- --iterations 30
```

`tools/perf-harness/bench-pasta.py --help` lists tunables (iterations,
transfer size, sample counts).

## Reading the report

| Δ column | Meaning |
|---|---|
| `voidbox N× faster` (throughput) | voidbox has the higher Mbps number |
| `voidbox N× slower` (throughput) | pasta has the higher Mbps number — expected, since pasta has no VM |
| `voidbox N× faster` (latency) | voidbox has the lower µs number |
| `voidbox N× slower` (latency) | pasta has the lower µs number — large multiples here mean voidbox spends much of its CRR time outside the NAT path (poll-thread cadence, vCPU exits, virtio handling) |
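
Expressed as a tiny sketch of the Δ semantics (illustrative only; the real comparison logic lives in `bench-compare-pasta.py`): the ratio is always the larger number over the smaller, and the direction depends on whether lower or higher is better for that metric.

```rust
// Illustrative helper, not part of the harness.
fn delta_label(voidbox: f64, pasta: f64, lower_is_better: bool) -> String {
    let voidbox_wins = if lower_is_better { voidbox < pasta } else { voidbox > pasta };
    let ratio = voidbox.max(pasta) / voidbox.min(pasta); // larger over smaller
    format!(
        "voidbox {:.1}× {}",
        ratio,
        if voidbox_wins { "faster" } else { "slower" }
    )
}

fn main() {
    // Latency example uses the numbers quoted in this PR (~275 µs vs ~135 µs);
    // the throughput figures are hypothetical, just to show the other direction.
    println!("{}", delta_label(275.0, 135.0, true));   // voidbox 2.0× slower (CRR p50)
    println!("{}", delta_label(900.0, 4500.0, false)); // voidbox 5.0× slower (Mbps)
}
```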

A useful CRR signal: if `voidbox N× slower on CRR p50` is much larger
than `voidbox N× slower on RR p50`, the per-connection overhead is the
bottleneck, not the data path. RR p50 captures the data path; CRR
captures the connect path.
115 changes: 115 additions & 0 deletions docs/perf-architectural-experiments.md
@@ -0,0 +1,115 @@
# SLIRP perf — architectural experiments

Stacked on top of #81. After the heaptrack-driven user-space alloc
reductions were exhausted (-90% allocs/iter, p50 unchanged at ~275 µs), the
remaining wall-clock floor is dominated by:

1. **Kernel ↔ userspace transitions** — per-packet `read()`/`write()` on
host sockets, one syscall per packet, serial in `net_poll_thread`.
2. **Per-vCPU MMIO exits** for virtio doorbell writes (already partially
addressed by `KVM_IOEVENTFD` for TX-notify; RX-notify and other
queues still exit).
3. **Single-queue serialization** through `net_poll_thread`'s single
epoll loop, even with multi-vCPU guests.

This document tracks the architectural experiments that target those
floors, ranked by risk × payoff. Each experiment lands as its own
commit with a measurement vs the #81 baseline attached.

## Non-goal: TAP / passt-style host bypass

Dropping SLIRP and routing through TAP + an external passt instance
would close the latency gap to passt itself, but it would move the
DNS interception, port-forwarding, deny-list, and rate-limiting
feature surface out of voidbox into a separate process — and we lose
the in-process observability we currently get from instrumenting
SLIRP directly. **Full SLIRP-path observability is a hard
requirement**, so passt-style bypass is out of scope.

## Experiments

### 1. `io_uring` for SLIRP host-socket I/O — start here

**Current path:** per-flow `recv()` + `sendto()` on host sockets,
one syscall per packet, called serially from `net_poll_thread`.
On CRR that is ~5 syscalls/iter; on bulk transfers it's the dominant cost.

**Proposal:** add an `io_uring` instance to the SLIRP backend,
side-by-side with the existing `EpollDispatch`:

- After each `epoll_wait`, submit a batched `IORING_OP_RECV` SQE
for every readable host socket — one SQE per flow with new
data, all submitted in a single syscall.
- Submit `IORING_OP_SEND` SQEs for the outbound frames the SLIRP
stack builds, again batched into a single submission.
- Drain CQEs in the relay loop instead of calling `recv` /
`sendto` directly.

**Expected:** ~10–30 µs CRR p50 reduction (~5 syscalls per CRR
× ~3–5 µs/syscall, scaled by how many of them batching collapses).
Measurable via `examples/crr_singleproc_bench`.

**Risk:** lowest — the change is localized to the relay layer's
read/write helpers. Falls back to the existing path behind a
build feature so we can A/B.
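
A minimal sketch of the batched-recv shape using the `io-uring` crate added in Cargo.toml (built with `--features io-uring`): one Recv SQE per readable flow, a single submit syscall, then drain CQEs. Names like `drain_readable_flows`, `flow_fds`, and the buffer sizing are illustrative, not existing voidbox code; real code would also keep one long-lived ring rather than creating one per call.

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::RawFd;

const BUF_LEN: usize = 2048; // illustrative per-flow scratch size

fn drain_readable_flows(flow_fds: &[RawFd]) -> std::io::Result<()> {
    let mut ring = IoUring::new(64)?;
    let mut bufs = vec![[0u8; BUF_LEN]; flow_fds.len()];

    // One IORING_OP_RECV SQE per readable flow, tagged with its index.
    for (i, (&fd, buf)) in flow_fds.iter().zip(bufs.iter_mut()).enumerate() {
        let sqe = opcode::Recv::new(types::Fd(fd), buf.as_mut_ptr(), BUF_LEN as u32)
            .build()
            .user_data(i as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // A single syscall submits every SQE and waits for all completions.
    ring.submit_and_wait(flow_fds.len())?;

    for cqe in ring.completion() {
        let flow = cqe.user_data() as usize;
        let n = cqe.result(); // bytes received, or -errno
        if n > 0 {
            // Hand bufs[flow][..n] to the SLIRP relay loop here.
            let _frame = &bufs[flow][..n as usize];
        }
    }
    Ok(())
}

fn main() {
    // Usage would pass the readable host-socket fds collected from epoll_wait.
    let _ = drain_readable_flows(&[]);
}
```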

### 2. `splice()` / `sendfile()` zero-copy on bulk paths

**Current path:** guest virtio TX ring → vmm copies into Rust
`Vec<u8>` → SLIRP/smoltcp → kernel send buffer of host socket.
The middle copy is avoidable for direct-pipe flows where guest
payload is destined to a host TCP socket without header rewrites.

**Proposal:** `splice()` between the host-socket fd and a pipe (and from
the pipe on to the next stage) eliminates one userspace copy. It only
works fd-to-fd, so SLIRP NAT rewriting defeats it for the header path;
it applies to the **payload bytes only** if we route header building
through smoltcp metadata and pipe just the bulk payload.

**Expected:** +10–20% throughput on `tcp_throughput_g2h_mbps`.
**Risk:** medium. Plumbing pipe fds through the relay state
machine is non-trivial; needs care around partial writes and
backpressure.
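
A hedged sketch of the fd → pipe → fd shape using `libc::splice` (assuming the `libc` crate is available; `splice_payload` is an illustrative helper name, and error/partial-transfer handling is pared down to the minimum):

```rust
use std::io;
use std::os::unix::io::RawFd;

/// Move up to `len` bytes from `src_fd` to `dst_fd` through a transient pipe
/// without copying them through userspace. Sketch only: production code would
/// reuse the pipe, loop on partial transfers, and handle EAGAIN/backpressure.
fn splice_payload(src_fd: RawFd, dst_fd: RawFd, len: usize) -> io::Result<usize> {
    let mut pipe_fds = [0 as libc::c_int; 2];
    if unsafe { libc::pipe2(pipe_fds.as_mut_ptr(), libc::O_CLOEXEC) } < 0 {
        return Err(io::Error::last_os_error());
    }
    let (pipe_r, pipe_w) = (pipe_fds[0], pipe_fds[1]);

    // Host socket -> pipe: the payload stays in kernel buffers.
    let n_in = unsafe {
        libc::splice(src_fd, std::ptr::null_mut(), pipe_w, std::ptr::null_mut(),
                     len, libc::SPLICE_F_MOVE)
    };
    // Pipe -> destination socket: still no userspace copy.
    let n_out = if n_in > 0 {
        unsafe {
            libc::splice(pipe_r, std::ptr::null_mut(), dst_fd, std::ptr::null_mut(),
                         n_in as usize, libc::SPLICE_F_MOVE)
        }
    } else {
        n_in
    };

    unsafe { libc::close(pipe_r); libc::close(pipe_w); }
    if n_in < 0 || n_out < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(n_out as usize)
}
```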

### 3. MSI-X virtio + multi-queue for vCPU scaling

**Current path:** virtio-net uses a single RX queue + single TX
queue, both serviced by `net_poll_thread`. With multi-vCPU
guests, the contention is on `net_poll_thread`'s single epoll
loop.

**Proposal:** add MSI-X support to `src/vmm/arch/x86_64/` (currently
INTx only) and expose `VIRTIO_NET_F_MQ` so the guest can spin up
per-CPU queue pairs. Host side fans out queues to multiple poll
threads, each on its own epoll instance.

**Expected:** +50–100% throughput on multi-vCPU sandboxes. No
impact on single-vCPU CRR microbenches.
**Risk:** highest of the three. It touches IRQ delivery and the
`KVM_IRQFD` wiring, the IRQ path becomes hardware-feature-gated, and
CI workers without MSI-X support need a fallback.
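
For reference, the virtio-net pieces experiment 3 would need to expose — the `VIRTIO_NET_F_MQ` feature bit and the `max_virtqueue_pairs` config field — as defined by the virtio 1.1 spec. This is spec layout, not existing voidbox code, and the vCPU cap in `main` is an arbitrary example.

```rust
/// Feature bit the device offers when it supports multiple queue pairs.
const VIRTIO_NET_F_MQ: u64 = 1 << 22;

/// Leading fields of the virtio-net config space the guest reads; with
/// VIRTIO_NET_F_MQ negotiated it uses `max_virtqueue_pairs` to decide how
/// many RX/TX pairs to enable via the control queue.
#[repr(C, packed)]
struct VirtioNetConfig {
    mac: [u8; 6],
    status: u16,
    max_virtqueue_pairs: u16,
    mtu: u16,
}

fn main() {
    // e.g. offer one queue pair per vCPU, capped by what the device supports.
    let vcpus: u16 = 4;
    let offered_pairs = vcpus.min(8);
    println!("offer MQ bit {:#x}, max_virtqueue_pairs = {}", VIRTIO_NET_F_MQ, offered_pairs);
}
```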

## Tooling

All experiments measured with the perf-harness from #81:

| Tool | Signal |
|---|---|
| `examples/crr_singleproc_bench` | CRR p50/p99 (real NAT path) |
| `voidbox-network-bench` | g2h throughput, RR p50/p99 |
| `heaptrack` | allocation regression check |
| `tools/perf-harness/bench-pasta.py` | pasta reference number |
| `tools/perf-harness/bench-qemu-slirp.sh` | qemu+libslirp / qemu+passt cross-check |

## Methodology

1. Each experiment is a single commit gated behind a Cargo feature
(`io-uring`, `splice-zerocopy`, `multi-queue`) so the baseline
can A/B against it without a revert.
2. Commit message includes the before/after numbers from
`crr_singleproc_bench --iterations 100` and
`voidbox-network-bench --iterations 3`.
3. heaptrack run after each commit confirms no alloc regression
vs the round-2 number from #81 (~41 allocs/iter on CRR).
4. If a commit doesn't move the needle, it's reverted before the
next experiment lands so the diff stays minimal.
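
A minimal sketch of the gating shape from step 1, so the same tree builds both the baseline and the experiment (the function name is illustrative, not an existing voidbox symbol):

```rust
// Build the experiment with `cargo build --features io-uring`; the default
// build keeps the existing epoll + per-syscall path.
#[cfg(feature = "io-uring")]
fn drain_host_sockets() {
    // io_uring batched recv/send path (experiment 1)
}

#[cfg(not(feature = "io-uring"))]
fn drain_host_sockets() {
    // existing per-flow recv()/sendto() path
}

fn main() {
    drain_host_sockets();
}
```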