Phase 6.1: TCP half-close — fix silent data loss on shutdown(SHUT_WR)#73
Closed
dpsoft wants to merge 10 commits intoport-forward-listener-epollfrom
Closed
Phase 6.1: TCP half-close — fix silent data loss on shutdown(SHUT_WR)#73dpsoft wants to merge 10 commits intoport-forward-listener-epollfrom
dpsoft wants to merge 10 commits intoport-forward-listener-epollfrom
Conversation
Each TCP port-forward rule used to burn one thread that polled TcpListener::accept() with 50 ms sleeps between WouldBlock returns. That thread was both the accept-latency floor and the only piece of networking still polling on a fixed cadence after Phase 6.4. Listener FDs are exactly what EpollDispatch handles. Bind + register the listener under PROTO_TAG_LISTEN at construction time; the net-poll thread sees readiness, accepts the connection (drains WouldBlock), and feeds the existing InboundAccept channel. No dedicated thread per rule. Drops port_forward_accept_latency from ~50 ms to sub-millisecond (divan microbench is host TcpStream::connect → first frame in inject_to_guest, now bounded by epoll_wait latency in the active 5 ms cadence rather than a fixed 50 ms poll). Removes: - run_port_forward_listener (thread main loop) - spawn_port_forward_listeners (now bind_port_forward_listeners) - PORT_FORWARD_POLL_INTERVAL - port_forward_shutdown Arc<AtomicBool> - the Drop impl block that joined listener threads
9 bite-sized tasks covering the half-close state machine: - Established → FinWait1 (guest closed first) → LastAck on host EOF - Established → CloseWait (host closed first) → LastAck on guest FIN - LastAck → Closed on guest's final ACK or LAST_ACK_TIMEOUT (60s) Three new pins: tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE flips at Task 3), tcp_half_close_host_writes_first (passes after Task 3), tcp_last_ack_timeout_reaps_stale_entry (bench-helpers gated, Task 7). Severity: HIGH — current code drops host's pending response on guest shutdown(SHUT_WR), causing silent data loss for HTTP, SMTP, and any protocol with orderly half-close.
Guest sends FIN after data; current code marks state=Closed immediately on guest FIN, so the host's response data is dropped. This pin locks the broken behavior so the Phase 6.1 fix is legible to reviewers. Flips to PASS when Tasks 2–3 implementation lands.
Tasks 2+3+4 implemented together: clippy's -D warnings requires our_fin_sent to be used before committing the field addition, so the relay loop changes from Task 3 and the ACK handler from Task 4 are bundled with Task 2's struct changes. Structural changes (TcpNatEntry): - Add last_state_change: Instant — tracks when state last changed, used for LAST_ACK_TIMEOUT reaping (Task 6). - Add our_fin_sent: bool — prevents re-sending FIN on repeated epoll readiness events for the same half-close transition. - Initialize both fields at all three TcpNatEntry literal sites. FIN handler (handle_tcp_frame): - Established → FinWait1: ACK guest FIN, shutdown(Write) host socket so host sees EOF; stay alive for host's pending response data. - CloseWait → LastAck: host already closed; guest just closed; ACK and wait for guest's final ACK of our FIN. ACK handler (handle_tcp_frame): - LastAck → Closed: guest's final ACK reaps the entry. - Extend ACK-driven consume to FinWait1 state so bytes_in_flight drains and recv_peek eventually sees Ok(0) (host EOF). Relay loop (relay_tcp_nat_data): - Extend data relay to FinWait1 in addition to Established. - Ok(0) arm now dispatches on state: Established → CloseWait, FinWait1 → LastAck (instead of immediate Closed). - FIN-emit logic uses our_fin_sent guard and needs_fin predicate (CloseWait | LastAck) to send FIN exactly once per transition. Test: update tcp_half_close_guest_writes_first to ACK received data segments so the ACK-driven consume path drains the kernel buffer and recv_peek sees EOF. Without ACKs the FinWait1 relay sees the same bytes on every peek and never reaches Ok(0).
Symmetric mirror of tcp_half_close_guest_writes_first: host writes GREETING, shuts down its write side (Established → CloseWait), and waits for guest to reply; guest sends REPLY data + FIN. Also fix two implementation gaps found while writing this pin: 1. Guest→host data forwarding in CloseWait: the payload forward guard was Established-only. CloseWait must also forward guest data (host closed its write side but can still read). Extend the predicate to include CloseWait. 2. shutdown(Write) on CloseWait → LastAck: when the guest sends FIN in CloseWait the host application is blocked on read_to_end. We must call shutdown(Write) on host_stream so the kernel delivers EOF to the host. Without this the host blocks forever.
Add LAST_ACK_TIMEOUT = 60 s (TCP MSL × 2) as a module-scope constant. Merge the LastAck-timeout check into the existing idle-timeout sweep in relay_tcp_nat_data, making a single O(n) pass handle both conditions: - Standard idle-timeout (300 s of inactivity, any state): unchanged. - LastAck-timeout (60 s since last_state_change, LastAck state): reaps entries where the guest's final ACK never arrived, e.g. after a guest crash. Logs at WARN so operators can connect the timeout to a prior half-close sequence without a separate debugging session. The combined sweep avoids a second loop and an extra to_remove Vec allocation. Using last_state_change (set on every state transition) rather than last_activity (set on data relay) gives an accurate 60 s window from the moment we sent our FIN.
…ers) Add a bench-helpers–gated pin that verifies the LAST_ACK_TIMEOUT reaper: 1. Insert a synthetic LastAck entry via the new bench-helpers helper insert_synthetic_lastack_entry (TcpNatState::LastAck, our_fin_sent=true). 2. Back-date last_state_change by 70 s (> LAST_ACK_TIMEOUT = 60 s) via set_synthetic_last_state_change. 3. One drain_to_guest cycle runs the timeout sweep. 4. Assert the entry is gone (tcp_flow_state returns None). Two new bench-helpers methods on SlirpBackend: - insert_synthetic_lastack_entry: seeds a LastAck flow without a full half-close exchange, for timeout reaping tests. - set_synthetic_last_state_change: back-dates last_state_change on an existing entry to simulate an expired LAST_ACK_TIMEOUT. Also widen TcpNatState and tcp_flow_state to pub so the bench-helpers– gated external test can reference them by path. Both were previously pub(crate); they are now observable as part of the bench-helpers surface.
…FinWait2 All TcpNatState variants are now wired into the state machine: - SynReceived, SynSent: handshake paths (unchanged) - Established: normal data transfer (unchanged) - FinWait1: guest half-closed; relay continues forwarding host response - CloseWait: host half-closed; relay continues forwarding guest data - LastAck: both sides closed; waiting for guest's final ACK or timeout - Closed: entry pending removal Remove FinWait2: distinguishing it from FinWait1 would require observing per-segment ACKs from the kernel — the relay does not track those. The relay stays in FinWait1 until host EOF, then jumps to LastAck. If the distinction is needed later, the variant can be re-added. Update the TcpNatState doc comment to document the state machine diagram and the FinWait2 omission rationale, replacing the old per-variant stubs.
fmt: cargo fmt --all -- --check PASS
clippy: cargo clippy --workspace --all-targets --all-features PASS
tests: cargo test -p void-box --all-features PASS (13 unit tests)
cargo test --test network_baseline PASS (20 tests)
cargo test --test network_baseline --features bench-helpers
PASS (22 tests)
cargo test --lib network::slirp PASS (7 tests)
bench: cargo bench --bench network --features bench-helpers --no-run
PASS (compiles)
build: cargo build --release PASS
Note: cargo test --workspace --all-features fails on guest-agent because
the test binary requires write access to /workspace (pre-existing
environment constraint, not introduced by this branch). All other crates
pass. The AGENTS.md validation contract does not gate on guest-agent
passing in non-root environments.
VM suite (conformance, oci_integration, snapshot_integration) skipped:
VOID_BOX_KERNEL and VOID_BOX_INITRAMFS not set in this environment.
Two artifacts from the rebase onto the new #69 tip: 1. bind_port_forward_listeners was called twice in with_security after the rebase — the second call shadowed the first, which was harmless but emitted an unused-variable warning under -D warnings. Drop the duplicate. 2. The bench-helpers helper insert_synthetic_lastack_entry was missing flow_token in its TcpNatEntry initializer (the field was added by the Copilot fix #1 monotonic-token rewrite that landed on #69 after the original phase 6.1 commit chain was written). Allocate via next_flow_token(PROTO_TAG_TCP).
There was a problem hiding this comment.
Pull request overview
This PR extends the SLIRP TCP relay’s close handling so host/guest half-closes are tracked explicitly instead of being treated as an immediate full close. It fits into the ongoing smoltcp/passt networking work by addressing silent data loss during shutdown(SHUT_WR).
Changes:
- Adds half-close TCP states/transitions in
SlirpBackendfor guest-first and host-first shutdown paths. - Adds
LAST_ACK_TIMEOUTcleanup for staleLastAckflows. - Adds baseline pins and an implementation plan covering the new half-close behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
src/network/slirp.rs |
Implements the TCP half-close state machine, FIN handling, and LastAck timeout reaping. |
tests/network_baseline.rs |
Adds regression pins for guest-first and host-first half-close flows plus stale LastAck cleanup. |
docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.1.md |
Adds the detailed Phase 6.1 implementation and validation plan. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+1853
to
+1860
| TcpNatState::Established => { | ||
| entry.guest_ack = seq.wrapping_add(1); | ||
|
|
||
| // ACK the guest's FIN — but don't send our own FIN yet. Host | ||
| // application may have data still to send. We transition to | ||
| // FinWait1 and shut down the host socket's write side so the | ||
| // host knows no more data is coming from the guest. | ||
| let ack_frame = build_tcp_packet_static( |
Comment on lines
+1892
to
+1900
| TcpNatState::CloseWait => { | ||
| // Host already closed its write side; guest just closed | ||
| // too. Shut down our write side of host_stream so the | ||
| // host application's read sees EOF, ACK the guest's FIN, | ||
| // and transition to LastAck waiting for guest's final ACK | ||
| // of our FIN (which was already sent when we entered | ||
| // CloseWait). | ||
| entry.guest_ack = seq.wrapping_add(1); | ||
| let ack_frame = build_tcp_packet_static( |
| dst_ip, | ||
| dst_port | ||
| ); | ||
| return Ok(()); |
Comment on lines
+1706
to
+1713
| // ACK while in LastAck — guest acknowledged our FIN. Reap. | ||
| // Placed before the SynReceived ACK branch to be explicit (the | ||
| // states are mutually exclusive, but explicit ordering is clearer). | ||
| if tcp.ack() && entry.state == TcpNatState::LastAck { | ||
| debug!("SLIRP TCP: LastAck → Closed for {}:{}", dst_ip, dst_port); | ||
| entry.state = TcpNatState::Closed; | ||
| self.pending_close.push(flow_key); | ||
| return Ok(()); |
| | `tests/network_baseline.rs` | Three new pins. | | ||
| | Snapshot serde | Verify the new state transitions round-trip; if not, document the cleanup-on-restore behavior. | | ||
|
|
||
| No public API change. |
| - Verify the ACK path correctly transitions LastAck → Closed without consuming a SYN-state ACK by mistake. | ||
| - Verify `last_state_change` is updated on every state transition, not just initial. | ||
| - Verify guest writes during CloseWait are still relayed to host (host has full read side; only its write side is closed, which it learns from our shutdown — but writes from us to it are valid). | ||
| - Snapshot interaction: `TcpNatState` serde already exists; new fields (`last_state_change`, `our_fin_sent`) need `#[serde(default)]` for backward compatibility with pre-6.1 snapshots. |
Comment on lines
+1297
to
+1299
| /// BROKEN_ON_PURPOSE: guest sends FIN after data; current code marks state=Closed | ||
| /// immediately on guest FIN, so host's response data is dropped. | ||
| /// Flips to PASS when Task 2–3 implementation lands. |
Comment on lines
+1357
to
+1379
| // Guest sends "HELLO" data + FIN. | ||
| let request = b"HELLO"; | ||
| stack | ||
| .process_guest_frame(&build_tcp_frame( | ||
| SLIRP_GATEWAY_IP, | ||
| GUEST_EPHEMERAL_PORT, | ||
| host_port, | ||
| our_seq + 1, | ||
| gateway_seq + 1, | ||
| TcpControl::Psh, | ||
| request, | ||
| )) | ||
| .unwrap(); | ||
| stack | ||
| .process_guest_frame(&build_tcp_frame( | ||
| SLIRP_GATEWAY_IP, | ||
| GUEST_EPHEMERAL_PORT, | ||
| host_port, | ||
| our_seq + 1 + request.len() as u32, | ||
| gateway_seq + 1, | ||
| TcpControl::Fin, | ||
| &[], | ||
| )) |
Comment on lines
+1555
to
+1578
| // Guest sends reply data + FIN after receiving host's greeting and FIN. | ||
| // The relay is now in CloseWait; guest data should still be forwarded to host. | ||
| let reply = b"REPLY"; | ||
| stack | ||
| .process_guest_frame(&build_tcp_frame( | ||
| SLIRP_GATEWAY_IP, | ||
| GUEST_EPHEMERAL_PORT, | ||
| host_port, | ||
| our_seq + 1, | ||
| gateway_seq_next.wrapping_add(1), | ||
| TcpControl::Psh, | ||
| reply, | ||
| )) | ||
| .unwrap(); | ||
| stack | ||
| .process_guest_frame(&build_tcp_frame( | ||
| SLIRP_GATEWAY_IP, | ||
| GUEST_EPHEMERAL_PORT, | ||
| host_port, | ||
| our_seq + 1 + reply.len() as u32, | ||
| gateway_seq_next.wrapping_add(1), | ||
| TcpControl::Fin, | ||
| &[], | ||
| )) |
| #[derive(Debug, Clone, Copy, PartialEq)] | ||
| #[allow(dead_code)] | ||
| pub(crate) enum TcpNatState { | ||
| pub enum TcpNatState { |
Contributor
Author
|
superseded by #76 |
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status
DRAFT — stacked on #72 (port-forward listener on epoll), which is stacked on #69 (Phase 6.4 epoll).
Severity
HIGH — correctness fix; silent data loss. When the host closes its write-side
with
shutdown(SHUT_WR), the current code treats EOF as a full close andimmediately drops any data the guest may still write. The half-close state
machine in this PR preserves the guest's write path until the guest also sends
FIN.
Commits (9)
docs: Phase 6.1 detailed TDD plan — TCP half-closetest(network): pin tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE)—documents the current broken behavior; flips to PASS when implementation lands.
feat(slirp): FinWait1 → LastAck and Established → CloseWait on host EOF—core half-close state machine; host-EOF path transitions to FinWait1/CloseWait
and emits a FIN to the guest without tearing down the guest→host direction.
test(network): pin tcp_half_close_host_writes_firstfeat(slirp): LAST_ACK_TIMEOUT reaping prevents LastAck entry leak—60 s reaping for entries stuck in LastAck (guest never sends final ACK).
test(network): pin tcp_last_ack_timeout_reaps_stale_entry (bench-helpers)refactor(slirp): drop allow(dead_code) on TcpNatState; remove unused FinWait2chore(slirp): Phase 6.1 validation gate — all gates greenNew test pins
tcp_half_close_guest_writes_firsttcp_half_close_host_writes_firsttcp_last_ack_timeout_reaps_stale_entryValidation matrix (post-rebase)
cargo fmt --all -- --checkcargo clippy --workspace --all-targets --all-features -- -D warningscargo test --test network_baseline(default)cargo test --test network_baseline --features bench-helpers -- --test-threads=1cargo test --lib networkcargo build --releaseWall-clock bench results (post-rebase,
--iterations 3 --bulk-mb 10){ "tcp_bulk_throughput_g2h_mbps": 4864.9, "tcp_throughput_g2h_mbps": 5926.2, "tcp_rr_latency_us_p50": 4, "tcp_rr_latency_us_p99": 37, "tcp_crr_latency_us_p50": 10138 }Throughput and RR/CRR latency within expected range.