Skip to content

Phase 6.1: TCP half-close — fix silent data loss on shutdown(SHUT_WR)#73

Closed
dpsoft wants to merge 10 commits intoport-forward-listener-epollfrom
smoltcp-passt-port-phase6.1-half-close
Closed

Phase 6.1: TCP half-close — fix silent data loss on shutdown(SHUT_WR)#73
dpsoft wants to merge 10 commits intoport-forward-listener-epollfrom
smoltcp-passt-port-phase6.1-half-close

Conversation

@dpsoft
Copy link
Copy Markdown
Contributor

@dpsoft dpsoft commented May 4, 2026

Status

DRAFT — stacked on #72 (port-forward listener on epoll), which is stacked on #69 (Phase 6.4 epoll).

Severity

HIGH — correctness fix; silent data loss. When the host closes its write-side
with shutdown(SHUT_WR), the current code treats EOF as a full close and
immediately drops any data the guest may still write. The half-close state
machine in this PR preserves the guest's write path until the guest also sends
FIN.

Commits (9)

  1. docs: Phase 6.1 detailed TDD plan — TCP half-close
  2. test(network): pin tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE)
    documents the current broken behavior; flips to PASS when implementation lands.
  3. feat(slirp): FinWait1 → LastAck and Established → CloseWait on host EOF
    core half-close state machine; host-EOF path transitions to FinWait1/CloseWait
    and emits a FIN to the guest without tearing down the guest→host direction.
  4. test(network): pin tcp_half_close_host_writes_first
  5. feat(slirp): LAST_ACK_TIMEOUT reaping prevents LastAck entry leak
    60 s reaping for entries stuck in LastAck (guest never sends final ACK).
  6. test(network): pin tcp_last_ack_timeout_reaps_stale_entry (bench-helpers)
  7. refactor(slirp): drop allow(dead_code) on TcpNatState; remove unused FinWait2
  8. chore(slirp): Phase 6.1 validation gate — all gates green

New test pins

Test Feature gate Status
tcp_half_close_guest_writes_first default currently BROKEN_ON_PURPOSE (Task 2–3 not landed)
tcp_half_close_host_writes_first default PASS
tcp_last_ack_timeout_reaps_stale_entry bench-helpers PASS

Validation matrix (post-rebase)

Gate Result
cargo fmt --all -- --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --test network_baseline (default) 20/20 ✓
cargo test --test network_baseline --features bench-helpers -- --test-threads=1 22/22 ✓
cargo test --lib network 23/23 ✓
cargo build --release

Wall-clock bench results (post-rebase, --iterations 3 --bulk-mb 10)

{
  "tcp_bulk_throughput_g2h_mbps": 4864.9,
  "tcp_throughput_g2h_mbps": 5926.2,
  "tcp_rr_latency_us_p50": 4,
  "tcp_rr_latency_us_p99": 37,
  "tcp_crr_latency_us_p50": 10138
}

Throughput and RR/CRR latency within expected range.

dpsoft added 10 commits May 4, 2026 15:46
Each TCP port-forward rule used to burn one thread that polled
TcpListener::accept() with 50 ms sleeps between WouldBlock returns.
That thread was both the accept-latency floor and the only piece of
networking still polling on a fixed cadence after Phase 6.4.

Listener FDs are exactly what EpollDispatch handles. Bind +
register the listener under PROTO_TAG_LISTEN at construction time;
the net-poll thread sees readiness, accepts the connection (drains
WouldBlock), and feeds the existing InboundAccept channel. No
dedicated thread per rule.

Drops port_forward_accept_latency from ~50 ms to sub-millisecond
(divan microbench is host TcpStream::connect → first frame in
inject_to_guest, now bounded by epoll_wait latency in the active
5 ms cadence rather than a fixed 50 ms poll).

Removes:
- run_port_forward_listener (thread main loop)
- spawn_port_forward_listeners (now bind_port_forward_listeners)
- PORT_FORWARD_POLL_INTERVAL
- port_forward_shutdown Arc<AtomicBool>
- the Drop impl block that joined listener threads
9 bite-sized tasks covering the half-close state machine:
- Established → FinWait1 (guest closed first) → LastAck on host EOF
- Established → CloseWait (host closed first) → LastAck on guest FIN
- LastAck → Closed on guest's final ACK or LAST_ACK_TIMEOUT (60s)

Three new pins: tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE
flips at Task 3), tcp_half_close_host_writes_first (passes after
Task 3), tcp_last_ack_timeout_reaps_stale_entry (bench-helpers
gated, Task 7).

Severity: HIGH — current code drops host's pending response on
guest shutdown(SHUT_WR), causing silent data loss for HTTP, SMTP,
and any protocol with orderly half-close.
Guest sends FIN after data; current code marks state=Closed immediately
on guest FIN, so the host's response data is dropped. This pin locks
the broken behavior so the Phase 6.1 fix is legible to reviewers.

Flips to PASS when Tasks 2–3 implementation lands.
Tasks 2+3+4 implemented together: clippy's -D warnings requires our_fin_sent
to be used before committing the field addition, so the relay loop changes
from Task 3 and the ACK handler from Task 4 are bundled with Task 2's struct
changes.

Structural changes (TcpNatEntry):
- Add last_state_change: Instant — tracks when state last changed, used
  for LAST_ACK_TIMEOUT reaping (Task 6).
- Add our_fin_sent: bool — prevents re-sending FIN on repeated epoll
  readiness events for the same half-close transition.
- Initialize both fields at all three TcpNatEntry literal sites.

FIN handler (handle_tcp_frame):
- Established → FinWait1: ACK guest FIN, shutdown(Write) host socket
  so host sees EOF; stay alive for host's pending response data.
- CloseWait → LastAck: host already closed; guest just closed; ACK
  and wait for guest's final ACK of our FIN.

ACK handler (handle_tcp_frame):
- LastAck → Closed: guest's final ACK reaps the entry.
- Extend ACK-driven consume to FinWait1 state so bytes_in_flight
  drains and recv_peek eventually sees Ok(0) (host EOF).

Relay loop (relay_tcp_nat_data):
- Extend data relay to FinWait1 in addition to Established.
- Ok(0) arm now dispatches on state: Established → CloseWait,
  FinWait1 → LastAck (instead of immediate Closed).
- FIN-emit logic uses our_fin_sent guard and needs_fin predicate
  (CloseWait | LastAck) to send FIN exactly once per transition.

Test: update tcp_half_close_guest_writes_first to ACK received data
segments so the ACK-driven consume path drains the kernel buffer
and recv_peek sees EOF. Without ACKs the FinWait1 relay sees the
same bytes on every peek and never reaches Ok(0).
Symmetric mirror of tcp_half_close_guest_writes_first: host writes
GREETING, shuts down its write side (Established → CloseWait), and waits
for guest to reply; guest sends REPLY data + FIN.

Also fix two implementation gaps found while writing this pin:

1. Guest→host data forwarding in CloseWait: the payload forward guard
   was Established-only. CloseWait must also forward guest data (host
   closed its write side but can still read). Extend the predicate to
   include CloseWait.

2. shutdown(Write) on CloseWait → LastAck: when the guest sends FIN in
   CloseWait the host application is blocked on read_to_end. We must
   call shutdown(Write) on host_stream so the kernel delivers EOF to
   the host. Without this the host blocks forever.
Add LAST_ACK_TIMEOUT = 60 s (TCP MSL × 2) as a module-scope constant.

Merge the LastAck-timeout check into the existing idle-timeout sweep in
relay_tcp_nat_data, making a single O(n) pass handle both conditions:

- Standard idle-timeout (300 s of inactivity, any state): unchanged.
- LastAck-timeout (60 s since last_state_change, LastAck state): reaps
  entries where the guest's final ACK never arrived, e.g. after a guest
  crash. Logs at WARN so operators can connect the timeout to a prior
  half-close sequence without a separate debugging session.

The combined sweep avoids a second loop and an extra to_remove Vec
allocation. Using last_state_change (set on every state transition)
rather than last_activity (set on data relay) gives an accurate 60 s
window from the moment we sent our FIN.
…ers)

Add a bench-helpers–gated pin that verifies the LAST_ACK_TIMEOUT reaper:

1. Insert a synthetic LastAck entry via the new bench-helpers helper
   insert_synthetic_lastack_entry (TcpNatState::LastAck, our_fin_sent=true).
2. Back-date last_state_change by 70 s (> LAST_ACK_TIMEOUT = 60 s) via
   set_synthetic_last_state_change.
3. One drain_to_guest cycle runs the timeout sweep.
4. Assert the entry is gone (tcp_flow_state returns None).

Two new bench-helpers methods on SlirpBackend:
- insert_synthetic_lastack_entry: seeds a LastAck flow without a
  full half-close exchange, for timeout reaping tests.
- set_synthetic_last_state_change: back-dates last_state_change on an
  existing entry to simulate an expired LAST_ACK_TIMEOUT.

Also widen TcpNatState and tcp_flow_state to pub so the bench-helpers–
gated external test can reference them by path. Both were previously
pub(crate); they are now observable as part of the bench-helpers surface.
…FinWait2

All TcpNatState variants are now wired into the state machine:
- SynReceived, SynSent: handshake paths (unchanged)
- Established: normal data transfer (unchanged)
- FinWait1: guest half-closed; relay continues forwarding host response
- CloseWait: host half-closed; relay continues forwarding guest data
- LastAck: both sides closed; waiting for guest's final ACK or timeout
- Closed: entry pending removal

Remove FinWait2: distinguishing it from FinWait1 would require
observing per-segment ACKs from the kernel — the relay does not track
those. The relay stays in FinWait1 until host EOF, then jumps to LastAck.
If the distinction is needed later, the variant can be re-added.

Update the TcpNatState doc comment to document the state machine diagram
and the FinWait2 omission rationale, replacing the old per-variant stubs.
fmt:    cargo fmt --all -- --check                                PASS
clippy: cargo clippy --workspace --all-targets --all-features    PASS
tests:  cargo test -p void-box --all-features                    PASS (13 unit tests)
        cargo test --test network_baseline                       PASS (20 tests)
        cargo test --test network_baseline --features bench-helpers
                                                                 PASS (22 tests)
        cargo test --lib network::slirp                          PASS (7 tests)
bench:  cargo bench --bench network --features bench-helpers --no-run
                                                                 PASS (compiles)
build:  cargo build --release                                    PASS

Note: cargo test --workspace --all-features fails on guest-agent because
the test binary requires write access to /workspace (pre-existing
environment constraint, not introduced by this branch). All other crates
pass. The AGENTS.md validation contract does not gate on guest-agent
passing in non-root environments.

VM suite (conformance, oci_integration, snapshot_integration) skipped:
VOID_BOX_KERNEL and VOID_BOX_INITRAMFS not set in this environment.
Two artifacts from the rebase onto the new #69 tip:

1. bind_port_forward_listeners was called twice in with_security
   after the rebase — the second call shadowed the first, which
   was harmless but emitted an unused-variable warning under
   -D warnings. Drop the duplicate.

2. The bench-helpers helper insert_synthetic_lastack_entry was
   missing flow_token in its TcpNatEntry initializer (the field
   was added by the Copilot fix #1 monotonic-token rewrite that
   landed on #69 after the original phase 6.1 commit chain was
   written). Allocate via next_flow_token(PROTO_TAG_TCP).
@dpsoft dpsoft requested a review from Copilot May 4, 2026 19:41
@dpsoft dpsoft marked this pull request as ready for review May 4, 2026 19:41
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the SLIRP TCP relay’s close handling so host/guest half-closes are tracked explicitly instead of being treated as an immediate full close. It fits into the ongoing smoltcp/passt networking work by addressing silent data loss during shutdown(SHUT_WR).

Changes:

  • Adds half-close TCP states/transitions in SlirpBackend for guest-first and host-first shutdown paths.
  • Adds LAST_ACK_TIMEOUT cleanup for stale LastAck flows.
  • Adds baseline pins and an implementation plan covering the new half-close behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

File Description
src/network/slirp.rs Implements the TCP half-close state machine, FIN handling, and LastAck timeout reaping.
tests/network_baseline.rs Adds regression pins for guest-first and host-first half-close flows plus stale LastAck cleanup.
docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.1.md Adds the detailed Phase 6.1 implementation and validation plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/network/slirp.rs
Comment on lines +1853 to +1860
TcpNatState::Established => {
entry.guest_ack = seq.wrapping_add(1);

// ACK the guest's FIN — but don't send our own FIN yet. Host
// application may have data still to send. We transition to
// FinWait1 and shut down the host socket's write side so the
// host knows no more data is coming from the guest.
let ack_frame = build_tcp_packet_static(
Comment thread src/network/slirp.rs
Comment on lines +1892 to +1900
TcpNatState::CloseWait => {
// Host already closed its write side; guest just closed
// too. Shut down our write side of host_stream so the
// host application's read sees EOF, ACK the guest's FIN,
// and transition to LastAck waiting for guest's final ACK
// of our FIN (which was already sent when we entered
// CloseWait).
entry.guest_ack = seq.wrapping_add(1);
let ack_frame = build_tcp_packet_static(
Comment thread src/network/slirp.rs
dst_ip,
dst_port
);
return Ok(());
Comment thread src/network/slirp.rs
Comment on lines +1706 to +1713
// ACK while in LastAck — guest acknowledged our FIN. Reap.
// Placed before the SynReceived ACK branch to be explicit (the
// states are mutually exclusive, but explicit ordering is clearer).
if tcp.ack() && entry.state == TcpNatState::LastAck {
debug!("SLIRP TCP: LastAck → Closed for {}:{}", dst_ip, dst_port);
entry.state = TcpNatState::Closed;
self.pending_close.push(flow_key);
return Ok(());
| `tests/network_baseline.rs` | Three new pins. |
| Snapshot serde | Verify the new state transitions round-trip; if not, document the cleanup-on-restore behavior. |

No public API change.
- Verify the ACK path correctly transitions LastAck → Closed without consuming a SYN-state ACK by mistake.
- Verify `last_state_change` is updated on every state transition, not just initial.
- Verify guest writes during CloseWait are still relayed to host (host has full read side; only its write side is closed, which it learns from our shutdown — but writes from us to it are valid).
- Snapshot interaction: `TcpNatState` serde already exists; new fields (`last_state_change`, `our_fin_sent`) need `#[serde(default)]` for backward compatibility with pre-6.1 snapshots.
Comment thread tests/network_baseline.rs
Comment on lines +1297 to +1299
/// BROKEN_ON_PURPOSE: guest sends FIN after data; current code marks state=Closed
/// immediately on guest FIN, so host's response data is dropped.
/// Flips to PASS when Task 2–3 implementation lands.
Comment thread tests/network_baseline.rs
Comment on lines +1357 to +1379
// Guest sends "HELLO" data + FIN.
let request = b"HELLO";
stack
.process_guest_frame(&build_tcp_frame(
SLIRP_GATEWAY_IP,
GUEST_EPHEMERAL_PORT,
host_port,
our_seq + 1,
gateway_seq + 1,
TcpControl::Psh,
request,
))
.unwrap();
stack
.process_guest_frame(&build_tcp_frame(
SLIRP_GATEWAY_IP,
GUEST_EPHEMERAL_PORT,
host_port,
our_seq + 1 + request.len() as u32,
gateway_seq + 1,
TcpControl::Fin,
&[],
))
Comment thread tests/network_baseline.rs
Comment on lines +1555 to +1578
// Guest sends reply data + FIN after receiving host's greeting and FIN.
// The relay is now in CloseWait; guest data should still be forwarded to host.
let reply = b"REPLY";
stack
.process_guest_frame(&build_tcp_frame(
SLIRP_GATEWAY_IP,
GUEST_EPHEMERAL_PORT,
host_port,
our_seq + 1,
gateway_seq_next.wrapping_add(1),
TcpControl::Psh,
reply,
))
.unwrap();
stack
.process_guest_frame(&build_tcp_frame(
SLIRP_GATEWAY_IP,
GUEST_EPHEMERAL_PORT,
host_port,
our_seq + 1 + reply.len() as u32,
gateway_seq_next.wrapping_add(1),
TcpControl::Fin,
&[],
))
Comment thread src/network/slirp.rs
#[derive(Debug, Clone, Copy, PartialEq)]
#[allow(dead_code)]
pub(crate) enum TcpNatState {
pub enum TcpNatState {
@dpsoft
Copy link
Copy Markdown
Contributor Author

dpsoft commented May 6, 2026

superseded by #76

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants