fix(tcp): auto-grow epoll event buffer to prevent connection starvation #67

Open
louisponet wants to merge 1 commit into main from lopo/epoll-event-buffer-auto-grow

@louisponet louisponet commented Mar 31, 2026

Problem

Events::with_capacity(128) in TcpConnector caused epoll event starvation when >128 connections were active (e.g. the data gather receiver).

Mio uses edge-triggered epoll, so connections that don't fit in the 128-event batch are deferred to the next poll() cycle. With heavy connections dominating each batch, lighter connections experienced ~1s read delays. The kernel measured this as rcv_rtt: 999ms, auto-tuned rcv_space down to ~14KB, and advertised a tiny TCP receive window, which left the sender rwnd_limited 99.7% of the time.

Diagnosis (from ss -tnip on both sides)

Sender side:

  • rwnd_limited: 99.7% — blocked on receiver's window almost all the time
  • snd_wnd: 73728 (72KB) — tiny window from receiver
  • Send-Q: 931816 — ~1MB queued in kernel

Receiver side (starved connection):

  • rcv_space: 14480 (14KB) — kernel auto-tuned window down
  • rcv_rtt: 999ms — kernel thinks app takes ~1s between reads

Receiver side (healthy connection):

  • rcv_space: 4734960 (4.7MB) — normal
  • bytes_received: 3.8GB — orders of magnitude more throughput
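
When many sockets need checking, the comparison above can be scripted. This is a minimal sketch, assuming the `key:value` tokens that `ss -i` prints (for example `rcv_space:14480 rcv_rtt:999`). The function names and the 64 KiB / 500 ms thresholds are illustrative and not part of this PR:

```rust
/// Extract the numeric value of a `key:value` token from one line of
/// `ss -tnip` output. Returns None if the key is absent or not numeric.
fn ss_field(line: &str, key: &str) -> Option<f64> {
    line.split_whitespace().find_map(|tok| {
        let (k, v) = tok.split_once(':')?;
        if k == key {
            v.parse::<f64>().ok()
        } else {
            None
        }
    })
}

/// Heuristic with illustrative thresholds: a receiver socket looks starved
/// when the kernel has shrunk rcv_space AND measures a long gap between
/// application reads (rcv_rtt, in milliseconds).
fn looks_starved(line: &str) -> bool {
    let small_window = ss_field(line, "rcv_space").map_or(false, |v| v < 64.0 * 1024.0);
    let slow_reader = ss_field(line, "rcv_rtt").map_or(false, |v| v > 500.0);
    small_window && slow_reader
}
```

Under these thresholds, the starved connection above (rcv_space 14480, rcv_rtt 999) is flagged. A socket with a multi-megabyte rcv_space is not.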

Fix

  • After each poll(), if the number of returned events equals the buffer capacity, double it
  • Log at info level when growth occurs ("epoll event buffer full, growing")
  • Add .with_event_capacity(n) builder method for callers that want a different starting size
  • Default remains 128 — auto-grows as needed with zero overhead once stabilized
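
The growth rule in the first bullet can be modeled without mio. This is a minimal sketch, assuming a steady number of ready connections per poll. `grow_if_full` and `growths_until_stable` are hypothetical names, not the code in this PR:

```rust
/// Double the capacity when a poll filled the whole buffer, otherwise keep it.
fn grow_if_full(capacity: usize, n_events: usize) -> usize {
    if n_events >= capacity {
        capacity.saturating_mul(2)
    } else {
        capacity
    }
}

/// Count how many growth steps happen before the buffer holds `ready`
/// simultaneously ready connections with room to spare.
/// Assumes `start > 0`; a zero capacity cannot grow by doubling.
fn growths_until_stable(start: usize, ready: usize) -> u32 {
    let mut capacity = start;
    let mut growths = 0;
    loop {
        // A single poll can return at most `capacity` events.
        let n_events = ready.min(capacity);
        let next = grow_if_full(capacity, n_events);
        if next == capacity {
            return growths;
        }
        capacity = next;
        growths += 1;
    }
}
```

Starting from the default of 128 with 500 ready connections, this model grows twice (128 → 256 → 512) and then stays put. A poll that returns exactly `capacity` events also triggers one growth, because a full buffer cannot be told apart from an overflowing one.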

@louisponet louisponet requested a review from a team March 31, 2026 03:50

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

Comment on lines +635 to +643
fn maybe_grow_events(&mut self, n_events: usize) {
    // mio returns at most `capacity` events; hitting the limit means
    // there were likely more fds ready than we could service.
    if n_events >= self.event_capacity {
        let new_cap = self.event_capacity * 2;
        info!(old = self.event_capacity, new = new_cap, "epoll event buffer full, growing");
        self.event_capacity = new_cap;
        self.events = Events::with_capacity(new_cap);
    }

@devin-ai-integration devin-ai-integration bot Mar 31, 2026

🟡 maybe_grow_events never grows from zero capacity due to 0 * 2 == 0

If with_event_capacity(0) is called, maybe_grow_events enters a degenerate state: n_events >= 0 is always true for usize, so the growth branch fires every poll cycle, but 0 * 2 = 0 means the capacity never actually increases. This causes: (1) the connector never processes any IO events since Events::with_capacity(0) can never return events from poll(), making the connector completely non-functional, and (2) an info! log is emitted on every single poll_with call, creating infinite log spam.
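
One possible guard, sketched without mio, is to clamp the configured capacity to at least 1 so that doubling always makes progress. `MIN_EVENT_CAPACITY`, `sanitize_capacity`, and `next_capacity` are hypothetical names, not code from this PR:

```rust
/// Smallest buffer size that can still report an event (assumed floor).
const MIN_EVENT_CAPACITY: usize = 1;

/// Clamp a caller-supplied capacity so that 0 cannot reach the grow logic.
fn sanitize_capacity(requested: usize) -> usize {
    requested.max(MIN_EVENT_CAPACITY)
}

/// Return Some(new_capacity) only when the buffer was full and doubling
/// actually increases it; otherwise None (no reallocation, no log line).
fn next_capacity(capacity: usize, n_events: usize) -> Option<usize> {
    if n_events < capacity {
        return None;
    }
    let doubled = sanitize_capacity(capacity).saturating_mul(2);
    (doubled > capacity).then_some(doubled)
}
```

Returning an `Option` lets the caller log and reallocate only when the capacity really changed, which would also avoid the per-poll log line described above.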

@louisponet louisponet force-pushed the lopo/epoll-event-buffer-auto-grow branch 2 times, most recently from b702902 to 76f268a Compare March 31, 2026 05:42
…igurable nodelay, EOF logging

Four fixes/improvements for TcpConnector:

1. Auto-grow epoll event buffer to prevent connection starvation.

2. Apply TCP_USER_TIMEOUT to accepted inbound connections (was only set on
   outbound).

3. Make TCP_NODELAY configurable via with_nodelay(bool). Default remains
   true (Nagle disabled).

4. Add debug logging to silent EOF paths in read_frame. Previously,
   Ok(0) during header/payload reads returned Disconnected with no log,
   making it impossible to distinguish peer-closed from I/O errors.
   Now logs peer address and read progress.
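
Item 4 can be illustrated with a std-only sketch. The frame layout used here (a 4-byte big-endian length header followed by the payload), the `ReadOutcome` type, and `read_frame_sketch` are assumptions for illustration. The real read_frame and its logging calls are not shown in this PR description, and this sketch uses blocking reads for brevity:

```rust
use std::io::{self, Read};

/// Result of one read attempt, so the caller can log peer-closed
/// separately from I/O errors (which stay in the Err branch).
#[derive(Debug, PartialEq)]
enum ReadOutcome {
    Frame(Vec<u8>),
    /// Peer closed: Ok(0) arrived after `got` of `want` bytes of `part`.
    Eof { part: &'static str, got: usize, want: usize },
}

/// Fill `buf` as far as possible; returns the byte count reached at EOF.
fn fill(r: &mut impl Read, buf: &mut [u8]) -> io::Result<usize> {
    let mut got = 0;
    while got < buf.len() {
        match r.read(&mut buf[got..]) {
            Ok(0) => break, // EOF: the peer closed the stream, not an error
            Ok(n) => got += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(got)
}

/// Read one length-prefixed frame (hypothetical layout, see the text above).
fn read_frame_sketch(r: &mut impl Read) -> io::Result<ReadOutcome> {
    let mut header = [0u8; 4];
    let got = fill(r, &mut header)?;
    if got < header.len() {
        return Ok(ReadOutcome::Eof { part: "header", got, want: header.len() });
    }
    let len = u32::from_be_bytes(header) as usize;
    let mut payload = vec![0u8; len];
    let got = fill(r, &mut payload)?;
    if got < len {
        return Ok(ReadOutcome::Eof { part: "payload", got, want: len });
    }
    Ok(ReadOutcome::Frame(payload))
}
```

A caller could log the `Eof` fields together with the peer address at debug level, while `Err` values keep their existing error handling.
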
@louisponet louisponet force-pushed the lopo/epoll-event-buffer-auto-grow branch from 76f268a to 895a7c7 Compare March 31, 2026 06:11