leaderelection: fix raft DoSend stall when one peer silently black-holes#1489

Draft
hidalgopl wants to merge 4 commits into main from
pb/fix-raft-dosend-keepalives

Conversation

@hidalgopl
Contributor

Summary

Fixes finding #1 from the multicluster failure-modes memo: a leader
in a 3-node StretchCluster could step down despite a healthy majority
when a single follower's network silently black-holed (TCP packets
dropped, no RST, no FIN). Reproduced live in vind by injecting 100%
packet loss with Pumba on the leader→follower_1 link; healthy
follower_2 was starved of heartbeats, CheckQuorum tripped, leader
stepped down.

Root cause

The raft Ready loop iterated rd.Messages serially and called the
gRPC DoSend synchronously. Without HTTP/2 keepalives, a
silently-dropped TCP stream stayed "open" until Linux's tcp_retries2
ceiling (~15 min). Each DoSend to the broken peer therefore blocked
on the orphan stream, and every other peer's send waited behind it
— so the healthy follower never received heartbeats inside one
ElectionTimeout.
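
For reference, the old loop had roughly this shape (an illustrative
sketch — node and transport stand in for the actual fields; DoSend,
rd.Messages and ReportUnreachable are the real names):

    for rd := range node.Ready() {
        for _, m := range rd.Messages {
            // Synchronous gRPC call. With no keepalive, a black-holed
            // peer can hold this for up to tcp_retries2 (~15 min), and
            // every later message in rd.Messages waits behind it.
            if err := transport.DoSend(ctx, m); err != nil {
                node.ReportUnreachable(m.To)
            }
        }
        node.Advance()
    }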

Fix

Three commits, each independently meaningful:

  1. HTTP/2 keepalives on peer gRPC connections (a38a7677 →
    rebased) — close the orphan stream within ~13s of the link going
    dark (10s ping interval + 3s pong deadline) instead of after 15
    minutes. Necessary but not sufficient: the loop still progresses
    at the slowest peer's pace.
  2. Per-peer fan-out (8c4b761b → rebased) — each peer gets a
    bounded send queue (256) and a dedicated worker. The Ready loop
    does a non-blocking EnqueueSend per message (sketched after this
    list) and finishes in microseconds regardless of peer health.
    Snapshot-on-reject and ReportUnreachable move into the worker,
    behaviour-equivalent.
  3. Drop observability (b9ac8086 → rebased) — atomic per-peer
    drop counter bumped on full-queue, with a 30s periodic logger
    that surfaces non-zero deltas. Healthy clusters stay log-silent;
    chronic saturation becomes visible in operator logs.
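
For concreteness, the fix-2 hot path looks roughly like this (a
sketch — EnqueueSend, the 256-cap queue and the drop counter are from
the commits; the type and field names, and the raftpb import path, are
assumptions):

    import (
        "sync/atomic"

        "go.etcd.io/raft/v3/raftpb" // illustrative; the repo's raftpb applies
    )

    type peer struct {
        sendq     chan raftpb.Message // bounded, cap 256
        dropCount atomic.Uint64       // bumped on full-queue drop (fix 3)
    }

    type transport struct {
        peers map[uint64]*peer
    }

    // EnqueueSend is the non-blocking call the Ready loop makes per
    // message; it returns immediately regardless of peer health.
    func (t *transport) EnqueueSend(m raftpb.Message) {
        p, ok := t.peers[m.To]
        if !ok {
            return
        }
        select {
        case p.sendq <- m:
        default:
            // Queue full: drop and count, never block. Raft
            // retransmits on the next tick.
            p.dropCount.Add(1)
        }
    }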

A characterization-test commit precedes the three fixes; it includes
the asymmetric-silent-drop regression test, which is red without the
fan-out and green with it.

Validation

  • Full pkg/multicluster/leaderelection package passes in ~40s.
  • The TestLeaderSurvives_AsymmetricSilentDropOnOnePeer regression
    test confirms the leader keeps its role through a 15s window of
    asymmetric silent drop on one peer.
  • Live re-run of hack/multicluster/repro/finding-01-dosend-stall.sh
    on a fresh 3-node vind env: leader stayed stable for the full 180s
    injection window, zero new elections, zero "became follower"
    events.

Risks acknowledged

  • Full-queue drops are silent on the hot path. Mitigated by the
    per-peer counter + periodic log (fix 3 above, the fourth commit
    in the series).
  • sendTimeout = HeartbeatInterval bounds every RPC, not just
    heartbeats; large MsgApp catch-up bursts on a slow link could
    legitimately time out. Tradeoff accepted — the unbounded send is
    exactly the bug we're fixing, and raft retransmits anything that
    times out.
  • No defer recover() in worker goroutines: an unrecovered panic in
    a worker takes down the whole process, not just one peer's send
    path. Conscious choice — it surfaces programming bugs loudly
    instead of papering over them.

hidalgopl and others added 3 commits April 28, 2026 11:06
Document current higher-level raft behavior under network conditions
so regressions in the transport layer can be caught at the state-
machine level, not just at the DoSend unit-test level. These tests
complement the integration harness that already exists for
snapshot/rejoin flows.

Adds a BlockIngress hook on TestHooks that chaos tests flip to make
incoming Send/Check RPCs hang until the caller's context cancels —
simulating a silently unresponsive peer without tearing down the TCP
listener (which would produce immediate connection-refused errors,
a different failure mode from what we want to cover).
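
The hook itself is small; roughly (TestHooks and BlockIngress are this
commit's names; the handler-side guard and its name are illustrative):

    type TestHooks struct {
        BlockIngress atomic.Bool // chaos tests flip this on
    }

    // Called at the top of the Send/Check RPC handlers: while the hook
    // is set, hang until the caller's context cancels — the listener
    // stays up, the peer just never answers.
    func (h *TestHooks) gate(ctx context.Context) error {
        if h.BlockIngress.Load() {
            <-ctx.Done()
            return ctx.Err()
        }
        return nil
    }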

Three tests land in a new chaos_test.go:

  TestLeaderRemainsStable_WhenAllPeersHealthy
    Happy-path regression control: once elected, the leader must not
    change in the absence of faults.

  TestLeaderSteppedDown_WhenMinorityIsolated
    Negative control: raft must relinquish leadership when the leader
    cannot reach a majority. Stopping both followers triggers
    CheckQuorum as expected.

  TestLeaderSurvives_AsymmetricSilentDropOnOnePeer
    Characterises finding #1 in the resilience memo — a silent
    black-hole on one peer causes the leader to step down today
    despite a healthy majority. Left in place, t.Skip'd with a
    pointer to the fix, so it will automatically start passing once
    the per-peer fan-out work lands in the raft transport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onnections

Without an explicit keepalive policy, a silently black-holed TCP link
(packets dropped with no RST — the realistic cloud-zone outage case)
keeps the underlying HTTP/2 stream "open" until the Linux TCP
retransmit ceiling fires (tcp_retries2 ≈ 15 minutes by default). During
that window every DoSend from the raft Ready loop blocks on the orphan
stream, and CheckQuorum can step the leader down even though a
majority of peers are healthy.

Add baseline keepalive.ClientParameters to every peer dial:

    Time:                10 * time.Second  (idle-ping interval)
    Timeout:             3  * time.Second  (pong deadline)
    PermitWithoutStream: true              (ping even when idle)

With these, a black-holed connection closes within ~13s of the link
going dark (10s until the first keepalive ping, 3s pong deadline) and
the next DoSend attempt dials fresh. This is the minimum necessary fix
for finding #1 in the resilience memo; the structural fan-out work that
follows removes Ready-loop head-of-line blocking entirely but benefits
from keepalive as defense in depth.

No behaviour change for healthy connections; no public API change.
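
In code, every peer dial gains roughly (the keepalive values are this
commit's; the variable and option wiring are illustrative):

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    var peerDialOpts = []grpc.DialOption{
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time:                10 * time.Second, // ping after 10s of inactivity
            Timeout:             3 * time.Second,  // close the conn 3s after a missed pong
            PermitWithoutStream: true,             // keep pinging even with no active stream
        }),
    }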

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The raft Ready loop iterated messages serially and called DoSend
synchronously. One slow or silently black-holed peer blocked the
caller's goroutine, preventing heartbeats to every other peer —
even with keepalives and a per-RPC timeout in place, the loop still
progressed at the pace of the slowest peer. Under the memo's
finding-#1 scenario (100% packet loss on leader→follower_1, all
other links healthy), the healthy follower was starved of heartbeats
and CheckQuorum stepped the leader down despite a healthy majority.

Decouple per-peer sends from the Ready loop:

  - Each peer gets a bounded send queue (chan raftpb.Message, 256)
    and a dedicated worker goroutine (runPeerSender, sketched below)
    that drains it.
  - EnqueueSend is the new API the Ready loop uses: non-blocking
    enqueue; on full queue the message is dropped and raft retransmits
    on the next tick.
  - The snapshot-on-reject fallback and ReportUnreachable call that
    used to run inline in the Ready loop now run in the worker. No
    behaviour change to those mechanisms; they just stop blocking the
    main goroutine.
  - DoSend keeps its existing synchronous semantics (now wrapped in a
    per-RPC sendTimeout derived from HeartbeatInterval) and is the
    worker's building block. External callers continue to have a
    sync-send API if they need one.
  - Workers start at transport.Run() and exit cleanly on ctx cancel.
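
The worker is now the only place a send can block; sketched
(runPeerSender, DoSend and sendTimeout are named above; the peer
fields and the node field are illustrative):

    func (t *transport) runPeerSender(ctx context.Context, p *peer) {
        for {
            select {
            case <-ctx.Done():
                return // clean exit when transport.Run's ctx cancels
            case m := <-p.sendq:
                // Every RPC is bounded by sendTimeout (derived from
                // HeartbeatInterval): a black-holed peer costs one
                // timeout per message, never a tcp_retries2 window.
                sctx, cancel := context.WithTimeout(ctx, t.sendTimeout)
                err := t.DoSend(sctx, m)
                cancel()
                if err != nil {
                    // Snapshot-on-reject and ReportUnreachable run
                    // here, off the Ready loop's goroutine.
                    t.node.ReportUnreachable(m.To)
                }
            }
        }
    }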

The finding-#1 chaos test (TestLeaderSurvives_AsymmetricSilentDropOnOnePeer)
moves from t.Skip to a full assertion: the leader keeps its role and
the healthy follower does not start an election throughout a 15s
window of asymmetric silent drop. Full leaderelection package tests
pass in ~40s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fan-out fix (8c4b761b) made the raft Ready loop non-blocking by
moving each peer's send onto a bounded channel (peerSendQueueSize=256)
served by a dedicated worker goroutine. EnqueueSend silently drops
when that queue is full, on the assumption that raft will retransmit
on the next heartbeat tick — which it does for heartbeats, and
eventually for MsgApp via the leader's progress tracker.

That assumption is correct, but in production it leaves us blind: a
chronically saturated peer (slow link, struggling worker, GC pause)
can sit indefinitely with full queues and we'd see only the
downstream symptoms (slow commits, occasional ReportUnreachable
churn) without knowing that the queue itself is the bottleneck. The
fan-out fix made the failure mode quieter, not louder.

Add minimal observability without compromising the hot path:

  - peer.dropCount (atomic.Uint64) is bumped from EnqueueSend's
    full-queue branch. Atomic so the producer side stays lock-free;
    the drop path costs a single atomic add.
  - runDropLogger (sketched below) is a single goroutine started from
    Run() that wakes every 30s, computes per-peer deltas against the previous
    sample, and emits one Infof per peer with non-zero delta. A
    healthy cluster never drops, so the goroutine is log-silent in
    the steady state.
  - 30s is short enough that a sustained-but-bursty saturation
    surfaces within one operator log scrape window, but long enough
    to avoid coupling log volume to drop volume on a misbehaving
    peer.
  - The logger is per-transport and exits on ctx.Done() like the
    per-peer workers.
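
Roughly (runDropLogger and the 30s cadence are this commit's; the
logger field and message format are illustrative):

    func (t *transport) runDropLogger(ctx context.Context) {
        prev := make(map[uint64]uint64, len(t.peers))
        tick := time.NewTicker(30 * time.Second)
        defer tick.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-tick.C:
                for id, p := range t.peers {
                    cur := p.dropCount.Load()
                    if delta := cur - prev[id]; delta > 0 {
                        // Only non-zero deltas are logged, so a healthy
                        // cluster stays log-silent.
                        t.logger.Infof("dropped %d raft messages to peer %d in the last 30s (send queue full)", delta, id)
                    }
                    prev[id] = cur
                }
            }
        }
    }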

Deliberately not added: panic recovery in the worker. A panicking
worker is a programming bug we want surfaced loudly, not papered
over with a restart loop that hides the original stack.

Tests: full leaderelection package passes in ~40s (same as before;
the new goroutine is idle on healthy paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hidalgopl force-pushed the pb/fix-raft-dosend-keepalives branch from 0f855ac to ec9c830 on April 28, 2026 at 11:56