leaderelection: fix raft DoSend stall when one peer silently black-holes #1489
Document current higher-level raft behavior under adverse network conditions
so regressions in the transport layer can be caught at the state-
machine level, not just at the DoSend unit-test level. These tests
complement the integration harness that already exists for
snapshot/rejoin flows.
Adds a BlockIngress hook on TestHooks that chaos tests flip to make
incoming Send/Check RPCs hang until the caller's context cancels —
simulating a silently unresponsive peer without tearing down the TCP
listener (which would produce immediate connection-refused errors,
a different failure mode from what we want to cover).
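A minimal sketch of how such a hook can gate the server side, assuming an atomic flag on TestHooks; only BlockIngress and TestHooks come from this change, while the gateIngress helper and its wiring are hypothetical:

```go
package leaderelection

import (
	"context"
	"sync/atomic"
)

// TestHooks carries test-only fault-injection points.
type TestHooks struct {
	// BlockIngress, when set, makes incoming Send/Check handlers
	// hang until the caller's context is cancelled.
	BlockIngress atomic.Bool
}

// gateIngress is a hypothetical helper an RPC handler would call
// first: while the hook is set it blocks until the caller gives up,
// producing a silent black-hole rather than a connection refusal.
func (h *TestHooks) gateIngress(ctx context.Context) error {
	if h.BlockIngress.Load() {
		<-ctx.Done()
		return ctx.Err()
	}
	return nil
}
```

A chaos test flips the flag with Store(true) and restores it in a cleanup.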
Three tests land in a new chaos_test.go:
TestLeaderRemainsStable_WhenAllPeersHealthy
Happy-path regression control: once elected, the leader must not
change in the absence of faults.
TestLeaderSteppedDown_WhenMinorityIsolated
Negative control: raft must relinquish leadership when the leader
cannot reach a majority. Stopping both followers triggers
CheckQuorum as expected.
TestLeaderSurvives_AsymmetricSilentDropOnOnePeer
Characterises finding #1 in the resilience memo — a silent
black-hole on one peer causes the leader to step down today
despite a healthy majority. Left in place, t.Skip'd with a
pointer to the fix, so it will automatically start passing once
the per-peer fan-out work lands in the raft transport.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onnections
Without an explicit keepalive policy, a silently black-holed TCP link
(packets dropped with no RST — the realistic cloud-zone outage case)
keeps the underlying HTTP/2 stream "open" until the Linux TCP
retransmit ceiling fires (tcp_retries2 ≈ 15 minutes by default). During
that window every DoSend from the raft Ready loop blocks on the orphan
stream, and CheckQuorum can step the leader down even though a
majority of peers are healthy.
Add baseline keepalive.ClientParameters to every peer dial:
Time: 10 * time.Second (idle-ping interval)
Timeout: 3 * time.Second (pong deadline)
PermitWithoutStream: true (ping even when idle)
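A sketch of the dial-site wiring; the dialPeer helper and the insecure credentials are illustrative, and only the keepalive.ClientParameters values come from this commit:

```go
package leaderelection

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialPeer shows where the baseline keepalive policy attaches; the
// helper name and credentials choice are placeholders.
func dialPeer(addr string) (*grpc.ClientConn, error) {
	return grpc.NewClient(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // idle-ping interval
			Timeout:             3 * time.Second,  // pong deadline
			PermitWithoutStream: true,             // ping even when idle
		}),
	)
}
```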
With these, a black-holed connection closes within ~13s of the first
missed pong and the next DoSend attempt dials fresh. This is the
minimum necessary fix for finding #1 in the resilience memo; the
structural fan-out work that follows removes Ready-loop head-of-line
blocking entirely but benefits from having keepalive as defense in
depth.
No behaviour change for healthy connections; no public API change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The raft Ready loop iterated messages serially and called DoSend
synchronously. One slow or silently black-holed peer blocked the
caller's goroutine, preventing heartbeats to every other peer — even
with keepalives and a per-RPC timeout in place, the loop still
progressed at the pace of the slowest peer. Under the memo's
finding-#1 scenario (100% packet loss on leader->follower_1, all
other links healthy), the healthy follower was starved of heartbeats
and CheckQuorum stepped the leader down despite a healthy majority.
Decouple per-peer sends from the Ready loop:
- Each peer gets a bounded send queue (chan raftpb.Message, 256) and
  a dedicated worker goroutine (runPeerSender) that drains it.
- EnqueueSend is the new API the Ready loop uses: non-blocking
  enqueue; on a full queue the message is dropped and raft
  retransmits on the next tick.
- The snapshot-on-reject fallback and ReportUnreachable call that
  used to run inline in the Ready loop now run in the worker. No
  behaviour change to those mechanisms; they just stop blocking the
  main goroutine.
- DoSend keeps its existing synchronous semantics (now wrapped in a
  per-RPC sendTimeout derived from HeartbeatInterval) and is the
  worker's building block. External callers continue to have a
  sync-send API if they need one.
- Workers start at transport.Run() and exit cleanly on ctx cancel.
The finding-#1 chaos test
(TestLeaderSurvives_AsymmetricSilentDropOnOnePeer) moves from t.Skip
to a full assertion: the leader keeps its role and the healthy
follower does not start an election throughout a 15s window of
asymmetric silent drop.
Full leaderelection package tests pass in ~40s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
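A condensed sketch of the queue-and-worker shape this commit describes; peerSendQueueSize, EnqueueSend, and runPeerSender are the commit's names, while the reporter interface, the doSend field, and the raft import path are assumptions:

```go
package leaderelection

import (
	"context"
	"time"

	"go.etcd.io/raft/v3/raftpb" // import path assumed
)

const peerSendQueueSize = 256

// reporter stands in for the raft node's ReportUnreachable.
type reporter interface{ ReportUnreachable(id uint64) }

type transport struct {
	sendTimeout time.Duration // derived from HeartbeatInterval
	node        reporter
	doSend      func(ctx context.Context, m raftpb.Message) error // existing synchronous send
}

type peer struct {
	sendQ chan raftpb.Message // buffered to peerSendQueueSize
}

// EnqueueSend is the Ready loop's non-blocking entry point: a full
// queue drops the message and raft retransmits on a later tick.
func (p *peer) EnqueueSend(m raftpb.Message) {
	select {
	case p.sendQ <- m:
	default:
		// dropped; the progress tracker / next heartbeat retries
	}
}

// runPeerSender drains one peer's queue until ctx is cancelled, so a
// slow or black-holed peer only ever stalls its own worker.
func (p *peer) runPeerSender(ctx context.Context, t *transport) {
	for {
		select {
		case <-ctx.Done():
			return
		case m := <-p.sendQ:
			sctx, cancel := context.WithTimeout(ctx, t.sendTimeout)
			err := t.doSend(sctx, m)
			cancel()
			if err != nil {
				t.node.ReportUnreachable(m.To)
			}
		}
	}
}
```

The bounded queue trades occasional drops for a hard upper bound on Ready-loop latency; raft's own retransmission is what makes the drop safe.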
The fan-out fix (8c4b761b) made the raft Ready loop non-blocking by
moving each peer's send onto a bounded channel (peerSendQueueSize=256)
served by a dedicated worker goroutine. EnqueueSend silently drops
when that queue is full, on the assumption that raft will retransmit
on the next heartbeat tick — which it does for heartbeats, and
eventually for MsgApp via the leader's progress tracker.
That assumption is correct, but in production it leaves us blind: a
chronically saturated peer (slow link, struggling worker, GC pause)
can sit indefinitely with full queues and we'd see only the
downstream symptoms (slow commits, occasional ReportUnreachable
churn) without knowing that the queue itself is the bottleneck. The
fan-out fix made the failure mode quieter, not louder.
Add minimal observability without compromising the hot path:
- peer.dropCount (atomic.Uint64) is bumped from EnqueueSend's
  full-queue branch. Atomic so the producer side stays lock-free
  and the cost on the drop path is a single atomic add.
- runDropLogger is a single goroutine started from Run() that
wakes every 30s, computes per-peer deltas against the previous
sample, and emits one Infof per peer with non-zero delta. A
healthy cluster never drops, so the goroutine is log-silent in
the steady state.
- 30s is short enough that a sustained-but-bursty saturation
surfaces within one operator log scrape window, but long enough
to avoid coupling log volume to drop volume on a misbehaving
peer.
- The logger is per-transport and exits on ctx.Done() like the
per-peer workers.
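A sketch of the counter-and-logger pair; dropCount and runDropLogger are the commit's names, while the logger interface and the peers map are placeholders:

```go
package leaderelection

import (
	"context"
	"sync/atomic"
	"time"
)

type infoLogger interface{ Infof(format string, args ...any) }

type peerState struct {
	dropCount atomic.Uint64 // bumped from EnqueueSend's full-queue branch
}

type dropLoggingTransport struct {
	peers  map[uint64]*peerState
	logger infoLogger
}

// runDropLogger wakes every 30s, diffs each peer's counter against
// the previous sample, and logs only non-zero deltas, so it stays
// silent on a healthy cluster.
func (t *dropLoggingTransport) runDropLogger(ctx context.Context) {
	prev := make(map[uint64]uint64, len(t.peers))
	tick := time.NewTicker(30 * time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick.C:
			for id, p := range t.peers {
				cur := p.dropCount.Load()
				if d := cur - prev[id]; d > 0 {
					t.logger.Infof("raft transport: peer %d dropped %d messages in last 30s (send queue full)", id, d)
				}
				prev[id] = cur
			}
		}
	}
}
```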
Deliberately not added: panic recovery in the worker. A panicking
worker is a programming bug we want surfaced loudly, not papered
over with a restart loop that hides the original stack.
Tests: full leaderelection package passes in ~40s (same as before;
the new goroutine is idle on healthy paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 0f855ac to ec9c830
RafalKorepta approved these changes (Apr 28, 2026)
Summary
Fixes finding #1 from the multicluster failure-modes memo: a leader
in a 3-node StretchCluster could step down despite a healthy majority
when a single follower's network silently black-holed (TCP packets
dropped, no RST, no FIN). Reproduced live in vind by injecting 100%
packet loss with Pumba on the leader→follower_1 link; healthy
follower_2 was starved of heartbeats, CheckQuorum tripped, leader
stepped down.
Root cause
The raft Ready loop iterated rd.Messages serially and called the
gRPC DoSend synchronously. Without HTTP/2 keepalives, a
silently-dropped TCP stream stayed "open" until Linux's tcp_retries2
ceiling (~15 min). Each DoSend to the broken peer therefore blocked
on the orphan stream, and every other peer's send waited behind it —
so the healthy follower never received heartbeats inside one
ElectionTimeout.
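For concreteness, the pre-fix send path had roughly this shape; a reconstruction from the description above, with the transport type, field names, and import paths all assumed:

```go
package leaderelection

import (
	"context"

	"go.etcd.io/raft/v3"        // import paths assumed
	"go.etcd.io/raft/v3/raftpb"
)

type unreachableReporter interface{ ReportUnreachable(id uint64) }

type oldTransport struct {
	node   unreachableReporter
	DoSend func(ctx context.Context, m raftpb.Message) error // synchronous gRPC send
}

// processReady reconstructs the pre-fix shape: serial and
// synchronous, so one black-holed peer stalls every message queued
// behind it, starving healthy followers of heartbeats within a
// single ElectionTimeout.
func (t *oldTransport) processReady(ctx context.Context, rd raft.Ready) {
	for _, m := range rd.Messages {
		if err := t.DoSend(ctx, m); err != nil {
			t.node.ReportUnreachable(m.To)
		}
	}
}
```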
Fix
Three commits, each independently meaningful:
- a38a7677 (→ rebased): close the orphan stream within ~13s of the
  first missed pong instead of waiting 15 minutes. Necessary but not
  sufficient: the loop still progresses at the slowest peer's pace.
- 8c4b761b (→ rebased): each peer gets a bounded send queue (256)
  and a dedicated worker. The Ready loop does a non-blocking
  EnqueueSend per message and finishes in microseconds regardless of
  peer health. Snapshot-on-reject and ReportUnreachable move into
  the worker, behaviour-equivalent.
- b9ac8086 (→ rebased): an atomic per-peer drop counter bumped on
  full-queue, with a 30s periodic logger that surfaces non-zero
  deltas. Healthy clusters stay log-silent; chronic saturation
  becomes visible in operator logs.
A characterization-test commit precedes the fixes, including the
asymmetric-silent-drop regression test that's red without the
fan-out and green with it.
Validation
- pkg/multicluster/leaderelection package passes in ~40s.
- TestLeaderSurvives_AsymmetricSilentDropOnOnePeer regression test
  confirms the leader keeps its role through a 15s window of
  asymmetric silent drop on one peer.
- hack/multicluster/repro/finding-01-dosend-stall.sh on a fresh
  3-node vind env: leader stayed stable for the full 180s injection
  window, zero new elections, zero "became follower" events.
Risks acknowledged
- EnqueueSend drops silently on a full queue; mitigated by the
  per-peer counter + periodic log (commit 4).
- sendTimeout = HeartbeatInterval bounds every RPC, not just
  heartbeats; large MsgApp catch-up bursts on a slow link could
  legitimately time out. Tradeoff accepted: an unbounded send is the
  bug we're fixing.
- No defer recover() in worker goroutines, so a panicking worker is
  fatal. Conscious choice: it surfaces programming bugs loudly
  instead of papering over them with a silent restart.