leaderelection: fix raft DoSend stall when one peer silently black-holes#1489

Draft
hidalgopl wants to merge 4 commits into main from
pb/fix-raft-dosend-keepalives

Conversation

@hidalgopl
Contributor

Summary

Fixes finding #1 from the multicluster failure-modes memo: a leader
in a 3-node StretchCluster could step down despite a healthy majority
when a single follower's network silently black-holed (TCP packets
dropped, no RST, no FIN). Reproduced live in vind by injecting 100%
packet loss with Pumba on the leader→follower_1 link; healthy
follower_2 was starved of heartbeats, CheckQuorum tripped, leader
stepped down.

Root cause

The raft Ready loop iterated rd.Messages serially and called the
gRPC DoSend synchronously. Without HTTP/2 keepalives, a
silently-dropped TCP stream stayed "open" until Linux's tcp_retries2
ceiling (~15 min). Each DoSend to the broken peer therefore blocked
on the orphan stream, and every other peer's send waited behind it
— so the healthy follower never received heartbeats inside one
ElectionTimeout.
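
For reference, the old loop had roughly this shape (an illustrative
sketch — node and transport stand in for the actual fields; DoSend,
rd.Messages and ReportUnreachable are the real names):

    for rd := range node.Ready() {
        for _, m := range rd.Messages {
            // Synchronous gRPC call. With no keepalive, a black-holed
            // peer can hold this for up to tcp_retries2 (~15 min), and
            // every later message in rd.Messages waits behind it.
            if err := transport.DoSend(ctx, m); err != nil {
                node.ReportUnreachable(m.To)
            }
        }
        node.Advance()
    }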

Fix

Three commits, each independently meaningful:

  1. HTTP/2 keepalives on peer gRPC connections (a38a7677 →
    rebased) — close the orphan stream within ~13s of the link going
    dark (10s ping interval + 3s pong deadline) instead of after 15
    minutes. Necessary but not sufficient: the loop still progresses
    at the slowest peer's pace.
  2. Per-peer fan-out (8c4b761b → rebased) — each peer gets a
    bounded send queue (256) and a dedicated worker. The Ready loop
    does a non-blocking EnqueueSend per message (sketched after this
    list) and finishes in microseconds regardless of peer health.
    Snapshot-on-reject and ReportUnreachable move into the worker,
    behaviour-equivalent.
  3. Drop observability (b9ac8086 → rebased) — atomic per-peer
    drop counter bumped on full-queue, with a 30s periodic logger
    that surfaces non-zero deltas. Healthy clusters stay log-silent;
    chronic saturation becomes visible in operator logs.
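
For concreteness, the fix-2 hot path looks roughly like this (a
sketch — EnqueueSend, the 256-cap queue and the drop counter are from
the commits; the type and field names, and the raftpb import path, are
assumptions):

    import (
        "sync/atomic"

        "go.etcd.io/raft/v3/raftpb" // illustrative; the repo's raftpb applies
    )

    type peer struct {
        sendq     chan raftpb.Message // bounded, cap 256
        dropCount atomic.Uint64       // bumped on full-queue drop (fix 3)
    }

    type transport struct {
        peers map[uint64]*peer
    }

    // EnqueueSend is the non-blocking call the Ready loop makes per
    // message; it returns immediately regardless of peer health.
    func (t *transport) EnqueueSend(m raftpb.Message) {
        p, ok := t.peers[m.To]
        if !ok {
            return
        }
        select {
        case p.sendq <- m:
        default:
            // Queue full: drop and count, never block. Raft
            // retransmits on the next tick.
            p.dropCount.Add(1)
        }
    }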

A characterization-test commit precedes the three fixes; it includes
the asymmetric-silent-drop regression test, which is red without the
fan-out and green with it.

Validation

  • Full pkg/multicluster/leaderelection package passes in ~40s.
  • The TestLeaderSurvives_AsymmetricSilentDropOnOnePeer regression
    test confirms the leader keeps its role through a 15s window of
    asymmetric silent drop on one peer.
  • Live re-run of hack/multicluster/repro/finding-01-dosend-stall.sh
    on a fresh 3-node vind env: leader stayed stable for the full 180s
    injection window, zero new elections, zero "became follower"
    events.

Risks acknowledged

  • Full-queue drops are silent on the hot path. Mitigated by the
    per-peer counter + periodic log (fix 3 above, the fourth commit
    in the series).
  • sendTimeout = HeartbeatInterval bounds every RPC, not just
    heartbeats; large MsgApp catch-up bursts on a slow link could
    legitimately time out. Tradeoff accepted — the unbounded send is
    exactly the bug we're fixing, and raft retransmits anything that
    times out.
  • No defer recover() in worker goroutines: an unrecovered panic in
    a worker takes down the whole process, not just one peer's send
    path. Conscious choice — it surfaces programming bugs loudly
    instead of papering over them.

hidalgopl and others added 3 commits April 28, 2026 11:06
Document current higher-level raft behavior under network conditions
so regressions in the transport layer can be caught at the state-
machine level, not just at the DoSend unit-test level. These tests
complement the integration harness that already exists for
snapshot/rejoin flows.

Adds a BlockIngress hook on TestHooks that chaos tests flip to make
incoming Send/Check RPCs hang until the caller's context cancels —
simulating a silently unresponsive peer without tearing down the TCP
listener (which would produce immediate connection-refused errors,
a different failure mode from what we want to cover).
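
The hook itself is small; roughly (TestHooks and BlockIngress are this
commit's names; the handler-side guard and its name are illustrative):

    type TestHooks struct {
        BlockIngress atomic.Bool // chaos tests flip this on
    }

    // Called at the top of the Send/Check RPC handlers: while the hook
    // is set, hang until the caller's context cancels — the listener
    // stays up, the peer just never answers.
    func (h *TestHooks) gate(ctx context.Context) error {
        if h.BlockIngress.Load() {
            <-ctx.Done()
            return ctx.Err()
        }
        return nil
    }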

Three tests land in a new chaos_test.go:

  TestLeaderRemainsStable_WhenAllPeersHealthy
    Happy-path regression control: once elected, the leader must not
    change in the absence of faults.

  TestLeaderSteppedDown_WhenMinorityIsolated
    Negative control: raft must relinquish leadership when the leader
    cannot reach a majority. Stopping both followers triggers
    CheckQuorum as expected.

  TestLeaderSurvives_AsymmetricSilentDropOnOnePeer
    Characterises finding #1 in the resilience memo — a silent
    black-hole on one peer causes the leader to step down today
    despite a healthy majority. Left in place, t.Skip'd with a
    pointer to the fix, so it will automatically start passing once
    the per-peer fan-out work lands in the raft transport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onnections

Without an explicit keepalive policy, a silently black-holed TCP link
(packets dropped with no RST — the realistic cloud-zone outage case)
keeps the underlying HTTP/2 stream "open" until the Linux TCP
retransmit ceiling fires (tcp_retries2 ≈ 15 minutes by default). During
that window every DoSend from the raft Ready loop blocks on the orphan
stream, and CheckQuorum can step the leader down even though a
majority of peers are healthy.

Add baseline keepalive.ClientParameters to every peer dial:

    Time:                10 * time.Second  (idle-ping interval)
    Timeout:             3  * time.Second  (pong deadline)
    PermitWithoutStream: true              (ping even when idle)

With these, a black-holed connection closes within ~13s of the link
going dark (10s until the first keepalive ping, 3s pong deadline) and
the next DoSend attempt dials fresh. This is the minimum necessary fix
for finding #1 in the resilience memo; the structural fan-out work that
follows removes Ready-loop head-of-line blocking entirely but benefits
from keepalive as defense in depth.

No behaviour change for healthy connections; no public API change.
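
In code, every peer dial gains roughly (the keepalive values are this
commit's; the variable and option wiring are illustrative):

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    var peerDialOpts = []grpc.DialOption{
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time:                10 * time.Second, // ping after 10s of inactivity
            Timeout:             3 * time.Second,  // close the conn 3s after a missed pong
            PermitWithoutStream: true,             // keep pinging even with no active stream
        }),
    }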

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The raft Ready loop iterated messages serially and called DoSend
synchronously. One slow or silently black-holed peer blocked the
caller's goroutine, preventing heartbeats to every other peer —
even with keepalives and a per-RPC timeout in place, the loop still
progressed at the pace of the slowest peer. Under the memo's
finding-#1 scenario (100% packet loss on leader→follower_1, all
other links healthy), the healthy follower was starved of heartbeats
and CheckQuorum stepped the leader down despite a healthy majority.

Decouple per-peer sends from the Ready loop:

  - Each peer gets a bounded send queue (chan raftpb.Message, 256)
    and a dedicated worker goroutine (runPeerSender, sketched below)
    that drains it.
  - EnqueueSend is the new API the Ready loop uses: non-blocking
    enqueue; on full queue the message is dropped and raft retransmits
    on the next tick.
  - The snapshot-on-reject fallback and ReportUnreachable call that
    used to run inline in the Ready loop now run in the worker. No
    behaviour change to those mechanisms; they just stop blocking the
    main goroutine.
  - DoSend keeps its existing synchronous semantics (now wrapped in a
    per-RPC sendTimeout derived from HeartbeatInterval) and is the
    worker's building block. External callers continue to have a
    sync-send API if they need one.
  - Workers start at transport.Run() and exit cleanly on ctx cancel.
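
The worker is now the only place a send can block; sketched
(runPeerSender, DoSend and sendTimeout are named above; the peer
fields and the node field are illustrative):

    func (t *transport) runPeerSender(ctx context.Context, p *peer) {
        for {
            select {
            case <-ctx.Done():
                return // clean exit when transport.Run's ctx cancels
            case m := <-p.sendq:
                // Every RPC is bounded by sendTimeout (derived from
                // HeartbeatInterval): a black-holed peer costs one
                // timeout per message, never a tcp_retries2 window.
                sctx, cancel := context.WithTimeout(ctx, t.sendTimeout)
                err := t.DoSend(sctx, m)
                cancel()
                if err != nil {
                    // Snapshot-on-reject and ReportUnreachable run
                    // here, off the Ready loop's goroutine.
                    t.node.ReportUnreachable(m.To)
                }
            }
        }
    }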

The finding-#1 chaos test (TestLeaderSurvives_AsymmetricSilentDropOnOnePeer)
moves from t.Skip to a full assertion: the leader keeps its role and
the healthy follower does not start an election throughout a 15s
window of asymmetric silent drop. Full leaderelection package tests
pass in ~40s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fan-out fix (8c4b761b) made the raft Ready loop non-blocking by
moving each peer's send onto a bounded channel (peerSendQueueSize=256)
served by a dedicated worker goroutine. EnqueueSend silently drops
when that queue is full, on the assumption that raft will retransmit
on the next heartbeat tick — which it does for heartbeats, and
eventually for MsgApp via the leader's progress tracker.

That assumption is correct, but in production it leaves us blind: a
chronically saturated peer (slow link, struggling worker, GC pause)
can sit indefinitely with full queues and we'd see only the
downstream symptoms (slow commits, occasional ReportUnreachable
churn) without knowing that the queue itself is the bottleneck. The
fan-out fix made the failure mode quieter, not louder.

Add minimal observability without compromising the hot path:

  - peer.dropCount (atomic.Uint64) is bumped from EnqueueSend's
    full-queue branch. Atomic so the producer side stays lock-free;
    the drop path costs a single atomic add.
  - runDropLogger (sketched below) is a single goroutine started from
    Run() that wakes every 30s, computes per-peer deltas against the previous
    sample, and emits one Infof per peer with non-zero delta. A
    healthy cluster never drops, so the goroutine is log-silent in
    the steady state.
  - 30s is short enough that a sustained-but-bursty saturation
    surfaces within one operator log scrape window, but long enough
    to avoid coupling log volume to drop volume on a misbehaving
    peer.
  - The logger is per-transport and exits on ctx.Done() like the
    per-peer workers.
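
Roughly (runDropLogger and the 30s cadence are this commit's; the
logger field and message format are illustrative):

    func (t *transport) runDropLogger(ctx context.Context) {
        prev := make(map[uint64]uint64, len(t.peers))
        tick := time.NewTicker(30 * time.Second)
        defer tick.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-tick.C:
                for id, p := range t.peers {
                    cur := p.dropCount.Load()
                    if delta := cur - prev[id]; delta > 0 {
                        // Only non-zero deltas are logged, so a healthy
                        // cluster stays log-silent.
                        t.logger.Infof("dropped %d raft messages to peer %d in the last 30s (send queue full)", delta, id)
                    }
                    prev[id] = cur
                }
            }
        }
    }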

Deliberately not added: panic recovery in the worker. A panicking
worker is a programming bug we want surfaced loudly, not papered
over with a restart loop that hides the original stack.

Tests: full leaderelection package passes in ~40s (same as before;
the new goroutine is idle on healthy paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hidalgopl force-pushed the pb/fix-raft-dosend-keepalives branch from 0f855ac to ec9c830 on April 28, 2026 at 11:56