fix: don't block all partitions while one is initializing #118

Merged
novatechflow merged 2 commits into KafScale:main from klaudworks:fix/logmu-head-of-line-blocking on Feb 28, 2026

@klaudworks commented Feb 28, 2026

Problem

After a broker restart, partition state is rebuilt from S3 on first access. getPartitionLog holds a single global lock for the whole rebuild: listing segments and downloading footers and indexes. If a partition has many segments, this can take seconds, and during that time every produce, fetch, and list-offsets request on the entire broker is blocked, even for unrelated topics.

We hit this during benchmarking: producing to a brand new empty topic hung because another partition was rebuilding its index from ~300k S3 segments.

Fix

  • Use a read lock for the common case (partition already initialized) so requests don't block each other
  • Move all S3 and etcd I/O outside the lock
  • Use singleflight to make sure only one goroutine initializes a given partition, without blocking other partitions from initializing in parallel
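The three points above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `broker`, `partitionLog`, `call`, `restore`, `newBroker`, and `demo` are hypothetical names, and the small `inflight` map is a standard-library stand-in for `singleflight.Group` from golang.org/x/sync so that the example runs without external modules.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// partitionLog stands in for the per-partition state rebuilt from S3.
type partitionLog struct{ segments int }

// call tracks one in-flight initialization that later callers can wait on.
type call struct {
	done chan struct{}
	log  *partitionLog
	err  error
}

// broker is a hypothetical type; the field names are illustrative only.
type broker struct {
	mu       sync.RWMutex
	logs     map[string]*partitionLog
	inflight map[string]*call
	restore  func(key string) (*partitionLog, error) // slow S3/etcd I/O
}

func newBroker(restore func(string) (*partitionLog, error)) *broker {
	return &broker{logs: map[string]*partitionLog{}, inflight: map[string]*call{}, restore: restore}
}

func (b *broker) getPartitionLog(key string) (*partitionLog, error) {
	// Fast path: initialized partitions only need the shared read lock.
	b.mu.RLock()
	pl, ok := b.logs[key]
	b.mu.RUnlock()
	if ok {
		return pl, nil
	}
	// Slow path: briefly take the write lock to start or join an initialization.
	b.mu.Lock()
	if pl, ok := b.logs[key]; ok {
		b.mu.Unlock()
		return pl, nil
	}
	if c, ok := b.inflight[key]; ok {
		b.mu.Unlock()
		<-c.done // another goroutine is already restoring this partition
		return c.log, c.err
	}
	c := &call{done: make(chan struct{})}
	b.inflight[key] = c
	b.mu.Unlock()

	// All slow I/O runs with no lock held, so other partitions are unaffected.
	c.log, c.err = b.restore(key)

	b.mu.Lock()
	if c.err == nil {
		b.logs[key] = c.log
	}
	delete(b.inflight, key)
	b.mu.Unlock()
	close(c.done)
	return c.log, c.err
}

// demo requests a slow partition from four goroutines and, while that restore
// is still blocked, requests a fresh partition. It returns the restore counts.
func demo() (int64, int64) {
	started, release := make(chan struct{}), make(chan struct{})
	var big, fresh int64
	b := newBroker(func(key string) (*partitionLog, error) {
		if key == "big-0" {
			atomic.AddInt64(&big, 1)
			close(started)
			<-release // simulate a long rebuild
			return &partitionLog{segments: 300000}, nil
		}
		atomic.AddInt64(&fresh, 1)
		return &partitionLog{}, nil
	})
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.getPartitionLog("big-0")
		}()
	}
	<-started
	b.getPartitionLog("new-0") // returns even though big-0 is still restoring
	close(release)
	wg.Wait()
	return atomic.LoadInt64(&big), atomic.LoadInt64(&fresh)
}

func main() {
	big, fresh := demo()
	fmt.Println(big, fresh) // prints "1 1"
}
```

With a single exclusive lock held across `restore`, the `new-0` request in `demo` would never return, because the slow restore is only released after it. In real code, `singleflight.Group.Do(key, fn)` would likely replace the hand-rolled `inflight` map.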

@klaudworks force-pushed the fix/logmu-head-of-line-blocking branch from 85656c7 to c9ef3d7 on February 28, 2026 09:46
@klaudworks changed the title from "fix: eliminate head-of-line blocking in getPartitionLog" to "fix: don't block all partitions while one is initializing" on Feb 28, 2026
logMu was held as an exclusive lock across etcd and S3 I/O during
partition initialization. A single slow RestoreFromS3 (1+2N S3
round-trips for N segments) blocked all produce, fetch, and
list-offsets requests broker-wide.

Replace sync.Mutex with sync.RWMutex so the fast path (partition
already initialized) uses a shared read lock. Move all I/O outside
the lock and use singleflight.Group to deduplicate concurrent
initialization per partition without blocking other partitions.
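
To give a feel for the 1+2N figure in the commit message, here is a small back-of-the-envelope sketch. The function names are hypothetical, and the 10 ms per-request latency and strictly sequential execution are assumptions for illustration only, not measurements from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// restoreRoundTrips returns the 1+2N count quoted in the commit message
// for a partition with N segments.
func restoreRoundTrips(segments int) int {
	return 1 + 2*segments
}

// sequentialEstimate assumes every round-trip runs back to back with a fixed
// latency. Both assumptions are illustrative, not taken from the PR.
func sequentialEstimate(segments int, perRequest time.Duration) time.Duration {
	return time.Duration(restoreRoundTrips(segments)) * perRequest
}

func main() {
	fmt.Println(restoreRoundTrips(300_000)) // prints 600001
	fmt.Println(sequentialEstimate(300_000, 10*time.Millisecond))
}
```

For the ~300k-segment partition mentioned in the description, that is about 600k round-trips during which the old exclusive lock would have been held.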
@klaudworks force-pushed the fix/logmu-head-of-line-blocking branch from c9ef3d7 to cc9d7f6 on February 28, 2026 09:51
@novatechflow left a comment

smart! Thanks again @klaudworks

@novatechflow merged commit 93c4d83 into KafScale:main on Feb 28, 2026
4 checks passed