designs/warmreplicas.md (131 additions, 0 deletions)
Add support for warm replicas
===================

## Motivation
Controllers reconcile all objects during startup / leader election failover to account for changes
in the reconciliation logic. For certain sources, the initial cache sync can take a significant
amount of time, on the order of minutes. This is problematic because, by default, controllers (and
by extension their watches) do not start until they have won leader election. This implies
guaranteed downtime: even after winning leader election, the controller has to wait for the initial
list to be served before it can start reconciling.

## Proposal
Member:

What if there's a lot of churn and the controller doesn't become leader, maybe not at all or only after a few days?

The queue length will increase while there is nothing that takes items out of the queue.

I know the queue doesn't require significant memory to store an item, but is there something we should be concerned about if the queue has, e.g., millions of items (let's say we watch Pods and we don't become leader for a month)?

Member:

Worst case, it at some point gets OOM-killed, restarts, and does everything again. I don't think this is likely to become an actual issue.

Member:

Yeah definitely not a big problem. Maybe just good to know

Contributor:

Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.

In that case, when you perform the initial list, it adds every item to the queue, and then every object created afterwards would be added to the queue, but the queue length would always be equal to the number of those objects in the cluster.

We should double-check that this is actually how the queues work.

Member (@sbueringer, May 12, 2025):

> Doesn't the queue have deduplication built in? I thought that you couldn't re-add a key to the queue if the key already existed within the queue.

Yes, there is de-duplication. We are basically talking about a map :)

> the queue length would always be equal to the number of those objects in the cluster

I don't think that deleted objects are removed from our queue. So if you have churn (i.e. any objects are deleted), the queue length won't be equal to the number of objects in the cluster.

But the queue only stores namespace + name, so it doesn't require a lot of memory per object.
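
For illustration, a minimal sketch of the de-duplication behavior discussed above, using client-go's `workqueue` package directly (this is not the exact queue type controller-runtime wires up, just a demonstration of the map-like semantics):

```go
package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/util/workqueue"
)

func main() {
    q := workqueue.New()
    key := types.NamespacedName{Namespace: "default", Name: "my-pod"}

    // Adding the same key twice only stores it once: the workqueue
    // de-duplicates items that have not yet been handed out for processing.
    q.Add(key)
    q.Add(key)
    fmt.Println(q.Len()) // prints 1
}
```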

A warm replica is a replica whose queue is pre-filled by sources that are started even before
leader election is won, so that it is ready to start processing items as soon as it becomes the
leader. This proposal aims to add support for warm replicas in controller-runtime.

### Context
Mostly written to confirm my understanding, but also to provide context for the proposal.

Controllers are a monolithic runnable with a `Start(ctx)` method that
1. starts the watches [here](https://github.com/kubernetes-sigs/controller-runtime/blob/v0.20.2/pkg/internal/controller/controller.go#L196-L213), and
2. starts the workers [here](https://github.com/kubernetes-sigs/controller-runtime/blob/v0.20.2/pkg/internal/controller/controller.go#L244-L252).

There needs to be a way to decouple the two so that the watches can be started before the workers,
even as part of the same Runnable.

If a runnable implements the `LeaderElectionRunnable` interface, the return value of its
`NeedLeaderElection` method dictates whether it gets binned into the leader election
runnables group ([code](https://github.com/kubernetes-sigs/controller-runtime/blob/v0.20.2/pkg/manager/runnable_group.go)).

Runnables in the leader election group are started only after the manager has won leader election,
and all controllers are leader election runnables by default.
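
For example, a minimal sketch of a runnable that opts out of leader election (the `metricsPusher` type is purely illustrative):

```go
// metricsPusher is a hypothetical runnable that should run on every replica,
// leader or not.
type metricsPusher struct{}

// Start blocks until the manager shuts down.
func (metricsPusher) Start(ctx context.Context) error {
    <-ctx.Done()
    return nil
}

// NeedLeaderElection returning false places this runnable in the "Others"
// group, so the manager starts it immediately instead of waiting until
// leader election is won.
func (metricsPusher) NeedLeaderElection() bool { return false }
```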

### Design
1. Add a new interface `WarmupRunnable` that allows leader election runnables to specify
behavior for when the manager is not the leader. This interface should be as follows:
```go
// WarmupRunnable knows how to perform non-leader work, e.g. starting event
// sources so that the cache and the queue are warmed up before leadership is won.
type WarmupRunnable interface {
    Warmup(context.Context) error
}
```

2. Controllers will implement this interface to specify behavior when the manager is not the leader.
Add a new controller option `ShouldWarmupWithoutLeadership`. If set to true, the main
controller runnable will not start its sources and will instead rely on the warmup runnable to
start them. The option will be used as follows:
```go
type Controller struct {
    // ...

    // ShouldWarmupWithoutLeadership specifies whether the controller should start its sources
    // when the manager is not the leader.
    // Defaults to false, which means that the controller waits until leader election is won
    // before starting its sources.
    ShouldWarmupWithoutLeadership *bool
```
Contributor:

Why *bool over bool?

Member:

We usually like to use *bool so we can differentiate between default value and whatever a user configured.

This is also necessary because we have two levels of configuration ("manager" and per controller), and we have to be able to figure out if the per-controller config overrides the manager config.


```go
    // ...
}

type runnableWrapper struct {
    startFunc func(ctx context.Context) error
}

func (rw runnableWrapper) Start(ctx context.Context) error {
    return rw.startFunc(ctx)
}

// Warmup starts the event sources if the controller is configured to warm up
// without leadership; otherwise it is a no-op.
func (c *Controller[request]) Warmup(ctx context.Context) error {
    if c.ShouldWarmupWithoutLeadership == nil || !*c.ShouldWarmupWithoutLeadership {
        return nil
    }

    return c.startEventSources(ctx)
}
```
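
A hypothetical usage sketch from the builder side; the `ShouldWarmupWithoutLeadership` field is the option proposed here and not part of the released API:

```go
err := ctrl.NewControllerManagedBy(mgr).
    For(&corev1.Pod{}).
    WithOptions(controller.Options{
        // Proposed option: start sources (and fill the queue) even while
        // this replica is not the leader.
        ShouldWarmupWithoutLeadership: ptr.To(true),
    }).
    Complete(reconciler)
```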

3. Add a separate runnable category for warmup runnables to specify behavior when the
manager is not the leader ([ref](https://github.com/kubernetes-sigs/controller-runtime/blob/v0.20.2/pkg/manager/runnable_group.go#L55-L76)).
```go
type runnables struct {
    // ...

    LeaderElection *runnableGroup
    Warmup         *runnableGroup

    // ...
}

func (r *runnables) Add(fn Runnable) error {
    switch runnable := fn.(type) {
    // ...
    case WarmupRunnable:
        if err := r.Warmup.Add(RunnableFunc(runnable.Warmup), nil); err != nil {
            return err
        }
        // A runnable that implements both the WarmupRunnable and LeaderElectionRunnable
        // interfaces is added to both groups. Go does not allow fallthrough in a type
        // switch, so the leader election handling is repeated here.
        if le, ok := fn.(LeaderElectionRunnable); ok && !le.NeedLeaderElection() {
            return r.Others.Add(fn, nil)
        }
        return r.LeaderElection.Add(fn, nil)
    case LeaderElectionRunnable:
        if !runnable.NeedLeaderElection() {
            return r.Others.Add(fn, nil)
        }
        return r.LeaderElection.Add(fn, nil)
    // ...
    }
}
```

4. Start the warmup (non-leader) runnables during manager startup, before leader election.
```go
func (cm *controllerManager) Start(ctx context.Context) (err error) {
    // ...

    // Start the warmup runnables. These run on every replica, regardless of
    // leader election.
    if err := cm.runnables.Warmup.Start(cm.internalCtx); err != nil {
        return fmt.Errorf("failed to start warmup runnables: %w", err)
    }

    // ...
}
```

## Concerns/Questions
1. Controllers opted into this feature will break the workqueue depth metric, as the controller will
have a pre-filled queue before it starts processing items.
Member:

As discussed offline, one way to avoid this is to use a metrics wrapper that suppresses them until leader election is won. But not sure if it's worth bothering.
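
A rough sketch of what such a wrapper could look like for the depth gauge; `suppressingGauge` and the `isLeader` callback are illustrative names, not existing API:

```go
// suppressingGauge drops depth updates until this replica has become the leader.
type suppressingGauge struct {
    workqueue.GaugeMetric
    isLeader func() bool
}

func (g suppressingGauge) Inc() {
    if g.isLeader() {
        g.GaugeMetric.Inc()
    }
}

func (g suppressingGauge) Dec() {
    if g.isLeader() {
        g.GaugeMetric.Dec()
    }
}
```

A real implementation would also have to set the gauge to the actual queue length once leadership is won, which is part of why it may not be worth bothering.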

Member:

Does this actually break the metric? Sounds like the metric will just show the reality.

It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.

Member:

> It might break alerts that assume the queue length should be pretty low, but that's an issue of the alerts.

Not sure I quite agree with that. The alerts work today; if we change the behavior here, we break them. To my knowledge, there also isn't a metric that indicates whether a given replica is the leader, so I don't even see a good way to unbreak them.

Member:

Yeah, but the current definition of the metric is that it should show the length of the queue

Member:

> To my knowledge there also isn't a metric that indicates if a given replica is the leader

That could be solved, I hope :)

Member:

- logs via logState
- other workqueue metrics: adds_total, queue_duration_seconds
  - Although I guess we can also fake these. What would happen when the controller starts? I assume we would set the length metric immediately to its correct value. Similar for adds_total and probably also queue_duration_seconds.

I also think folks have programmatic access to the queue (at least by instantiating the queue in controller.Options.NewQueue).

So we don't know what kind of things folks are doing with the queue, e.g. accessing the queue length or taking items out of the queue even if the leader election controllers are not running.

Contributor Author:

> Then we maybe shouldn't store the items in the queue at this time, because that's observable behavior as well (not only through the metric), and not just make it look like the queue is empty through the metric

That makes sense to me. Reading between the lines, I guess the suggestion is that non-leader runnables shouldn't populate the queue at all. The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate the queue from the cache on winning leader election"? I am not too familiar with the details here, so I would appreciate some concrete nudges in the right direction.

Member:

> The specific behavior I want to solve for is to have the queue ready to go ASAP after leader election, so would we implement something like "populate the queue from the cache on winning leader election"?

I don't know how that would be possible, we would need to implement a dummy queue to buffer.

Let's punt on this problem until we make this the default behavior; we don't have to solve it right away. As long as it's opt-in and there is a metric indicating whether a given replica is the leader, it won't break anyone by default, and people can adjust their alerts accordingly.

Member:

Sounds good!

Member:

Thought more about this. Probably the initial suggestion is fine. We should just make sure that the metrics popping up once the controller becomes leader make sense.

Contributor:

The pre-filled queue can conflict with the priority queue feature. When the priority queue is enabled, a controller that has just restarted will initially be filled with low-priority items. However, while running as a non-leader, events for those items will reset their priority to 0. As a result, once this instance becomes the leader, the benefits of the priority queue are diminished and may even completely disappear.

Member:

> However, while running as a non-leader, events for those items will reset their priority to 0

I assume that's because these objects get updated sooner or later, and a regular update event raises the priority to 0?

Contributor:

Yes, that's right.

Member (@sbueringer, Aug 22, 2025):

cc @alvaroaleman (I think that is expected behavior as of now, but still worth thinking about)

Member:

It is a good point and I agree this is a problem, but it is not clear to me how we could solve it. Have a wrapper around the priority queue that reduces the priority of all adds to lowPriority until the current replica is leader?

Member (@sbueringer, Aug 25, 2025):

Yeah maybe something like that. Or change all items in all queues to low priority when a replica becomes leader.

Your option is probably less racy.

I guess in general we would like to overwrite the priority with low priority even if a custom handler wanted to set a different priority?
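
A rough sketch of the wrapper idea, assuming the `priorityqueue` package as of v0.20; `lowPriority` and the `isLeader` callback are assumptions for illustration:

```go
// warmupQueue forces every add to low priority until this replica is the
// leader, so that items added during warmup cannot end up with priority 0.
type warmupQueue[T comparable] struct {
    priorityqueue.PriorityQueue[T]
    isLeader func() bool
}

func (q warmupQueue[T]) AddWithOpts(o priorityqueue.AddOpts, items ...T) {
    if !q.isLeader() {
        o.Priority = lowPriority // assumed constant, e.g. -100
    }
    q.PriorityQueue.AddWithOpts(o, items...)
}
```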

2. Ideally, non-leader runnables should block readyz and healthz checks until they are in sync. I am
Member:

This will break conversion webhooks. I don't know if there is a good way to figure out whether the binary contains a conversion webhook, but if in doubt we have to retain the current behavior.

Contributor Author:

Can you explain why this breaks conversion webhooks? Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?

Member:

> Can you explain why this breaks conversion webhooks?

Because they need to be up to sync the cache, so blocking readyz until the cache is ready creates a deadlock.

> Taking a step back, why does controller-runtime offer a way to run conversion webhooks in the first place?

Because they are part of some controllers.

Contributor Author:

Got it, we can just put the burden on the user to register readyz checks as they want, then.

Member:

Potentially we can provide a util func, like we did with `mgr.GetWebhookServer().StartedChecker()`, to make it easier (can be done in a follow-up PR).

But I agree we can't add this by default, as it can lead to deadlocks for controllers that serve conversion webhooks.
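
A sketch of what such a user-registered check could look like today, using the existing `AddReadyzCheck` and cache APIs (check name and timeout are arbitrary):

```go
// Only report ready once the shared informer caches have synced. Controllers
// that serve conversion webhooks should not register this check, as it would
// deadlock with the cache sync.
if err := mgr.AddReadyzCheck("cache-sync", func(req *http.Request) error {
    ctx, cancel := context.WithTimeout(req.Context(), 1*time.Second)
    defer cancel()
    if !mgr.GetCache().WaitForCacheSync(ctx) {
        return errors.New("informer caches have not synced yet")
    }
    return nil
}); err != nil {
    return err
}
```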

Contributor:

At least in #3192, this has not yet been fully implemented.
`cm.runnables.Warmup.Start()` launches a goroutine for each runnable to invoke `runnable.Start()` and then returns once all known runnables have been started. After that, the `leaderElector` is started. In other words, at present there is no guaranteed ordering between leader election and the completion of cache synchronization.

Contributor (@zach593, Aug 22, 2025):

The key point is that the `ready` check parameter is not currently passed in:

```go
if err := r.Warmup.Add(RunnableFunc(warmupRunnable.Warmup), nil); err != nil {
```

If the check only passed once the warmup is completed, and `cm.runnables.Warmup.Start()` blocked on it, then the idea of blocking leader election would be feasible:

```go
if rn.Check(r.ctx) {
    if rn.signalReady {
        r.startReadyCh <- rn
    }
}
```

```go
// Wait for all runnables to signal.
for {
    select {
    case <-ctx.Done():
        if err := ctx.Err(); !errors.Is(err, context.Canceled) {
            retErr = err
        }
    case rn := <-r.startReadyCh:
        for i, existing := range r.startQueue {
            if existing == rn {
                // Remove the item from the start queue.
                r.startQueue = append(r.startQueue[:i], r.startQueue[i+1:]...)
                break
            }
        }
        // We're done waiting if the queue is empty, return.
        if len(r.startQueue) == 0 {
            return
        }
    }
}
```

Member:

Got it, I think you're right

Member:

> After that, the leaderElector is started. In other words, at present, there is no guaranteed ordering between leader election and the completion of cache synchronization.

Yes, this is correct, but I think your suggestion of a readiness check means that we won't even start non-leader-election runnables before the warmups have finished, which I think is undesirable. Not starting leader election until warmup has finished seems fine, but it might slow things down in case there was no passive replica, because acquiring the leader lease takes a bit of time.

The other thing this makes me realize is that in order to have rollouts where the cache sync is not part of the downtime, we would need to integrate the warmup with a healthcheck, such that the healthcheck only passes once warmup has completed. Otherwise, what could happen is:

  1. new pod n1 gets created
  2. old pod o1 gets removed and gives up the leader lease
  3. the leader lease gets acquired by old pod o2, as n1 hasn't finished syncing its cache
  4. new pod n2 gets created
  5. old pod o2 gets shut down; if neither n1 nor n2 has synced yet, we will again have some cache-sync-induced downtime

The issue, however, is that delaying the healthcheck until syncing has finished poses a problem for anyone who uses conversion webhooks, as conversion webhooks may be required for the sync itself, causing a deadlock.

We could probably add a second knob to control whether the warmup should be part of the healthcheck, and then this can be used by anyone who doesn't have conversion webhooks?

Contributor:

> Not starting leader election until warmup has finished seems fine, but it might slow things down in case there was no passive replica, because acquiring the leader lease takes a bit of time.

I’m describing this requirement because, in our scenario, we need to strictly control downtime, and synchronizing the cache usually takes several minutes. Therefore, a delay of just a few seconds caused by leader election is entirely acceptable.


> the leader lease gets acquired by old pod o2, as n1 hasn't finished syncing its cache

It still seems to assume that n1 will only attempt to acquire the lease after it has finished syncing the cache.

Warmup, healthcheck, and leader election need to be combined together in order to eliminate downtime during rollouts. If only warmup and healthcheck are used, the following situation can occur:

  1. new pod n1 is created, completes cache synchronization, and becomes ready.
  2. old pod o1 is removed and gives up the leader lease.
  3. new pod n2 is created, its cache has not yet finished syncing, and it starts to participate in leader election.
  4. the new pod n2 becomes the leader, but its cache has not yet finished syncing, leading to downtime.

not sure what the best way of implementing this is, because we would have to add a healthz check
that blocks on WaitForSync for all the sources started as part of the non-leader runnables.
3. An alternative way of implementing the above is to move the source starting / management code
Member:

That is kind of already the case: the source uses the cache (which is a runnable) to get the actual informer, and the cache is shared and started before anything else except conversion webhooks (as conversion webhooks might be needed to start it). The problem is just that the cache does not actually cache anything for resources for which no informer was requested, and that in turn only happens after the source is started, which happens post controller start and thus post leader election.

out into its own runnables instead of having it as part of the controller runnable, and
exposing a method to fetch the sources. I am not convinced that this is the right change, as it
would introduce the problem of leader election runnables potentially blocking each other while
they wait for the sources to be in sync.