Add new kured metrics #1164

colinmcintosh · 2025-07-24T04:22:33Z

Add metrics:

kured_lock_annotation
kured_lock_held
kured_node_draining
kured_reboot_blocked
kured_reboot_window_active

This also reduces the interval at which metrics are collected from 60s to 15s (a 4x increase in collection frequency). This lowers the chance of missing a state change when the metrics collection frequency and the metrics scrape frequency are mismatched.


evrardjp · 2025-08-01T05:56:01Z

Good idea. Let's talk about it in the next community meeting.

localleon · 2025-08-25T12:35:17Z

This would solve some of my concerns in #1156. How does this approach handle conflicting values for the lock_annotation? Won't the potentially different update interval of each pod result in lock_annotation.node being reported differently by different pods? E.g. Node 1 takes the lock -> all pods scrape -> Node 2 takes the lock -> the first pod scrapes, but all other pods wait 2 secs for their next scrape -> the Prometheus metric is inconsistent?
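To make the scenario concrete, here is a small simulation. It is not kured code: the function names, the 15s interval, and all timings are assumptions picked for the example. Each pod refreshes its view of the lock on its own schedule, so two pods read at the same instant can name different holders.

```go
package main

import "fmt"

// lockHolderAt returns the true lock holder at time t (in seconds), assuming
// node2 takes the lock over from node1 at time handover.
func lockHolderAt(t, handover int) string {
	if t < handover {
		return "node1"
	}
	return "node2"
}

// reportedHolder returns what a pod exposes when read at time scrape: the
// holder as of its most recent collection tick. The pod ticks at offset,
// offset+interval, offset+2*interval, ... It returns "" before the first tick.
func reportedHolder(scrape, offset, interval, handover int) string {
	if scrape < offset {
		return ""
	}
	last := offset + ((scrape-offset)/interval)*interval
	return lockHolderAt(last, handover)
}

func main() {
	const interval, handover, scrape = 15, 31, 33

	// Two pods whose collection loops started 2 seconds apart.
	for _, offset := range []int{0, 2} {
		fmt.Printf("pod with offset %ds reports holder %q\n",
			offset, reportedHolder(scrape, offset, interval, handover))
	}
}
```

In this setup the pod with offset 0 last collected at t=30 (before the handover) and the pod with offset 2 last collected at t=32 (after it), so the two series disagree until the first pod's next tick.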

localleon · 2025-09-08T11:24:00Z

@evrardjp what would be needed to get this merged?

colinmcintosh · 2025-09-08T11:41:12Z

@localleon thanks for the reminder. Regarding the potential to have inconsistent metrics reported, I did consider that but I think the current implementation provides visibility for kured pods that aren't correctly tracking the lock holder. That should ideally never happen so I'm not convinced that this is the correct implementation. Absolutely open to alternatives if you have strong feelings about it.

@evrardjp happy to discuss in the next community meeting or an ad hoc call. We can coordinate in the CNCF Slack channel on it if you'd like.

localleon · 2025-10-16T14:51:05Z

@evrardjp hope you're doing well! Is there anything I could do so we could get this feature merged?

evrardjp · 2025-10-16T21:18:08Z

I am in the middle of the big rewrite of v2 (you can see the first steps in #1000).
V2 is also documented in our channel, but I guess I should put a project in GitHub ...

Anyway, long story short, most of those metrics won't make sense anymore.
If you're looking into observability, v2 will bring other methods to see what's going on.

Let's go through each metric:

kured_lock_annotation --> We won't need locks anymore: There will be a shared queue or a lease system between rebooters
kured_lock_held --> Same comment
kured_node_draining --> You will be able to observe this through node conditions
kured_reboot_blocked --> You will be able to observe this through node conditions
kured_reboot_window_active --> This is something I need to figure out with you. Would you rather have something on the node that says "reboot window is active for this node" (condition/metric/..), or would you prefer that maintenances have their own CRD, which reports such status? This was not decided in the v2 document.

Any opinion on this @localleon @colinmcintosh ?

localleon · 2025-10-17T12:01:00Z

Thanks @evrardjp for the detailed reply! I assumed PR #1000 was inactive and did not realize that there was a v2 in the works for this project! A GitHub Project board would probably be a good idea if you are looking for contributions!

The metrics look good! About kured_reboot_window_active:

I currently like the simplicity of kured and that it's not using CRDs. If possible, I would keep it this way!
A metric on each node that says the reboot window is active for this node would be great! From my current understanding, we can only configure reboot windows for all nodes at the same time. I think there are two options:
- Indicate per node with a 0/1 metric if it's currently able to accept a reboot command
- Make a global metric with 0/1 that indicates if all nodes are currently rebootable (this would make more sense since we do not have individual reboot windows)

github-actions · 2025-12-20T02:14:04Z

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

Add metrics: kured_lock_annotation, kured_lock_held, kured_node_drain…

c431fa2

…ing, kured_reboot_blocked, and kured_reboot_window_active. Signed-off-by: Colin McIntosh <colin@colinmcintosh.com>

evrardjp added the enhancement and FEATURE-v2 labels Aug 1, 2025

github-actions bot added the no-pr-activity label Dec 20, 2025

dharsanb added the keep label and removed the no-pr-activity label Dec 20, 2025
