@colinmcintosh

Add metrics:

  • kured_lock_annotation
  • kured_lock_held
  • kured_node_draining
  • kured_reboot_blocked
  • kured_reboot_window_active

This also reduces the interval at which metrics are collected from 60s to 15s (a 4x increase in collection frequency). This reduces the chance of lost metrics when the metrics collection frequency and the metrics scrape frequency are mismatched.

…ing, kured_reboot_blocked, and kured_reboot_window_active.

Signed-off-by: Colin McIntosh <colin@colinmcintosh.com>
@evrardjp
Collaborator

evrardjp commented Aug 1, 2025

Good idea. Let's talk about it in the next community meeting.

@evrardjp evrardjp added enhancement This was triaged as an enhancement FEATURE-v2 This is a feature improvement that needs to be taken into consideration for v2 labels Aug 1, 2025
@localleon

This would solve some of my concerns in #1156. How does this approach handle conflicting values for the lock_annotation? Won't the potentially different update interval of each pod result in lock_annotation.node being reported differently by all pods? E.g. Node 1 takes the lock -> all pods scrape -> Node 2 takes the lock -> first pod scrapes, but all other pods wait 2 secs for next scrape -> Prometheus metric is inconsistent?

@localleon

@evrardjp what would be needed to get this merged?

@colinmcintosh
Author

@localleon thanks for the reminder. Regarding the potential to have inconsistent metrics reported, I did consider that but I think the current implementation provides visibility for kured pods that aren't correctly tracking the lock holder. That should ideally never happen so I'm not convinced that this is the correct implementation. Absolutely open to alternatives if you have strong feelings about it.

@evrardjp happy to discuss in the next community meeting or an ad hoc call. We can coordinate on it in the CNCF Slack channel if you'd like.

@localleon

@evrardjp hope you're doing well! Is there anything I could do so we could get this feature merged?

@evrardjp
Collaborator

I am in the middle of the big rewrite of v2 (you can see the first steps in #1000).
V2 is also documented in our channel, but I guess I should put a project in GitHub ...

Anyway, long story short, most of those metrics won't make sense anymore.
If you're looking into observability, v2 will bring other methods to see what's going on.

Let's go through each metric:

  • kured_lock_annotation --> We won't need locks anymore: There will be a shared queue or a lease system between rebooters
  • kured_lock_held --> Same comment
  • kured_node_draining --> You will be able to observe this through node conditions
  • kured_reboot_blocked --> You will be able to observe this through node conditions
  • kured_reboot_window_active --> This is something I need to figure out with you. Would you rather have something on the node that says "reboot window is active for this node" (condition/metric/..) or would you prefer that maintenances have their own CRD, which report such status? This was not decided in the v2 document.

Any opinion on this @localleon @colinmcintosh ?

@localleon

Thanks @evrardjp for the detailed reply! I assumed the PR #1000 was inactive and did not realize that there was a v2 in the works for this project! A GitHub Project board would probably be a good idea if you are looking for contributions!

The metrics look good! About kured_reboot_window_active..

  • I currently like the simplicity of kured and that it's not using CRDs. I think, if possible, I would keep it this way!
  • A metric on each node that says the reboot window is active for this node would be great! From my current understanding, we can only configure reboot windows for all nodes at the same time. There are, I think, two options:
    • Indicate per node with a 0/1 metric if it's currently able to accept a reboot command
    • Make a global metric with 0/1 that indicates if all nodes are currently rebootable (this would make more sense since we do not have individual reboot windows)

@github-actions

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@dharsanb dharsanb added keep This won't be closed by the stale bot. and removed no-pr-activity labels Dec 20, 2025
