[WIP] Cleanup Part 2 - Rewrite #1000

evrardjp · 2024-10-18T18:52:04Z

This is the follow-up PR to #990.
It tackles deeper changes to reach a 2.0 with the rewrite.

What it does:

Cleans up the main goroutine code for easier maintenance
Removes the usage of logrus, as we can simply use slog
Makes the system more unit testable

Still to do (in another PR):

Ensure logs are out for all the loops.
Make the evacuator generic
Split into different components for v2
Take a stance on simplification (remove lock TTLs, annotate by default)

evrardjp · 2024-10-27T20:47:08Z

This will need a rebase and a reimplementation, but the idea is there. I will update this next week if I have time.

evrardjp · 2024-11-01T07:15:40Z

I need to reimplement this from scratch. Please ignore my changes from now on, until I mention that this is ready for review (and the [WIP] flag is removed).

github-actions · 2025-01-11T01:59:37Z

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

evrardjp · 2025-01-20T12:24:26Z

Should I bother reviving this, @jackfrancis?
This was the place for a large refactoring/cleanup of the code, to be able to replace some implementations with others.
The intent was to be able to use another notifier (replacing shoutrrr), or to implement a different lock mechanism.
With this in mind, I completely cleaned up the lock code, fixing a long-standing security point, but at the expense of breaking the current behaviour.

I am fine with resuming this work, but only if we'll eventually merge it. There is no point in me constantly rebasing this massive change if it doesn't get in.

github-actions · 2025-05-10T02:06:24Z

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

evrardjp · 2025-06-09T13:53:18Z

I will need to reopen this and work on it, after a conversation about our direction.

Without this patch, one metric could say "reboot is required" while the rebootAsRequired tick did not run (because of a long period, for example). This is a problem, as it leads to mismatched expectations: "Why did the system not reboot, while the metrics indicate a reboot was required?" This solves it by inlining the metrics management within the rebootAsRequired goroutine. Closes: kubereboot#725 Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

There is no reason to have all the internal tools (timewindows, taints, locks) as public interfaces. Should someone be crazy enough to rebuild their own kured, they would need to load our high-level tools: the checkers/blockers/rebooters. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

This patch simplifies the principal loop, rebootAsRequired:

- The previous patch, "Fix discrepency metrics <> rebootAsRequired", changed the behaviour of the metrics, ensuring the metrics and the checks for reboots are aligned. However, this meant that the metrics would not be correct for long stretches of time if the period is high. This patch therefore clarifies the purpose of the tick, to discourage users from setting large values, as we have seen in multiple questions in the past (people were missing that part of the documentation, and could find the name "period" a misnomer).
- With the ticking clarified, this was the opportunity to remove the randomization bit. It was very unlikely, even on large clusters, that calls would trigger a storm on the API server. Similarly, the delay was not bringing any value. By removing those two items, the code can be simplified further by removing the delaytick and its random seed.
- However, by ticking more frequently, we now make more calls to the API server, which might have an impact on large clusters. To mitigate that, I reshuffled the loop to quickly know whether a reboot is required at each tick, which helps reduce some of the calls. This can be further improved by sharing node/ds info across the calls in the tick to make the process more efficient.
- A new struct, GenericRebooter, was introduced to keep the data all the rebooters need. For example, here, it keeps the rebootDelay imposed on the reboot function. Each rebooter can implement its own reboot method, but to simplify, the generic rebooter implements a DelayReboot() method, which is then called from all the Reboot methods of the concrete types.
- Along the way, this commit also changed the RebootBlocked function to also return the list of the names (types) of the blockers blocking the reboot. This is useful for debugging, and can also be useful later to expose the content of the blockers in specific metrics.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
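The GenericRebooter arrangement described in this commit message could look roughly like the sketch below. Only the names GenericRebooter, DelayReboot() and Reboot() come from the commit message; CommandRebooter, its fields, and the injected run function are hypothetical and exist only so the example runs without rebooting anything.

```go
package main

import (
	"fmt"
	"time"
)

// GenericRebooter holds the data every rebooter needs (here, only the delay).
type GenericRebooter struct {
	RebootDelay time.Duration
}

// DelayReboot waits for the configured delay before a reboot is issued.
func (g GenericRebooter) DelayReboot() {
	if g.RebootDelay > 0 {
		time.Sleep(g.RebootDelay)
	}
}

// CommandRebooter is a hypothetical concrete type embedding the generic one.
type CommandRebooter struct {
	GenericRebooter
	Command []string
	// run is injected (an assumption of this sketch) so no real reboot happens.
	run func(cmd []string) error
}

// Reboot applies the shared delay, then performs the type-specific action.
func (c CommandRebooter) Reboot() error {
	c.DelayReboot()
	return c.run(c.Command)
}

func main() {
	r := CommandRebooter{
		GenericRebooter: GenericRebooter{RebootDelay: 10 * time.Millisecond},
		Command:         []string{"/bin/systemctl", "reboot"},
		run: func(cmd []string) error {
			fmt.Println("would run:", cmd)
			return nil
		},
	}
	if err := r.Reboot(); err != nil {
		fmt.Println("reboot failed:", err)
	}
}
```

Embedding keeps the delay logic in one place, so each concrete rebooter only has to call DelayReboot() at the top of its own Reboot method.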

Without this, there is no way to tell, by looking at the metrics, how many reboot attempts were made, or why they were blocked. This is a problem, as only logs are currently usable to expose the blockers. This fixes it by introducing a new vector holding as labels the node name and the reason for the block. For this, all the blockers have to implement another method of the RebootBlocker interface, namely MetricLabel(), to generate a string containing the block reason. To avoid high cardinality, this is basically a static name based on the struct. I chose not to use the Stringer interface, as I feel it would make more sense for the blocker data itself (including its specific details, for example the pod selector string, etc.). Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
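As an illustration of the MetricLabel() idea, here is a minimal stdlib-only sketch. RebootBlocker, MetricLabel() and RebootBlocked are named in the commit messages; IsBlocked(), the two blocker types, the label strings, and the map-based counter (standing in for a real metrics vector) are assumptions made for this example.

```go
package main

import "fmt"

// RebootBlocker mirrors the interface described above; the actual method set
// in kured may differ (IsBlocked is an assumption of this sketch).
type RebootBlocker interface {
	IsBlocked() bool
	// MetricLabel returns a static, low-cardinality name for the blocker type.
	MetricLabel() string
}

// podBlocker and alertBlocker are hypothetical blockers with fixed labels.
type podBlocker struct{ blocked bool }

func (p podBlocker) IsBlocked() bool     { return p.blocked }
func (p podBlocker) MetricLabel() string { return "blocker_pod" }

type alertBlocker struct{ blocked bool }

func (a alertBlocker) IsBlocked() bool     { return a.blocked }
func (a alertBlocker) MetricLabel() string { return "blocker_alert" }

// RebootBlocked reports whether any blocker is active and returns the labels
// of the active ones.
func RebootBlocked(blockers ...RebootBlocker) (bool, []string) {
	var reasons []string
	for _, b := range blockers {
		if b.IsBlocked() {
			reasons = append(reasons, b.MetricLabel())
		}
	}
	return len(reasons) > 0, reasons
}

// blockKey is the label set (node, reason) of the counter.
type blockKey struct{ Node, Reason string }

// blockedCounter stands in for a metrics vector labelled by node and reason.
type blockedCounter map[blockKey]int

// Observe increments the counter once per active block reason.
func (c blockedCounter) Observe(node string, reasons []string) {
	for _, r := range reasons {
		c[blockKey{Node: node, Reason: r}]++
	}
}

func main() {
	counter := blockedCounter{}
	blocked, reasons := RebootBlocked(podBlocker{blocked: true}, alertBlocker{})
	if blocked {
		counter.Observe("worker-1", reasons)
	}
	fmt.Println(blocked, reasons, counter)
}
```

Because the label is tied to the blocker type rather than to its data, the number of distinct label values stays bounded by the number of blocker implementations.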

Without this, it will not be possible to use unprivileged commands. This is a problem, as one might want to run a command without the nsenter to pid 1. This fixes it by exposing this to main; the only thing remaining is to use a boolean to expose the feature in a later commit. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
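To make the privileged/unprivileged distinction concrete, the sketch below builds a command line with or without an nsenter prefix. The function name buildCommand, its parameters, and the exact nsenter arguments are assumptions for illustration; the commit message only states that commands may be run without the nsenter to pid 1.

```go
package main

import "fmt"

// buildCommand returns the command to execute. When privileged is true, the
// command is wrapped so that it runs in the mount namespace of the given pid
// (pid 1 for the host); otherwise a copy of the command is returned unchanged.
// The nsenter path and arguments here are assumptions of this sketch.
func buildCommand(privileged bool, pid int, command []string) []string {
	if !privileged {
		return append([]string(nil), command...)
	}
	prefix := []string{"/usr/bin/nsenter", fmt.Sprintf("-m/proc/%d/ns/mnt", pid), "--"}
	return append(prefix, command...)
}

func main() {
	cmd := []string{"test", "-f", "/var/run/reboot-required"}
	fmt.Println(buildCommand(true, 1, cmd))
	fmt.Println(buildCommand(false, 1, cmd))
}
```

A single boolean at the call site is then enough to switch between the two modes, which matches the follow-up described in the commit message.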

delaytick is not used anymore. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party> # Conflicts: # internal/delaytick/delaytick.go # internal/validators.go

Without this, we are using logrus and the logs are inconsistent. This fixes it by moving to slog and reviewing all the logs. I added multiple comments to clarify the intent behind the logs. With structured logging, extra data, such as the node, is exposed whenever possible, which makes the logging more consistent. We are not yet using contexts for ticks; that should be done in a later commit. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
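For readers unfamiliar with slog, the following sketch shows one way the node name can be attached to every log record. It uses only the standard log/slog package; the function names, the message text, and the attribute keys ("node", "sentinel") are assumptions for this example, not the ones used in the PR.

```go
package main

import (
	"io"
	"log/slog"
	"os"
	"strings"
)

// newNodeLogger returns a text logger that attaches the node name to every
// record and drops records below the Info level.
func newNodeLogger(w io.Writer, nodeName string) *slog.Logger {
	handler := slog.NewTextHandler(w, &slog.HandlerOptions{Level: slog.LevelInfo})
	return slog.New(handler).With("node", nodeName)
}

// logRebootRequired emits one Info record and one Debug record; with the
// handler above, only the Info record is written.
func logRebootRequired(logger *slog.Logger, sentinel string) {
	logger.Info("reboot required", "sentinel", sentinel)
	logger.Debug("sentinel check details", "sentinel", sentinel)
}

// captureRebootLog renders the log output into a string, which is convenient
// in unit tests.
func captureRebootLog(nodeName, sentinel string) string {
	var sb strings.Builder
	logRebootRequired(newNodeLogger(&sb, nodeName), sentinel)
	return sb.String()
}

func main() {
	logRebootRequired(newNodeLogger(os.Stderr, "worker-1"), "/var/run/reboot-required")
}
```

Building the logger once with With(...) avoids repeating the node attribute at every call site, which is one way to get the consistency the commit message is after.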

The signal is named SIGRTMIN, not SIGTRMIN, and the misspelling can be confusing for people. As this could also lead to questions, the code clarifies a TODO item: we are still relying on hardcoded ints instead of evaluating the signal number at run time, as recommended by man 7 signal. Please read man 7 signal for further details. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

The current code repeats itself and relies on global vars. This is a problem, as changing the notifications will move code all over, and it actually prevents us from using a simple Send() method for all our notifications. This opens the door to a simple sender sending events to multiple places (webhooks, logs, and a notification service). Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
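A single Send() method fanning out to several destinations could be sketched as follows. Only the Send() name comes from the commit message; the Notifier interface, MultiNotifier, and the two example notifiers are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Notifier is a hypothetical interface exposing the single Send() method.
type Notifier interface {
	Send(message string) error
}

// MultiNotifier forwards each message to every wrapped notifier.
type MultiNotifier []Notifier

// Send delivers to all notifiers and joins any errors, so one failing
// destination does not prevent delivery to the others.
func (m MultiNotifier) Send(message string) error {
	var errs []error
	for _, n := range m {
		if err := n.Send(message); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

// memoryNotifier records messages in memory; it stands in for a log sink.
type memoryNotifier struct{ messages []string }

func (m *memoryNotifier) Send(message string) error {
	m.messages = append(m.messages, message)
	return nil
}

// failingNotifier simulates an unreachable webhook.
type failingNotifier struct{}

func (failingNotifier) Send(string) error { return errors.New("webhook unreachable") }

func main() {
	mem := &memoryNotifier{}
	all := MultiNotifier{mem, failingNotifier{}}
	err := all.Send("node worker-1 is rebooting")
	fmt.Println(mem.messages, err)
}
```

Callers then depend on one small interface rather than on global variables, and adding a destination means adding an element to the slice.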

Without this, we will rely on old URLs forever. It has been multiple cycles since we removed them; we are now in a good place to remove the vars. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Now that kured is a community project under the CNCF, it might be appropriate to use the official kured name, instead of the old weave.works name. As this is intrusive, there was no occasion to do it before. Because other commits in this branch are intrusive (change of main loop, removing old flags, ...) this is the occasion to clean up the cruft. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Without this, the flags are becoming a mess. This cleans up the order. Not very important. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Without this, we use the daemonset lock for annotating the state of the node. This is unnecessary, as we have rights on the node. This is a problem, as it makes the whole lock model more complex and prevents other implementations of the lock. This fixes it by adapting the drain/uncordon functions to rely on the node annotations, by introducing a new annotation named "kured.dev/node-unschedulable-before-drain". Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
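One plausible reading of the new annotation is that it records whether the node was already unschedulable before kured drained it, so that a node cordoned by an operator is not uncordoned by mistake. The sketch below works on that assumption; only the annotation name comes from the commit message, while the Node struct (a stand-in for the Kubernetes node object), the function names, and the "true"/"false" values are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// annotationKey is the annotation named in the commit message.
const annotationKey = "kured.dev/node-unschedulable-before-drain"

// Node is a minimal stand-in for the Kubernetes node object.
type Node struct {
	Annotations   map[string]string
	Unschedulable bool
}

// cordonForDrain records the previous scheduling state on the node itself,
// then marks the node unschedulable. The value encoding is an assumption.
func cordonForDrain(n *Node) {
	if n.Annotations == nil {
		n.Annotations = map[string]string{}
	}
	n.Annotations[annotationKey] = strconv.FormatBool(n.Unschedulable)
	n.Unschedulable = true
}

// uncordonAfterReboot restores the node only if this process cordoned it and
// the node was schedulable beforehand, then removes the annotation.
func uncordonAfterReboot(n *Node) {
	value, found := n.Annotations[annotationKey]
	if !found {
		return // nothing recorded: leave the node untouched
	}
	delete(n.Annotations, annotationKey)
	wasUnschedulable, err := strconv.ParseBool(value)
	if err == nil && !wasUnschedulable {
		n.Unschedulable = false
	}
}

func main() {
	n := &Node{}
	cordonForDrain(n)
	fmt.Println("after cordon:", n.Unschedulable, n.Annotations)
	uncordonAfterReboot(n)
	fmt.Println("after uncordon:", n.Unschedulable, n.Annotations)
}
```

Keeping this state on the node means the lock no longer has to carry it, which is the decoupling the commit message describes.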

This reduces the number of global variables and regroups all the operations in a single place. This will allow further refactoring to represent all the k8s operations kured needs on a single node. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Without this patch, while developing (on uncommitted work), the image with the previous SHA gets overridden, which leads to mishaps. This fixes it by ensuring only the kured:dev tags (full path and short one) are used everywhere. At the same time, it cleans up main_test to be more flexible by passing more of the main features as options. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Without this, the kured:dev image is used. This is a problem for podman, as podman will automatically replace kured:dev with docker.io/library/kured:dev, and this image is not loaded into the kind environment, resulting in CrashLoopBackoff when trying to pull the docker.io/library/kured:dev image. This fixes it by using a full kured path. This is a stopgap before cleaning up the Makefile of all development hacks (which should be in the test framework instead). Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Without this, each test is generating its own static file. This is a problem:

- It is brittle and hard to track what is actually used in each test
- files will proliferate with tests, especially in our v2 model
- it is harder to patch from code.

This fixes it by taking a more idiomatic go approach. This cleans up the tests to:

- generate all manifests from go code
- be ready to support multiple DS for v2
- be ready to support multiple runner types (not only kind) in the future.

The shell scripts will need to be cleaned up and converted into go code in another patch.

Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>

Without this, we will need to repeat the pflag handling in each cmd. This is a problem, as it will lead to discrepancies in the handling of env vars or CLI flags. This fixes it by moving the handling to an internal package, allowing it to be reused across the different CLIs. Signed-off-by: Jean-Philippe Evrard <open-source@a.spamming.party>
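The idea of one shared helper for flags and env vars can be sketched with the standard flag package (used here instead of pflag so the example has no external dependency). The helper names, the Config struct, the KURED_ prefix mapping, and the default values are assumptions of this sketch rather than the PR's implementation.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
	"time"
)

// Config holds the two illustrative settings of this sketch.
type Config struct {
	Period         time.Duration
	RebootSentinel string
}

// applyEnv sets every flag that was not given on the command line from an
// environment variable named prefix + UPPER_SNAKE(flag name), if present.
func applyEnv(fs *flag.FlagSet, prefix string, getenv func(string) string) error {
	explicit := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { explicit[f.Name] = true })
	var firstErr error
	fs.VisitAll(func(f *flag.Flag) {
		if explicit[f.Name] || firstErr != nil {
			return
		}
		key := prefix + strings.ToUpper(strings.ReplaceAll(f.Name, "-", "_"))
		if v := getenv(key); v != "" {
			if err := fs.Set(f.Name, v); err != nil {
				firstErr = fmt.Errorf("%s: %w", key, err)
			}
		}
	})
	return firstErr
}

// loadConfig parses CLI args first, then fills remaining flags from the env.
func loadConfig(args []string, getenv func(string) string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("kured", flag.ContinueOnError)
	fs.DurationVar(&cfg.Period, "period", time.Minute, "how often to check whether a reboot is required")
	fs.StringVar(&cfg.RebootSentinel, "reboot-sentinel", "/var/run/reboot-required", "path of the sentinel file")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if err := applyEnv(fs, "KURED_", getenv); err != nil {
		return cfg, err
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Printf("period=%s sentinel=%s\n", cfg.Period, cfg.RebootSentinel)
}
```

In this sketch an explicit CLI flag wins over the environment, and the environment wins over the default; each command that shares the helper gets the same precedence for free.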

github-actions · 2025-12-20T02:14:06Z

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

evrardjp force-pushed the second_rewrites branch from cc45f87 to a8e36f0 October 18, 2024 21:17

evrardjp force-pushed the second_rewrites branch 7 times, most recently from c86971c to d8285de November 7, 2024 21:15

evrardjp force-pushed the second_rewrites branch 3 times, most recently from b6d3ba5 to 0df8f7c November 11, 2024 20:03

github-actions bot added the no-pr-activity label Jan 11, 2025

github-actions bot removed the no-pr-activity label Jan 21, 2025

evrardjp force-pushed the second_rewrites branch 2 times, most recently from ea7ee6a to 0b99b3e February 4, 2025 07:56

github-actions bot added the no-pr-activity label May 10, 2025

github-actions bot closed this May 31, 2025

evrardjp reopened this Aug 30, 2025

github-actions bot removed the no-pr-activity label Sep 6, 2025

evrardjp added the FEATURE-v2 label (This is a feature improvement that needs to be taken into consideration for v2) Sep 20, 2025

evrardjp force-pushed the second_rewrites branch from 0b99b3e to 6937d90 October 7, 2025 22:09

pando85 mentioned this pull request Oct 8, 2025

Uncordon immediately after drain failure #1129

Open

evrardjp mentioned this pull request Oct 12, 2025

build(deps): bump github/codeql-action from 3.30.6 to 4.30.8 #1232

Closed

evrardjp added 17 commits October 12, 2025 16:42

Remove unnecessary functions

56097a3


Remove deprecated notifications URLs

358fc21


Fix flags order

4ec3cb9


Fix minor details to pass a stricter linter

d60305c


evrardjp force-pushed the second_rewrites branch from a9cfcf3 to fca84b0 October 12, 2025 15:20

evrardjp mentioned this pull request Oct 16, 2025

Add new kured metrics #1164

Open

github-actions bot added the no-pr-activity label Dec 20, 2025

dharsanb added the keep label (This won't be closed by the stale bot.) and removed the no-pr-activity label Dec 20, 2025
