
Conversation

D3luxee commented Aug 7, 2025

This PR aims to resolve #1169 by implementing retry logic around the drain function call, with the drainer keeping its own timeout.

The globalDrainTimeout works together with drainTimeout: drainTimeout remains the timeout for a single drain attempt by the drainer, while globalDrainTimeout bounds the overall time spent trying to drain the node, after which kured aborts draining and releases the lock.

This keeps the same logic for all existing flags; kured's behavior only changes if globalDrainTimeout is set.
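
Below is a minimal sketch of how the two timeouts are meant to interact; the function names, signatures, and back-off handling are illustrative assumptions, not the actual kured code. Each drain attempt is bounded by drainTimeout, while the whole retry loop is bounded by globalDrainTimeout, after which draining is abandoned and the lock is released.

```go
// Illustrative sketch only: names, signatures, and durations are assumptions,
// not the real kured implementation.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// drainOnce stands in for a single drain attempt, bounded by drainTimeout.
// Here it always fails so the retry loop is visible when the example runs.
func drainOnce(ctx context.Context, node string, drainTimeout time.Duration) error {
	attemptCtx, cancel := context.WithTimeout(ctx, drainTimeout)
	defer cancel()
	<-attemptCtx.Done() // pretend the attempt ran until its own timeout
	return errors.New("drain attempt timed out")
}

// drainWithRetry retries drainOnce until it succeeds or globalDrainTimeout
// elapses, at which point the caller aborts draining and releases the lock.
func drainWithRetry(node string, drainTimeout, globalDrainTimeout, backoff time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), globalDrainTimeout)
	defer cancel()
	for {
		if err := drainOnce(ctx, node, drainTimeout); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Overall budget exhausted: give up and let the caller release the lock.
			return fmt.Errorf("giving up draining %s: %w", node, ctx.Err())
		case <-time.After(backoff):
		}
	}
}

func main() {
	// Illustrative values only; kured would take these from its flags.
	fmt.Println(drainWithRetry("node-1", 2*time.Second, 7*time.Second, time.Second))
}
```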

Signed-off-by: Georg Doser <georg.doser@nitrado.net>
evrardjp added the FEATURE-v2 and enhancement labels on Aug 11, 2025
evrardjp removed the FEATURE-v2 label on Aug 19, 2025
evrardjp (Collaborator) commented Aug 19, 2025

This is far too impactful for the long run of the project to go in without tests. We'll need some tests to guarantee the behaviour.

The initial issue is easy to test, and a test for it would guarantee the right behaviour. I don't think we need more than that for v1.

I am however on the fence about the implementation: retrying drains on the node is enough, do we need the complexity around the timeouts? What's the reason you came to the conclusion that it was the best?

D3luxee (Author) commented Sep 3, 2025

I agree on the testing part. I just pushed the changes I made, as we are currently running kured with these patches in our production environment.

retrying drains on the node is enough, do we need the complexity around the timeouts? What's the reason you came to the conclusion that it was the best?

Which complexity do you mean? Are you referring to globalDrainTimeout and why there are two behaviors?

I wanted to preserve the old behavior of drainTimeout, so I added globalDrainTimeout, which alters that behavior only when it is set.

Generally, the behavior I wanted is for kured to retry the drain call so the pod list gets refreshed, without uncordoning the node in between, which is what happens if you only set drainTimeout.

I could also add the retry logic to the drain() function so that it is applied generally without changing anything else.

I avoided altering the general logic too much to get this merged more easily, but I can implement it differently if you want.
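
For reference, here is a rough sketch of what retrying inside drain() itself could look like, assuming the drain path keeps using k8s.io/kubectl/pkg/drain the way kured does today; the package name, attempt count, and retry interval are illustrative only. The node is cordoned once and stays cordoned across retries, and each RunNodeDrain call re-lists the pods on the node, so recreated pods with static names are picked up on the next attempt.

```go
// Rough sketch, not the actual kured code; retry parameters are illustrative.
package drainretry // hypothetical package name for the sketch

import (
	"context"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	kubectldrain "k8s.io/kubectl/pkg/drain"
)

func drain(client kubernetes.Interface, node *corev1.Node, drainTimeout time.Duration) error {
	drainer := &kubectldrain.Helper{
		Ctx:                 context.Background(),
		Client:              client,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		Timeout:             drainTimeout,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	// Cordon once; the node stays cordoned across retries so nothing new
	// is scheduled onto it between attempts.
	if err := kubectldrain.RunCordonOrUncordon(drainer, node, true); err != nil {
		return fmt.Errorf("cordoning %s: %w", node.Name, err)
	}

	// Retry the drain itself; every call re-lists the pods on the node,
	// which picks up pods with static names that were recreated meanwhile.
	const maxAttempts = 5                  // illustrative
	const retryInterval = 30 * time.Second // illustrative
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = kubectldrain.RunNodeDrain(drainer, node.Name); err == nil {
			return nil
		}
		time.Sleep(retryInterval)
	}
	return fmt.Errorf("draining %s failed after %d attempts: %w", node.Name, maxAttempts, err)
}
```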

evrardjp added the FEATURE-v2 label on Sep 20, 2025
github-actions bot commented

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

github-actions bot closed this on Dec 13, 2025
evrardjp added the keep label and removed the no-pr-activity label on Dec 13, 2025
evrardjp reopened this on Dec 13, 2025

Labels

enhancement, FEATURE-v2, keep

Development

Successfully merging this pull request may close these issues.

Draining gets stuck for pods that have a static name
