core: timeouts for batch jobs#27803
Conversation
|
hey @schmichael @mismithhisler, thanks a lot for the comments! Many of the things you pointed out was messy code from the many iterations of this branch. Apologies, should've cleaned it up better. But amongst other things, I simplified the I think the main issue that remains is how to approach task-level state. In its current shape, this code always waits for tasks to start. This is good, but causes issues with task lifecycle events, as @schmichael pointed out. It also means After doodling a bit with a more sophisticated solution, I am slowly leaning towards a more brutal one: make the timer start immediately in the alloc runner, regardless of task state. What do you think about this? |
| ar.taskCoordinator.TaskStateUpdated(states) | ||
|
|
||
| // Get the client allocation | ||
| calloc := ar.clientAlloc(states) |
There was a problem hiding this comment.
To your point about this being tough because we have to wait for tasks to start. Doesn't this function do all the ugly "check to make sure all tasks are running" logic?
I'm curious if we can just pass the calloc.ID and calloc.ClientStatus to your MaxRunDuration object and have it just do the "yeah this alloc is now running, start a timer" logic
|
#27827 (branches off of this branch) implements the simplified version of What happens when we run it on
I am becoming more and more convinced this is a good direction. |
^ + your followup comment seem exciting to me. I'm EOD here but will review the other PRs tomorrow (and check out your internal demo recording). I can't think of a reason one approach would be more surprising than the other (to try to let least astonishment make our decision for us). |
Working on this problem I've been trying to explore "modular" solutions. In my mind, we could easily ship a "1.0" version of this that keeps things very simple, starting the timer in the allocrunner, having no regard for post-stop tasks (ok maaaybe this could be a switch), and just giving users the option to set timeouts for their task groups. We can see how the response from the community is, and later offer a more fine-grained set of knobs with a task-level I think it's the interaction between the tg-level and task-level setting that's the hardest part to get right in this feature. |
|
The CLI and UI should display the deadline in When updating unified-web-docs we should also make sure to link from |
yeah I'd rather this being separate PRs if you don't mind? I need more commits on |
This changeset introcudes
max_run_durationtask group configuration variablefor
batchandsysbatchjobs. It's enforced in the alloc runner by a max runhook. If the timer is up:
completeand with:dead.The
max_run_durationtimer starts in the allocation runner regardless of taskstates. That is, tasks that take longer to start than the timeout get
terminated. This is to avoid long-starting tasks to run too long.
Lifecycle tasks count to the
max_run_durationtimer.prestart,poststarttasks runtime count into the overall run time of the taskgroup, and
poststoptasks will not be started if the task was terminated due to running out of its
allocated time.
Two new metrics are emitted:
client.allocs.max_run_duration.configured_secondsclient.allocs.max_run_duration.remaining_secondsResolves #1782
Supersedes #18456
Internal ref: https://hashicorp.atlassian.net/browse/NMD-551