batch job queue: adds initial empty implementation #27841
mismithhisler wants to merge 12 commits into f-batch-job-queue from …
Conversation
Force-pushed from a0ae725 to ac69bb6.
This is great work @mismithhisler. The only potentially blocking concern for me here is the waitForPlacement blocking forever.
```go
old := *pq
n := len(old)
item := old[n-1]
old[n-1] = nil // don't stop the GC from reclaiming the item eventually
```
Do we actually need this? I'm pretty sure the `old` slice goes out of scope at the end of this function, and the `*pq = old[0:n-1]` is copying its `(ptr, len, cap)`, not its contents. https://go.dev/play/p/xE0glBdR8O6
Maybe I'm missing something?
Admittedly, I took this directly from the `container/heap` `priorityQueue` example.
I'll take a deeper look at this today.
I did some quick digging and found golang/go#65403 and golang/go#65404 which leads to this Gerrit discussion https://go-review.googlesource.com/c/go/+/559775
I guess I'm wrong?
https://go.dev/play/p/7hR-IWTT7J9 shows it: the old pointer still lives in the backing array. If you uncomment the `old[n-1] = nil` and re-run this, it'll show as `0x0` (nil).
Yeah, I was wondering if this has to do with the fact that the capacity of the slice is still the same, so the backing array is probably still holding onto that pointer, even though it doesn't appear so.
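For later readers, here's a self-contained version of the pattern under discussion, adapted from the `container/heap` docs' PriorityQueue example (the `Item` fields here are illustrative):

```go
package main

import (
	"container/heap"
	"fmt"
)

type Item struct {
	value    string
	priority int
}

// priorityQueue follows the container/heap docs' PriorityQueue example.
type priorityQueue []*Item

func (pq priorityQueue) Len() int           { return len(pq) }
func (pq priorityQueue) Less(i, j int) bool { return pq[i].priority > pq[j].priority }
func (pq priorityQueue) Swap(i, j int)      { pq[i], pq[j] = pq[j], pq[i] }

func (pq *priorityQueue) Push(x any) { *pq = append(*pq, x.(*Item)) }

func (pq *priorityQueue) Pop() any {
	old := *pq
	n := len(old)
	item := old[n-1]
	// *pq = old[:n-1] below only shrinks the slice header; the backing
	// array still holds this slot, so nil it out to let the GC reclaim
	// the item eventually.
	old[n-1] = nil
	*pq = old[:n-1]
	return item
}

func main() {
	pq := priorityQueue{{value: "low", priority: 1}, {value: "high", priority: 2}}
	heap.Init(&pq)
	fmt.Println(heap.Pop(&pq).(*Item).value) // "high"
}
```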
```go
// to an internal channel to be processed and added to the actual
// heap container.
func (d *DynamicPriorityQueue) Enqueue(e *structs.Evaluation) {
	w := d.generateWorkload(e)
```
I realize this isn't wired-up yet but do we imagine we'll return the empty workload here if this is a non-batch job, or will we just not call Enqueue for those in the first place?
I haven't completely figured out the best way to "route" these evaluations yet. I was thinking that only batch jobs would be routed to this queue, and then if, for example, they didn't have the required metadata flag (if metadata was set), we would just pass them to the eval broker?
Open to ideas here though.
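To make the idea concrete, a hypothetical sketch with stand-in types — the `JobMeta` field, the trigger string, and the `dynamic_priority` metadata key are all made up for illustration, not part of this PR:

```go
package batchqueue

// Evaluation is an illustrative stand-in for Nomad's structs.Evaluation;
// only fields relevant to routing are shown.
type Evaluation struct {
	Type        string            // job type, e.g. "batch"
	TriggeredBy string            // e.g. "job-register"
	JobMeta     map[string]string // metadata from the job (invented field)
}

type enqueuer interface {
	Enqueue(*Evaluation)
}

// routeEval sends opted-in batch job-register evals to the dynamic
// priority queue and passes everything else straight to the eval broker.
func routeEval(e *Evaluation, queue, broker enqueuer) {
	if e.Type == "batch" &&
		e.TriggeredBy == "job-register" &&
		e.JobMeta["dynamic_priority"] == "true" {
		queue.Enqueue(e)
		return
	}
	broker.Enqueue(e)
}
```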
```go
// Wait for the eval to be placed
d.waitForPlacement(ctx, workload.eval)
```
Suppose a job author writes a job that can't ever be placed because they screwed up a constraint. Doesn't this end up blocking forever and preventing any further jobs from being enqueued to the eval broker? Do we need some way of "abandoning" a workload in this queue or otherwise saving it to be retried later?
Also, we don't do anything with the error returned from this.
Yeah, at the moment this would block forever, until the job was stopped and the eval was marked complete. We could add some configurable limit to `waitForPlacement` that stops the blocking query after some period of time has gone by. I'm not sure there's much we can do in the way of saving it to be retried later; once it's released to the eval broker, it's out of our hands.
Yeah, I forgot to handle this error, I'll get that updated. 😄
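For illustration, a minimal, self-contained sketch of what that configurable limit could look like — the stub `waitForPlacement` here just blocks until the context is done, and in the real queue the limit would presumably come from the user-facing config:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForPlacement stands in for the queue's blocking query; this stub
// simply blocks until the context is done.
func waitForPlacement(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// waitForPlacementWithLimit wraps the blocking wait in a deadline so a
// never-placeable eval can't wedge the queue forever.
func waitForPlacementWithLimit(ctx context.Context, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, limit)
	defer cancel()

	if err := waitForPlacement(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return fmt.Errorf("gave up waiting for placement: %w", err)
		}
		return err
	}
	return nil
}

func main() {
	err := waitForPlacementWithLimit(context.Background(), 50*time.Millisecond)
	fmt.Println(err) // gave up waiting for placement: context deadline exceeded
}
```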
> I'm not sure there's much we can do in the way of saving it to be retried later; once it's released to the eval broker, it's out of our hands.
That's a good point, but it also makes me realize a more fundamental issue: won't any blocked eval also end up being re-submitted to this queue? That means we'd be waiting here and never enqueueing the blocked eval into the eval broker in the first place. Right now I don't think we ever unblock in the case of a blocked eval. We should probably add a test that covers this workflow.
We would filter out any evals that are not `Eval.TriggeredBy == EvalTriggerJobRegister`, so we should be good there. It will get a little complex with new versions of jobs, but I'm just trying to get a basic queue in here.
Hopefully next I'll start wiring it up and working on state restore, which will also be a little complex because of leadership transfers.
```go
// conf contains user configurations for tuning the behavior of the queue
conf *DynamicPriorityConfig
```
Definitely a TBD, but do we think we're just going to blow away the whole queue if this configuration gets changed via API?
Yeah, I think just getting solid "restore the queue from state" functionality and relying on it for any conf changes is probably the best way to go, at least for now.
```go
select {
case <-doneCh:
	t.Fatal("should not have exited")
default:
}
```
Strictly speaking, this doesn't reliably exercise the desired behavior: `waitForPlacement` could still be on its first pass through the loop and not have had an opportunity to incorrectly return. If `waitForPlacement` were buggy, this test would be flaky rather than failing consistently.
I think the next subtest has the same problem?
Yeah you're absolutely right, I'll look into a better way to write this test.
Added a wait for the new test watchset, which guarantees that the goroutine has begun its actual blocking query before we upsert an eval update.
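Roughly, the shape of that fix — a `started` channel stands in here for the watchset signal, and all names are illustrative rather than taken from the PR:

```go
package batchqueue

import (
	"testing"
	"time"
)

func TestWaitForPlacement_blocks(t *testing.T) {
	started := make(chan struct{})
	doneCh := make(chan struct{})
	block := make(chan struct{}) // stand-in for the blocking query itself

	go func() {
		defer close(doneCh)
		close(started) // mirrors the watchset signal: the query is underway
		<-block
	}()

	// Guarantee the goroutine is past its setup before mutating state;
	// without this the assertion below could pass by accident.
	<-started

	// ... upsert the eval update here ...

	select {
	case <-doneCh:
		t.Fatal("should not have exited")
	case <-time.After(100 * time.Millisecond):
		// still blocking, as intended
	}

	close(block) // release the goroutine so it doesn't leak
	<-doneCh
}
```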
Co-authored-by: Tim Gross <tim@0x74696d.com>
Description
These changes add an initial draft implementation of an empty batch job queue, including the queue's core data structures and the ability to watch evals for job placement. To facilitate easier reviews, much of the implementation has been left for further PRs.
Testing & Reproduction steps
Links
Contributor Checklist

- [ ] Add a changelog entry using the `make cl` command.
- [ ] Add tests to ensure regressions will be caught.
- [ ] If the change impacts the CLI, API, UI, and job configuration, please update the Nomad product documentation, which is stored in the `web-unified-docs` repo. Refer to the `web-unified-docs` contributor guide for docs guidelines. Please also consider whether the change requires notes within the upgrade guide. If you would like help with the docs, tag the `nomad-docs` team in this PR.

Reviewer Checklist

- [ ] Apply any needed backport labels per the backporting document.
- [ ] Use squash-and-merge in the majority of situations; the main exceptions are long-lived feature branches or merges where history should be preserved.
- [ ] … within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.