
fix(scheduler): resolve throttling on high-concurrency pod submissions#1723

Open
maishivamhoo123 wants to merge 7 commits into Project-HAMi:master from maishivamhoo123:fix/scheduler-throttling

Conversation

@maishivamhoo123
Contributor

Fixes: #1367

Root Cause:
When a high number of pods (e.g., 40+) requested GPUs simultaneously, the HAMi scheduler attempted to lock the node by patching it via the Kubernetes API. The first pod succeeded, but the remaining 39 triggered HTTP 409 API conflicts. The HAMi scheduler returned those conflicts as errors, forcing the kube-scheduler to place the pods into an exponential backoff queue of up to 5 minutes, resulting in severe scheduling delays.

What this PR does:

  1. In-Memory Queueing: Introduces an in-memory sync.Mutex in the Bind() function to gracefully queue concurrent pod scheduling attempts locally rather than outright rejecting them.
  2. Blind API Patch: Replaces the standard K8s API node update with a Blind Patch (types.MergePatchType). This safely applies the hami.io/mutex.lock annotation (which the Device Plugin requires to identify the pod) without triggering 409 resource version conflicts.
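The two changes can be sketched together as follows. This is a minimal illustration of the approach, not the PR's actual code: the names (lockForNode, bindPod, patchNode, mutexLockKey) are hypothetical stand-ins, and the client-go Patch call is abstracted behind a function parameter.

```go
package main

import (
	"fmt"
	"sync"
)

const mutexLockKey = "hami.io/mutex.lock" // annotation the device plugin reads

var (
	locksMu   sync.Mutex
	nodeLocks = map[string]*sync.Mutex{} // one in-memory lock per node
)

// lockForNode returns the in-memory mutex for a node, creating it on first use.
func lockForNode(node string) *sync.Mutex {
	locksMu.Lock()
	defer locksMu.Unlock()
	if nodeLocks[node] == nil {
		nodeLocks[node] = &sync.Mutex{}
	}
	return nodeLocks[node]
}

// bindPod serializes concurrent binds for a node, then applies the lock
// annotation as a blind JSON merge patch. A merge patch carries no
// resourceVersion, so it cannot fail with an HTTP 409 conflict.
// patchNode stands in for the client-go Nodes().Patch call.
func bindPod(node, podUID string, patchNode func(node string, patch []byte) error) error {
	mu := lockForNode(node)
	mu.Lock()         // later pods queue here instead of being rejected
	defer mu.Unlock() // released after this pod's bind completes

	patch := fmt.Appendf(nil, `{"metadata":{"annotations":{"%s":"%s"}}}`, mutexLockKey, podUID)
	return patchNode(node, patch)
}

func main() {
	var mu sync.Mutex
	applied := []string{}
	patch := func(node string, p []byte) error {
		mu.Lock()
		defer mu.Unlock()
		applied = append(applied, string(p))
		return nil
	}
	var wg sync.WaitGroup
	for i := 0; i < 40; i++ { // 40 concurrent pods, as in the bug report
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = bindPod("node-a", fmt.Sprintf("pod-%d", i), patch)
		}(i)
	}
	wg.Wait()
	fmt.Println(len(applied))
}
```

Note that because the merge patch is blind, it also skips the optimistic-concurrency check that a full node update performs, which is the trade-off discussed in the review comments.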

Testing:

  • Verified that submitting jobs with 40+ parallelism schedules pods instantly back-to-back without hitting the exponential backoff queue.
  • Passed make test and make verify.

Signed-off-by: maishivamhoo123 <maishivamhoo@gmail.com>
@hami-robot
Contributor

hami-robot bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maishivamhoo123
Once this PR has been reviewed and has the lgtm label, please assign shouren for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot requested a review from archlitchi April 1, 2026 17:20
@hami-robot hami-robot bot requested a review from DSFans2014 April 1, 2026 17:20
@hami-robot hami-robot bot added the size/L label Apr 1, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a per-node in-memory locking mechanism within the scheduler's Bind function to serialize concurrent binding requests and prevent them from triggering exponential backoff. It also adds a blind patch to node annotations for device plugin handshaking and includes a new test case to verify concurrent binding behavior. The review feedback identifies a critical deadlock vulnerability in the AcquireBindLock implementation where a timed-out context leaves a mutex permanently locked. Recommendations include refactoring the lock to use channel-based semaphores, utilizing exported constants for annotation keys, and ensuring the node patch operation respects the defined context timeout.
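The channel-based lock the review recommends can be sketched as follows; nodeLock and its methods are hypothetical names, not the PR's actual AcquireBindLock implementation. A buffered channel of size 1 acts as a semaphore, so a caller that times out simply abandons the acquisition without leaving the lock permanently held, avoiding the deadlock a bare sync.Mutex would cause.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// nodeLock is a context-aware mutual-exclusion lock built on a
// one-slot buffered channel.
type nodeLock struct{ ch chan struct{} }

func newNodeLock() *nodeLock { return &nodeLock{ch: make(chan struct{}, 1)} }

// acquire blocks until the lock is free or ctx expires. Unlike
// sync.Mutex.Lock, a timed-out caller never takes the slot, so the
// lock cannot be left locked by a caller that gave up.
func (l *nodeLock) acquire(ctx context.Context) error {
	select {
	case l.ch <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (l *nodeLock) release() { <-l.ch }

func main() {
	l := newNodeLock()
	ctx := context.Background()

	_ = l.acquire(ctx) // first pod takes the lock

	// second pod times out waiting instead of deadlocking forever
	ctx2, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	fmt.Println(l.acquire(ctx2))

	l.release()                 // first pod finishes; the lock is reusable
	fmt.Println(l.acquire(ctx)) // acquisition now succeeds
}
```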

@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 60.97561% with 16 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/scheduler/scheduler.go 38.46% 11 Missing and 5 partials ⚠️
Flag Coverage Δ
unittests 53.54% <60.97%> (+1.45%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/scheduler/event.go 100.00% <100.00%> (ø)
pkg/util/nodelock/nodelock.go 73.61% <100.00%> (+10.73%) ⬆️
pkg/scheduler/scheduler.go 47.30% <38.46%> (+4.81%) ⬆️

... and 7 files with indirect coverage changes

@github-actions github-actions bot added the kind/bug Something isn't working label Apr 1, 2026
return res, nil
}

patchData := fmt.Appendf(nil, `{"metadata":{"annotations":{"%s":"%s"}}}`, nodelockutil.NodeLockKey, string(current.UID))
Contributor


This patch makes the e2e test fail.
It seems the patchData overwrites the nodelock annotation.

Since the nodelock annotation is released by the device plugin while the in-memory lock is released after Bind, the inconsistency between the two may cause the problem.
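A toy illustration of this inconsistency, with JSON merge-patch semantics reduced to a single map key (the key name and the flow are simplified assumptions, not the actual nodelock format):

```go
package main

import "fmt"

const key = "hami.io/mutex.lock"

// mergePatchAnnotation mimics JSON merge-patch semantics for one key:
// the new value unconditionally replaces whatever is there.
func mergePatchAnnotation(annotations map[string]string, value string) {
	annotations[key] = value
}

func main() {
	node := map[string]string{}

	mergePatchAnnotation(node, "pod-a-uid") // scheduler locks the node for pod A
	// The in-memory lock is released right after Bind, but the device
	// plugin has not released the annotation yet. Pod B's blind patch
	// then overwrites pod A's lock instead of waiting or failing:
	mergePatchAnnotation(node, "pod-b-uid")
	fmt.Println(node[key])
}
```

A full update with a resourceVersion would have surfaced this as a conflict; the blind patch silently replaces the value, which is why the e2e test observes a clobbered nodelock annotation.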
