Skip to content

fix(scheduler): schedule tasks on new pods when BatchSandbox scales out#399

Open
zerone0x wants to merge 1 commit intoalibaba:mainfrom
zerone0x:fix/batchsandbox-scale-out-tasks
Open

fix(scheduler): schedule tasks on new pods when BatchSandbox scales out#399
zerone0x wants to merge 1 commit intoalibaba:mainfrom
zerone0x:fix/batchsandbox-scale-out-tasks

Conversation

@zerone0x
Copy link
Contributor

Summary

Closes #102

When a BatchSandbox is scaled out (replicas increased), the existing in-memory TaskScheduler was not informed of the new task specs for the additional replicas. As a result, newly created pods received no task assignment, leaving them idle indefinitely.

Root Causes & Fixes

1. TaskScheduler.AddTasks() — new interface method

Appends task nodes for specs not yet tracked by the scheduler. Already-tracked task names are silently skipped, making it safe to call with the full task list on every reconciliation cycle (no duplicates).

2. Call AddTasks() in getTaskScheduler() after UpdatePods()

The existing code only called UpdatePods(pods) when a scheduler was retrieved from the cache. Now it also calls AddTasks(taskSpecs) with the current replica set so that any replicas added since the scheduler was first created get their task nodes registered immediately and can receive pod assignments on the next Schedule() call.

3. Remove spurious pods = append(pods, pod) in scaleBatchSandbox()

This line doubled every entry in the local pods slice on each iteration without any practical effect (Go's range evaluates the initial slice header once), making the code harder to reason about. Removing it has no behaviour change but improves clarity.

Changes

File Change
kubernetes/internal/scheduler/interface.go Add AddTasks([]*api.Task) error to TaskScheduler interface
kubernetes/internal/scheduler/default_scheduler.go Implement AddTasks on defaultTaskScheduler
kubernetes/internal/scheduler/mock/interface.go Update generated mock to include AddTasks
kubernetes/internal/controller/batchsandbox_controller.go Call AddTasks on scale-out; remove spurious append
kubernetes/internal/scheduler/default_scheduler_test.go Add Test_addTasks covering scale-out, no-op, and empty-scheduler cases

Testing

  • Existing unit tests pass (TestBatchSandboxReconciler_scheduleTasks, Test_parseIndex, Test_assignTaskNodes, Test_scheduleTaskNodes, Test_initTaskNodes)
  • New Test_addTasks covering three scenarios: scale-out (new tasks appended, existing skipped), no-op (same tasks, no duplicates), empty scheduler (initial population)
  • Integration test (requires a running Kubernetes cluster)

Breaking Changes

  • None. AddTasks is a new method added to the TaskScheduler interface. The mock has been updated accordingly; any other custom implementations of this internal interface will need to add the method.

When a BatchSandbox is scaled out (replicas increased), the existing
in-memory TaskScheduler was not informed of the new task specs for the
additional replicas. As a result, newly created pods received no task
assignment, leaving them idle.

Root causes fixed:
1. Add TaskScheduler.AddTasks() – a new interface method that appends
   task nodes for specs not yet tracked by the scheduler. Already-
   tracked task names are silently skipped, making it safe to call with
   the full task list on every reconciliation cycle.
2. Call AddTasks() in getTaskScheduler() after UpdatePods() so that any
   replicas added since the scheduler was first created get their task
   nodes registered immediately.
3. Remove the spurious  in scaleBatchSandbox()
   that doubled every entry in the local pods slice on each iteration
   without any effect (range evaluated the initial slice length), making
   the code easier to read and reason about.

Fixes alibaba#102

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 10, 2026 02:52
@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes the scheduler cache behavior so newly created pods receive task assignments after a BatchSandbox scales out (replicas increased).

Changes:

  • Added AddTasks() to the TaskScheduler interface and implemented it in the default scheduler.
  • Updated controller reconciliation to call AddTasks() after UpdatePods() when reusing a cached scheduler.
  • Updated mocks and added unit tests for AddTasks; removed a confusing no-op append in scale code.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
kubernetes/internal/scheduler/interface.go Extends scheduler interface with AddTasks() to support scale-out.
kubernetes/internal/scheduler/default_scheduler.go Implements AddTasks() to append previously-untracked task nodes.
kubernetes/internal/scheduler/mock/interface.go Updates generated gomock to include AddTasks().
kubernetes/internal/controller/batchsandbox_controller.go Calls AddTasks() on cached scheduler and removes spurious append.
kubernetes/internal/scheduler/default_scheduler_test.go Adds unit tests validating AddTasks() behavior across scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 27 to +33
UpdatePods(pod []*corev1.Pod)
ListTask() []Task
StopTask() []Task
// AddTasks registers task specs that are not yet tracked by the scheduler.
// Tasks whose names are already tracked are silently skipped, making this
// safe to call with the full task list during a scale-out reconciliation.
AddTasks(tasks []*apis.Task) error
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new AddTasks method uses *apis.Task here, but the implementation and mock in this PR use *api.Task. If these are different packages/types, defaultTaskScheduler and MockTaskScheduler will not satisfy TaskScheduler and the build will fail. Unify the task type across the interface, default implementation, and mock (same import path and identifier), and regenerate mocks if needed.

Copilot uses AI. Check for mistakes.
Comment on lines +221 to +233
func (sch *defaultTaskScheduler) AddTasks(tasks []*api.Task) error {
newNodes, err := initTaskNodes(tasks)
if err != nil {
return err
}
for _, node := range newNodes {
if _, exists := sch.taskNodeByNameIndex[node.Name]; !exists {
sch.taskNodes = append(sch.taskNodes, node)
sch.taskNodeByNameIndex[node.Name] = node
}
}
return nil
}
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AddTasks currently calls initTaskNodes(tasks) for the full task list every time, even though most tasks may already be tracked. In a reconciliation loop, this can become unnecessarily expensive as replica counts grow (allocating []*taskNode proportional to total replicas every call). Consider filtering to only the missing task specs first (by checking taskNodeByNameIndex on the incoming tasks), then initializing/appending nodes only for that smaller subset.

Copilot uses AI. Check for mistakes.
@CLAassistant
Copy link

CLAassistant commented Mar 10, 2026

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: no task scheduled on new created Sandbox when BatchSandbox scale out

3 participants