Description
In a deep learning cluster, it is common for many kinds of tasks (such as TFJob, MPIJob, Deployment, and StatefulSet) submitted by users to wait for resources to be allocated. Unfortunately, the Pod is the minimal scheduling unit in Kubernetes, which makes it hard to manage tasks the way other cluster managers like Slurm do.
To fill this gap, @denkensk and I have worked with other contributors to build kube-queue, a new queue system for tasks on Kubernetes clusters. Unlike the queue in Volcano, kube-queue does not hijack the creation/submission of tasks. Instead, kube-queue relies on the operator of each task API (like TFJob, MPIJob) to wait until kube-queue confirms a clear ready-to-go message, which is delivered to the task itself via an annotation on the CR.
We'd like to integrate kube-queue with training-operator, which requires minimal changes to the Reconcile method:
    import (
        ...
        queuev1alpha1 "github.com/kube-queue/pkg/apis/scheduling/v1alpha1"
        ...
    )

    func (r *XXJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        ...
        // Hold the job while kube-queue has not yet marked it ready to run.
        if queuev1alpha1.JobSuspended(job) {
            logger.Info("job suspended by kube-queue")
            // Check again later instead of proceeding with the job now.
            return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
        }
        ...
    }

This logic can be turned on and off via a launch argument of training-operator.
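As a rough illustration of how such a switch might be wired up, here is a minimal sketch using only the Go standard library. It is not code from kube-queue or training-operator: the flag names `--enable-kube-queue` and `--queue-requeue-interval`, the annotation key, and the helpers `parseOptions`, `jobSuspended`, and `gate` are all hypothetical, and `result` is a local stand-in for controller-runtime's `ctrl.Result`.

```go
package main

import (
	"flag"
	"time"
)

// suspendAnnotation is a HYPOTHETICAL annotation key used only for this
// sketch; the real key is whatever kube-queue defines.
const suspendAnnotation = "example.com/kube-queue-suspended"

// result is a local stand-in for controller-runtime's ctrl.Result.
type result struct {
	RequeueAfter time.Duration
}

// options holds the (hypothetical) launch arguments of the operator.
type options struct {
	enableQueue     bool
	requeueInterval time.Duration
}

// parseOptions parses launch arguments. Queue integration is off by default,
// so existing deployments keep their current behavior.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("training-operator", flag.ContinueOnError)
	fs.BoolVar(&o.enableQueue, "enable-kube-queue", false,
		"wait for kube-queue before running jobs (hypothetical flag)")
	fs.DurationVar(&o.requeueInterval, "queue-requeue-interval", 10*time.Second,
		"how long to wait before re-checking a suspended job (hypothetical flag)")
	err := fs.Parse(args)
	return o, err
}

// jobSuspended reports whether the job's annotations mark it as still queued.
// Reading from a nil map is safe in Go and yields the empty string.
func jobSuspended(annotations map[string]string) bool {
	return annotations[suspendAnnotation] == "true"
}

// gate decides whether Reconcile should stop early. It returns true together
// with a requeue delay when the integration is enabled and the job is still
// suspended; otherwise reconciliation proceeds as usual.
func gate(o options, annotations map[string]string) (result, bool) {
	if o.enableQueue && jobSuspended(annotations) {
		return result{RequeueAfter: o.requeueInterval}, true
	}
	return result{}, false
}
```

In a real reconciler, `gate` would be called near the top of `Reconcile` with the job's annotations, and a `true` return would translate into returning `ctrl.Result{RequeueAfter: ...}, nil`. Keeping the check behind a flag that defaults to off means clusters without kube-queue are unaffected.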
The kube-queue proposal has been submitted to Kubernetes wg-batch and is pending further discussion. The implementation is already managing thousands of tasks within Alibaba and Baidu.