feat: add TTL and activeDeadlineSeconds #3258
XploY04 wants to merge 12 commits into kubeflow:master
Pull request overview
This PR introduces lifecycle management knobs to Kubeflow Trainer by adding activeDeadlineSeconds (per-TrainJob runtime limit) and ttlSecondsAfterFinished (runtime-default cleanup policy) across the API, controller logic, admission validation, and integration coverage.
Changes:
- Add `ActiveDeadlineSeconds` to `TrainJobSpec` and `TTLSecondsAfterFinished` to `TrainingRuntimeSpec` (and ClusterTrainingRuntime), including CRDs/OpenAPI and apply-config types.
- Implement TrainJob controller reconciliation for TTL-based deletion and deadline-based failure signaling.
- Add webhook warnings for very short TTL values and integration tests/util wrappers to exercise the new behavior.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/apis/trainer/v1alpha1/trainjob_types.go | Adds ActiveDeadlineSeconds field and deadline-exceeded reason constant. |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds TTLSecondsAfterFinished plus CEL validations preventing conflicting template TTL/deadline fields. |
| pkg/controller/trainjob_controller.go | Adds TTL resolution + TTL reconciliation and deadline reconciliation logic. |
| pkg/constants/constants.go | Adds a standardized message for deadline-exceeded failures. |
| pkg/webhooks/trainingruntime_webhook.go | Adds admission warnings for very short TTL and validates replicated jobs on update. |
| pkg/webhooks/clustertrainingruntime_webhook.go | Adds admission warnings for very short TTL on create/update. |
| pkg/util/testing/wrapper.go | Extends test wrappers to set ActiveDeadlineSeconds / TTLSecondsAfterFinished. |
| test/integration/controller/trainjob_controller_test.go | Adds integration coverage for TTL deletion and deadline failure. |
| test/integration/webhooks/trainingruntime_webhook_test.go | Adds integration cases intended to cover TTL/deadline webhook validation paths. |
| test/integration/webhooks/clustertrainingruntime_webhook_test.go | Adds integration cases intended to cover TTL/deadline webhook validation paths. |
| pkg/client/applyconfiguration/trainer/v1alpha1/trainjobspec.go | Adds apply-config builder support for activeDeadlineSeconds. |
| pkg/client/applyconfiguration/trainer/v1alpha1/trainingruntimespec.go | Adds apply-config builder support for ttlSecondsAfterFinished. |
| manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml | Publishes activeDeadlineSeconds in CRD schema. |
| manifests/base/crds/trainer.kubeflow.org_trainingruntimes.yaml | Publishes ttlSecondsAfterFinished + CEL validations in CRD schema. |
| manifests/base/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml | Publishes ttlSecondsAfterFinished + CEL validations in CRD schema. |
| charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml | Mirrors TrainJob CRD changes for Helm. |
| charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml | Mirrors TrainingRuntime CRD changes for Helm. |
| charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml | Mirrors ClusterTrainingRuntime CRD changes for Helm. |
| pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go | Regenerates deepcopy for new pointer fields. |
| pkg/apis/trainer/v1alpha1/zz_generated.openapi.go | Regenerates OpenAPI for new fields. |
| api/openapi-spec/swagger.json | Updates published swagger with new fields. |
| proposal.md | Adds a KEP-style proposal documenting API design and behavior. |
…rainer APIs

- Add ActiveDeadlineSeconds *int64 to TrainJobSpec (immutable, min=1) with reason constant TrainJobDeadlineExceededReason="DeadlineExceeded"
- Add TTLSecondsAfterFinished *int32 to TrainingRuntimeSpec (min=0) with cross-field CEL rules blocking conflicting lifecycle fields in the JobSet/Job template
- Add TrainJobDeadlineExceededMessage to constants
- Regenerate zz_generated.deepcopy.go

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
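For readers skimming the commit, the two new fields can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the structs are trimmed-down stand-ins for the real API types, and `validateLifecycle` is a hypothetical helper that restates the stated minimums (min=1 and min=0) in plain Go. In the PR itself those bounds are presumably enforced by the CRD schema rather than by a function like this.

```go
package main

import (
	"errors"
	"fmt"
)

// TrainJobSpec is a simplified stand-in for the real API type;
// only the field added by this commit is shown.
type TrainJobSpec struct {
	// ActiveDeadlineSeconds limits how long the TrainJob may run.
	// Optional; the commit message states a minimum of 1.
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
}

// TrainingRuntimeSpec is a simplified stand-in for the real API type.
type TrainingRuntimeSpec struct {
	// TTLSecondsAfterFinished is how long a finished TrainJob is kept.
	// Optional; the commit message states a minimum of 0.
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
}

// validateLifecycle is a hypothetical helper (not part of the PR) that
// mirrors the documented minimums. Unset (nil) fields are always accepted.
func validateLifecycle(job TrainJobSpec, rt TrainingRuntimeSpec) error {
	if d := job.ActiveDeadlineSeconds; d != nil && *d < 1 {
		return errors.New("activeDeadlineSeconds must be >= 1")
	}
	if ttl := rt.TTLSecondsAfterFinished; ttl != nil && *ttl < 0 {
		return errors.New("ttlSecondsAfterFinished must be >= 0")
	}
	return nil
}

func main() {
	deadline, ttl := int64(3600), int32(0)
	err := validateLifecycle(
		TrainJobSpec{ActiveDeadlineSeconds: &deadline},
		TrainingRuntimeSpec{TTLSecondsAfterFinished: &ttl},
	)
	fmt.Println("valid:", err == nil)
}
```

Both fields are pointers so that "unset" can be told apart from an explicit zero, which matters for the TTL: a value of 0 is allowed, while nil means no TTL was configured.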
- Add early-exit guard in Reconcile before prevTrainJob DeepCopy and removeFailedCondition, mirroring the Kubernetes Job controller pattern. Once a TrainJob is terminal (Complete or Failed), only TTL cleanup runs.
- Add fetchTTLFromRuntime helper to retrieve TTLSecondsAfterFinished from the referenced TrainingRuntime or ClusterTrainingRuntime.
- Add reconcileTTL helper to delete finished TrainJobs after their TTL expires, or requeue at the exact expiry time.

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
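The "delete after expiry, otherwise requeue at the exact expiry time" decision described above can be sketched as a pure function. This is an illustration under assumptions, not the PR's `reconcileTTL`: `ttlDecision`, `decideTTL`, and `atSecond` are hypothetical names, and treating a nil TTL as "never clean up" is an assumption about the intended default.

```go
package main

import (
	"fmt"
	"time"
)

// ttlDecision describes what a reconciler would do for a finished job.
type ttlDecision struct {
	Delete       bool          // true once the TTL has expired
	RequeueAfter time.Duration // time left until expiry when not yet expired
}

// decideTTL is a hypothetical helper. It assumes the job is already terminal
// and that finishedAt is the time it reached that state.
func decideTTL(finishedAt, now time.Time, ttlSeconds *int32) ttlDecision {
	if ttlSeconds == nil {
		// Assumption: no TTL configured means no automatic cleanup.
		return ttlDecision{}
	}
	expiry := finishedAt.Add(time.Duration(*ttlSeconds) * time.Second)
	if !now.Before(expiry) {
		return ttlDecision{Delete: true}
	}
	// Not expired yet: come back exactly when the TTL runs out.
	return ttlDecision{RequeueAfter: expiry.Sub(now)}
}

// atSecond builds a deterministic timestamp for the example.
func atSecond(s int64) time.Time { return time.Unix(s, 0) }

func main() {
	ttl := int32(300)
	finished := atSecond(1000)
	fmt.Printf("%+v\n", decideTTL(finished, atSecond(1100), &ttl))
	fmt.Printf("%+v\n", decideTTL(finished, atSecond(1300), &ttl))
}
```

Keeping the time arithmetic separate from the API calls makes the boundary cases (exactly at expiry, TTL of 0) easy to unit test without a cluster.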
- Add reconcileDeadline helper to terminate running TrainJobs that exceed their ActiveDeadlineSeconds.
- Handle job suspension: StartTime is determined by the LastTransitionTime of the Suspended=False condition (if it exists) to ensure the deadline timer resets correctly upon resume.
- Jobs exceeding the deadline are moved to a Failed state with reason DeadlineExceeded.

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
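A rough sketch of the deadline logic described above, with the suspension handling made explicit. It is not the PR's `reconcileDeadline`: `condition` is a trimmed-down stand-in for `metav1.Condition`, the helper names are hypothetical, the condition type string "Suspended" is assumed, and falling back to the creation time when no Suspended=False condition exists is an assumption, since the commit message does not say what the fallback is.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a trimmed-down stand-in for metav1.Condition.
type condition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// deadlineStart picks the instant the deadline timer counts from.
// If a Suspended=False condition exists, its LastTransitionTime is used,
// so the timer restarts when a suspended job is resumed.
func deadlineStart(createdAt time.Time, conds []condition) time.Time {
	for _, c := range conds {
		if c.Type == "Suspended" && c.Status == "False" {
			return c.LastTransitionTime
		}
	}
	// Assumption: with no such condition, count from creation time.
	return createdAt
}

// deadlineExceeded reports whether the deadline has passed and, if not,
// how long remains (a natural value for a requeue delay).
func deadlineExceeded(start, now time.Time, deadlineSeconds *int64) (bool, time.Duration) {
	if deadlineSeconds == nil {
		return false, 0 // no deadline configured
	}
	remaining := start.Add(time.Duration(*deadlineSeconds) * time.Second).Sub(now)
	if remaining <= 0 {
		return true, 0
	}
	return false, remaining
}

// atSecond builds a deterministic timestamp for the example.
func atSecond(s int64) time.Time { return time.Unix(s, 0) }

func main() {
	limit := int64(60)
	conds := []condition{{Type: "Suspended", Status: "False", LastTransitionTime: atSecond(500)}}
	start := deadlineStart(atSecond(100), conds)
	exceeded, remaining := deadlineExceeded(start, atSecond(530), &limit)
	fmt.Println(exceeded, remaining)
}
```

When the first return value is true, the controller would mark the TrainJob Failed with reason DeadlineExceeded, as the commit message describes.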
- Add an admission warning to both TrainingRuntime and ClusterTrainingRuntime webhooks if the TTLSecondsAfterFinished is configured to less than 60 seconds, as this may cause unexpected data/log loss for data scientists due to overly aggressive cleanup.

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
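The warning rule above is small enough to sketch directly. This is an illustration, not the webhook code from the PR: `ttlWarnings` and `shortTTLThresholdSeconds` are hypothetical names and the message text is invented. A plain `[]string` is used here to keep the example dependency-free; a real controller-runtime webhook would return its warnings type instead.

```go
package main

import "fmt"

// shortTTLThresholdSeconds is the 60-second cutoff named in the commit message.
const shortTTLThresholdSeconds = 60

// ttlWarnings is a hypothetical helper that returns a non-blocking warning
// when the TTL is set but shorter than the threshold. The object is still
// admitted; the warning only informs the user.
func ttlWarnings(ttlSeconds *int32) []string {
	if ttlSeconds == nil || *ttlSeconds >= shortTTLThresholdSeconds {
		return nil
	}
	return []string{fmt.Sprintf(
		"ttlSecondsAfterFinished is %d seconds, below %d; finished TrainJobs may be deleted before logs or results are collected",
		*ttlSeconds, shortTTLThresholdSeconds,
	)}
}

func main() {
	short := int32(10)
	for _, w := range ttlWarnings(&short) {
		fmt.Println("warning:", w)
	}
}
```

Because the same rule applies to both TrainingRuntime and ClusterTrainingRuntime, a shared helper of this shape would keep the two webhooks consistent.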
- Add integration tests for DeadlineExceeded behavior in TrainJob controller
- Add integration tests for TTL validation and warnings in webhooks

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…gress

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

…queue

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

…ations and associated validation.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

…Runtime

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #2899
KEP #3068
Checklist: