feat: add TTL and activeDeadlineSeconds #3258

Open
XploY04 wants to merge 12 commits into kubeflow:master from XploY04:ttl

XploY04 commented Feb 25, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2899
KEP #3068

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings February 25, 2026 18:11
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces lifecycle management knobs to Kubeflow Trainer by adding activeDeadlineSeconds (per-TrainJob runtime limit) and ttlSecondsAfterFinished (runtime-default cleanup policy) across the API, controller logic, admission validation, and integration coverage.

Changes:

  • Add ActiveDeadlineSeconds to TrainJobSpec and TTLSecondsAfterFinished to TrainingRuntimeSpec (and ClusterTrainingRuntime), including CRDs/OpenAPI and apply-config types.
  • Implement TrainJob controller reconciliation for TTL-based deletion and deadline-based failure signaling.
  • Add webhook warnings for very short TTL values and integration tests/util wrappers to exercise the new behavior.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/apis/trainer/v1alpha1/trainjob_types.go | Adds ActiveDeadlineSeconds field and deadline-exceeded reason constant. |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds TTLSecondsAfterFinished plus CEL validations preventing conflicting template TTL/deadline fields. |
| pkg/controller/trainjob_controller.go | Adds TTL resolution + TTL reconciliation and deadline reconciliation logic. |
| pkg/constants/constants.go | Adds a standardized message for deadline-exceeded failures. |
| pkg/webhooks/trainingruntime_webhook.go | Adds admission warnings for very short TTL and validates replicated jobs on update. |
| pkg/webhooks/clustertrainingruntime_webhook.go | Adds admission warnings for very short TTL on create/update. |
| pkg/util/testing/wrapper.go | Extends test wrappers to set ActiveDeadlineSeconds / TTLSecondsAfterFinished. |
| test/integration/controller/trainjob_controller_test.go | Adds integration coverage for TTL deletion and deadline failure. |
| test/integration/webhooks/trainingruntime_webhook_test.go | Adds integration cases intended to cover TTL/deadline webhook validation paths. |
| test/integration/webhooks/clustertrainingruntime_webhook_test.go | Adds integration cases intended to cover TTL/deadline webhook validation paths. |
| pkg/client/applyconfiguration/trainer/v1alpha1/trainjobspec.go | Adds apply-config builder support for activeDeadlineSeconds. |
| pkg/client/applyconfiguration/trainer/v1alpha1/trainingruntimespec.go | Adds apply-config builder support for ttlSecondsAfterFinished. |
| manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml | Publishes activeDeadlineSeconds in CRD schema. |
| manifests/base/crds/trainer.kubeflow.org_trainingruntimes.yaml | Publishes ttlSecondsAfterFinished + CEL validations in CRD schema. |
| manifests/base/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml | Publishes ttlSecondsAfterFinished + CEL validations in CRD schema. |
| charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml | Mirrors TrainJob CRD changes for Helm. |
| charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml | Mirrors TrainingRuntime CRD changes for Helm. |
| charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml | Mirrors ClusterTrainingRuntime CRD changes for Helm. |
| pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go | Regenerates deepcopy for new pointer fields. |
| pkg/apis/trainer/v1alpha1/zz_generated.openapi.go | Regenerates OpenAPI for new fields. |
| api/openapi-spec/swagger.json | Updates published swagger with new fields. |
| proposal.md | Adds a KEP-style proposal documenting API design and behavior. |

…rainer APIs

- Add ActiveDeadlineSeconds *int64 to TrainJobSpec (immutable, min=1)
  with reason constant TrainJobDeadlineExceededReason="DeadlineExceeded"
- Add TTLSecondsAfterFinished *int32 to TrainingRuntimeSpec (min=0)
  with cross-field CEL rules blocking conflicting lifecycle fields
  in the JobSet/Job template
- Add TrainJobDeadlineExceededMessage to constants
- Regenerate zz_generated.deepcopy.go

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
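
For readers skimming the commit above, here is a minimal sketch of what the two new API fields might look like. The field names, pointer types, minimums, and the reason constant come from the commit message; the struct bodies are trimmed stand-ins, the kubebuilder markers are an assumption about how the constraints are declared, and `validateDeadline`, `validateTTL`, and `equalInt64Ptr` are hypothetical helpers that only restate the constraints in plain Go.

```go
package main

import (
	"errors"
	"fmt"
)

// TrainJobDeadlineExceededReason is the reason constant named in the commit message.
const TrainJobDeadlineExceededReason = "DeadlineExceeded"

// TrainJobSpec is a trimmed stand-in for the real type; only the new field is shown.
type TrainJobSpec struct {
	// ActiveDeadlineSeconds caps how long the TrainJob may run; nil means no limit.
	// The markers below are an assumption, not copied from the PR.
	// +optional
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="field is immutable"
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
}

// TrainingRuntimeSpec is likewise trimmed to the new field.
type TrainingRuntimeSpec struct {
	// TTLSecondsAfterFinished is the delay before a finished TrainJob is cleaned up.
	// +optional
	// +kubebuilder:validation:Minimum=0
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
}

// validateDeadline (hypothetical) enforces min=1 and, on update, immutability.
// Pass oldSpec == nil for a create request. Treating unset-to-set as a change
// is a simplification and may differ from the behavior of the CRD rule.
func validateDeadline(oldSpec *TrainJobSpec, newSpec TrainJobSpec) error {
	d := newSpec.ActiveDeadlineSeconds
	if d != nil && *d < 1 {
		return errors.New("activeDeadlineSeconds must be >= 1")
	}
	if oldSpec != nil && !equalInt64Ptr(oldSpec.ActiveDeadlineSeconds, d) {
		return errors.New("activeDeadlineSeconds is immutable")
	}
	return nil
}

// validateTTL (hypothetical) enforces min=0.
func validateTTL(spec TrainingRuntimeSpec) error {
	if t := spec.TTLSecondsAfterFinished; t != nil && *t < 0 {
		return errors.New("ttlSecondsAfterFinished must be >= 0")
	}
	return nil
}

// equalInt64Ptr compares two optional values: both nil, or both set and equal.
func equalInt64Ptr(a, b *int64) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

func main() {
	hour, twoHours := int64(3600), int64(7200)
	old := TrainJobSpec{ActiveDeadlineSeconds: &hour}
	fmt.Println(validateDeadline(nil, old))                                             // create: accepted
	fmt.Println(validateDeadline(&old, TrainJobSpec{ActiveDeadlineSeconds: &twoHours})) // update: rejected
}
```

Both fields are pointers so that "unset" stays distinguishable from an explicit value, which matters for the TTL, where 0 is a valid setting.
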
- Add early-exit guard in Reconcile before prevTrainJob DeepCopy and
  removeFailedCondition, mirroring the Kubernetes Job controller pattern.
  Once a TrainJob is terminal (Complete or Failed), only TTL cleanup runs.
- Add fetchTTLFromRuntime helper to retrieve TTLSecondsAfterFinished from
  the referenced TrainingRuntime or ClusterTrainingRuntime.
- Add reconcileTTL helper to delete finished TrainJobs after their TTL
  expires, or requeue at the exact expiry time.

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
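
The "delete after the TTL expires, or requeue at the exact expiry time" decision described above can be sketched as a pure function. This is an illustration rather than the PR's `reconcileTTL`: `ttlAction` and `epoch` are hypothetical names, the finish time is assumed to be supplied by the caller (the commit message does not say how the controller derives it), and a nil TTL is assumed to mean that no cleanup is configured.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is an arbitrary fixed instant used by the demo in main.
var epoch = time.Date(2026, time.February, 25, 18, 0, 0, 0, time.UTC)

// ttlAction decides what one TTL reconcile pass should do for a finished job.
// It returns deleteNow=true once the TTL has elapsed; otherwise it returns
// how long to wait so the next pass lands exactly on the expiry time.
func ttlAction(finishedAt time.Time, ttl *int32, now time.Time) (deleteNow bool, requeueAfter time.Duration) {
	if ttl == nil {
		// Assumption: no TTL on the runtime means nothing to delete or requeue.
		return false, 0
	}
	expiry := finishedAt.Add(time.Duration(*ttl) * time.Second)
	if !now.Before(expiry) {
		return true, 0
	}
	return false, expiry.Sub(now)
}

func main() {
	ttl := int32(100)
	for _, elapsed := range []time.Duration{30 * time.Second, 100 * time.Second} {
		del, wait := ttlAction(epoch, &ttl, epoch.Add(elapsed))
		fmt.Printf("elapsed=%v delete=%v requeueAfter=%v\n", elapsed, del, wait)
	}
}
```

In a controller-runtime reconciler, the returned duration would typically be passed back as `ctrl.Result{RequeueAfter: requeueAfter}`, so no polling is needed between the finish time and the expiry.
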
- Add reconcileDeadline helper to terminate running TrainJobs that
  exceed their ActiveDeadlineSeconds.
- Handle job suspension: StartTime is determined by the LastTransitionTime
  of the Suspended=False condition (if it exists) to ensure the deadline
  timer resets correctly upon resume.
- Jobs exceeding the deadline are moved to a Failed state with
  reason DeadlineExceeded.

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
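
The suspension handling above is easier to follow with a small model. The sketch below is not the PR's `reconcileDeadline`: `Condition` is a minimal stand-in for `metav1.Condition`, `deadlineStart`, `checkDeadline`, and `epoch` are hypothetical names, and two behaviors are assumptions of this sketch: falling back to the creation time when no Suspended=False condition exists, and not counting time while the job is currently suspended.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a minimal stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// epoch is an arbitrary fixed instant used by the demo in main.
var epoch = time.Date(2026, time.February, 25, 18, 0, 0, 0, time.UTC)

// deadlineStart returns the instant the deadline timer counts from: the
// LastTransitionTime of the Suspended=False condition if present, so that a
// resume restarts the timer.
func deadlineStart(createdAt time.Time, conds []Condition) time.Time {
	for _, c := range conds {
		if c.Type == "Suspended" && c.Status == "False" {
			return c.LastTransitionTime
		}
	}
	return createdAt // assumption: never-suspended jobs count from creation
}

// checkDeadline reports whether the job has used up its budget; if not, it
// returns how long until the deadline so the caller can requeue.
func checkDeadline(createdAt time.Time, conds []Condition, deadline *int64, now time.Time) (exceeded bool, requeueAfter time.Duration) {
	if deadline == nil {
		return false, 0
	}
	for _, c := range conds {
		if c.Type == "Suspended" && c.Status == "True" {
			return false, 0 // assumption: the timer does not run while suspended
		}
	}
	limit := deadlineStart(createdAt, conds).Add(time.Duration(*deadline) * time.Second)
	if !now.Before(limit) {
		return true, 0
	}
	return false, limit.Sub(now)
}

func main() {
	deadline := int64(300) // five-minute budget
	// The job was created at epoch and resumed ten minutes later.
	conds := []Condition{{Type: "Suspended", Status: "False", LastTransitionTime: epoch.Add(10 * time.Minute)}}
	for _, at := range []time.Duration{12 * time.Minute, 16 * time.Minute} {
		exceeded, wait := checkDeadline(epoch, conds, &deadline, epoch.Add(at))
		fmt.Printf("t+%v exceeded=%v requeueAfter=%v\n", at, exceeded, wait)
	}
}
```

Counting from the resume time rather than the creation time is what keeps a job that sat suspended in a queue from failing with `DeadlineExceeded` the moment it starts.
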
- Add an admission warning to both the TrainingRuntime and
  ClusterTrainingRuntime webhooks when TTLSecondsAfterFinished
  is set to less than 60 seconds, since such aggressive cleanup
  may cause unexpected data/log loss for data scientists.

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
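
A minimal sketch of the short-TTL warning follows. The 60-second threshold comes from the commit message; `ttlWarnings`, `minRecommendedTTLSeconds`, and the warning text are hypothetical, and the local `Warnings` type only mirrors the shape of controller-runtime's `admission.Warnings` (a `[]string`) so that the example stays dependency-free.

```go
package main

import "fmt"

// Warnings mirrors the shape of controller-runtime's admission.Warnings.
type Warnings []string

// minRecommendedTTLSeconds is the threshold named in the commit message.
const minRecommendedTTLSeconds = 60

// ttlWarnings returns a warning (not an error) when the TTL is set but
// shorter than the threshold; the request itself is still admitted.
func ttlWarnings(ttl *int32) Warnings {
	if ttl == nil || *ttl >= minRecommendedTTLSeconds {
		return nil
	}
	return Warnings{fmt.Sprintf(
		"ttlSecondsAfterFinished is %d, below %d seconds; finished TrainJobs may be deleted before logs can be collected",
		*ttl, minRecommendedTTLSeconds)}
}

func main() {
	short := int32(10)
	for _, w := range ttlWarnings(&short) {
		fmt.Println("warning:", w)
	}
}
```

Returning a warning instead of rejecting the object keeps very short TTLs available for cases such as tests, while still flagging the risk to whoever applies the runtime.
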
- Add integration tests for DeadlineExceeded behavior in TrainJob controller
- Add integration tests for TTL validation and warnings in webhooks

Part of KEP-2899

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…gress

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…queue

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…ations and associated validation.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Feb 26, 2026
…Runtime

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Development

Successfully merging this pull request may close these issues.

TTL for TrainJobs

2 participants