feat(runtimes): Add XGBoost runtime(KEP-2598) by Krishna-kg732 · Pull Request #3200 · kubeflow/trainer

Krishna-kg732 · 2026-02-12T04:24:45Z

What this PR does

Implements the XGBoost runtime plugin for Kubeflow Trainer V2, as proposed in KEP-2598. This plugin enables distributed XGBoost training using Rabit/Collective coordination by automatically injecting DMLC environment variables into trainer containers.

Changes

New Files

pkg/runtime/framework/plugins/xgboost/xgboost.go — Plugin implementing EnforceMLPolicyPlugin and CustomValidationPlugin. Injects DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER env vars and auto-derives numWorkersPerNode from GPU resources (1 worker per GPU, or 1 per node for CPU).
pkg/runtime/framework/plugins/xgboost/xgboost_test.go — Unit tests covering EnforceMLPolicy (nil guards, single/multi-node CPU, GPU resources, numNodes override) and Validate (reserved DMLC_* env name rejection).
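
The worker-derivation rule and the four injected variables described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the `EnvVar` type, the `workersPerNode` and `dmlcEnv` helpers, and their parameters are hypothetical, and computing `DMLC_NUM_WORKER` as nodes times workers per node is an assumption drawn from the "1 worker per GPU, or 1 per node for CPU" description.

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar is a simplified, hypothetical stand-in for a container env entry,
// used so that the sketch has no Kubernetes dependencies.
type EnvVar struct {
	Name  string
	Value string
}

// workersPerNode applies the rule from the PR description:
// one worker per GPU, or a single worker per node for CPU-only training.
func workersPerNode(gpusPerNode int) int {
	if gpusPerNode > 0 {
		return gpusPerNode
	}
	return 1
}

// dmlcEnv builds the four DMLC_* variables named in the PR description.
// All inputs are hypothetical; how the real plugin resolves the tracker
// address and the task ID is not shown in this thread.
func dmlcEnv(trackerURI string, trackerPort, numNodes, gpusPerNode, taskID int) []EnvVar {
	// Assumption: total workers = nodes * workers per node.
	total := numNodes * workersPerNode(gpusPerNode)
	return []EnvVar{
		{Name: "DMLC_TRACKER_URI", Value: trackerURI},
		{Name: "DMLC_TRACKER_PORT", Value: strconv.Itoa(trackerPort)},
		{Name: "DMLC_TASK_ID", Value: strconv.Itoa(taskID)},
		{Name: "DMLC_NUM_WORKER", Value: strconv.Itoa(total)},
	}
}

func main() {
	// Two nodes with four GPUs each, viewed from task 0.
	for _, e := range dmlcEnv("tracker.example", 9091, 2, 4, 0) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

With `gpusPerNode` set to 0 the helper falls back to one worker per node, which matches the CPU case in the description.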

Modified Files

pkg/apis/trainer/v1alpha1/trainingruntime_types.go — Added XGBoostMLPolicySource struct, XGBoost field to MLPolicySource, and updated CEL mutual exclusion validation rule.
pkg/constants/constants.go — Added XGBoost/Rabit constants and XGBoostReservedEnvNames set.
pkg/runtime/framework/plugins/registry.go — Registered the XGBoost plugin.
pkg/runtime/framework/plugins/plainml/plainml.go — Added XGBoost to the PlainML fallback guard.
pkg/runtime/framework/core/framework_test.go — Updated TestNew to include XGBoost in expected plugin lists.
pkg/util/testing/wrapper.go — Added XGBoostPolicy() test helper.
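
The reserved-name check mentioned for Validate could look roughly like the sketch below. It is an illustration under stated assumptions: `reservedEnvNames` is a local stand-in for the `XGBoostReservedEnvNames` set, `findReservedEnv` is a hypothetical helper, and the error wording is invented; the actual plugin reports errors through a `field.ErrorList`.

```go
package main

import "fmt"

// reservedEnvNames is a local stand-in for the reserved set added in
// pkg/constants/constants.go; it lists the four DMLC_* names from the PR.
var reservedEnvNames = map[string]struct{}{
	"DMLC_TRACKER_URI":  {},
	"DMLC_TRACKER_PORT": {},
	"DMLC_TASK_ID":      {},
	"DMLC_NUM_WORKER":   {},
}

// findReservedEnv (hypothetical) returns one error per user-supplied env
// name that collides with a reserved name, including the index of the
// offending entry so the user can locate it in spec.trainer.env.
func findReservedEnv(names []string) []error {
	var errs []error
	for i, name := range names {
		if _, reserved := reservedEnvNames[name]; reserved {
			errs = append(errs, fmt.Errorf("spec.trainer.env[%d]: %s is reserved for the XGBoost runtime", i, name))
		}
	}
	return errs
}

func main() {
	for _, err := range findReservedEnv([]string{"MY_FLAG", "DMLC_TASK_ID"}) {
		fmt.Println(err)
	}
}
```

Keeping the reserved names in a set makes the check a single map lookup per entry and keeps the list in one place for both injection and validation.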

How was this tested?

go test ./pkg/runtime/framework/plugins/xgboost/... ✅ (9 test cases)
go test ./pkg/runtime/framework/core/ -run TestNew ✅
go test ./pkg/runtime/framework/plugins/... ✅ (all plugins pass)

Closes #2598

TODO (follow-up PRs)

Add E2E tests
Add ClusterTrainingRuntime YAML manifests
Add example notebook

/kind feature
/area runtime

google-oss-prow · 2026-02-12T04:24:50Z

@Krishna-kg732: The label(s) area/runtime cannot be applied, because the repository doesn't have them.

Details

In response to this:

What this PR does

Adds the XGBoost runtime plugin scaffold to the Trainer V2 framework. This is the foundational PR for KEP-2598: XGBoost Runtime — it introduces the plugin structure and API types without the full implementation, which will follow in a subsequent PR.

Changes

New Files

pkg/runtime/framework/plugins/xgboost/xgboost.go — Plugin scaffold implementing EnforceMLPolicyPlugin with a stub EnforceMLPolicy (Rabit env injection TODO)

Modified Files

pkg/apis/trainer/v1alpha1/trainingruntime_types.go — Added XGBoostMLPolicySource struct and XGBoost field to MLPolicySource

pkg/constants/constants.go — Added XGBoost/Rabit constants (DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER) and reserved env set

pkg/runtime/framework/plugins/registry.go — Registered the XGBoost plugin

pkg/runtime/framework/plugins/plainml/plainml.go — Added XGBoost to the PlainML fallback guard

What's NOT in this PR (intentionally)

EnforceMLPolicy implementation (Rabit env var injection) — will be in a follow-up PR

Unit tests and E2E tests — will accompany the implementation PR

ClusterTrainingRuntime YAML manifests

How was this tested?

go build ./pkg/runtime/framework/plugins/... ✅

go vet ./pkg/runtime/framework/plugins/xgboost/... ✅

/kind feature
/area runtime

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow · 2026-02-12T04:24:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-02-12T04:24:56Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

Adds an initial XGBoost runtime plugin scaffold to the Trainer V2 runtime framework (per KEP-2598), along with the API wiring and constants needed to support a future Rabit env var injection implementation.

Changes:

Introduces an xgboost runtime plugin scaffold implementing EnforceMLPolicyPlugin (stubbed behavior for now).
Extends the TrainingRuntime API (MLPolicySource) with an xgboost policy source and updates the “only one policy” validation rule.
Adds XGBoost/Rabit-related env var constants and registers the plugin in the runtime plugin registry (and updates PlainML fallback guard).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/runtime/framework/plugins/xgboost/xgboost.go | New XGBoost plugin scaffold (EnforceMLPolicy stub + plugin name/factory). |
| pkg/runtime/framework/plugins/registry.go | Registers the XGBoost plugin in the plugin factory registry. |
| pkg/runtime/framework/plugins/plainml/plainml.go | Ensures PlainML no-ops when XGBoost (and JAX) ML policy sources are configured. |
| pkg/constants/constants.go | Adds Rabit/XGBoost env var constants + reserved env name set. |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds `XGBoostMLPolicySource` + `MLPolicySource.XGBoost`, and updates ML policy exclusivity validation. |

pkg/runtime/framework/plugins/xgboost/xgboost.go

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

coveralls · 2026-02-16T13:12:15Z

Pull Request Test Coverage Report for Build 22090812203

Details

77 of 84 (91.67%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+1.2%) to 57.148%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/runtime/framework/plugins/registry.go | 0 | 1 | 0.0% |
| pkg/runtime/framework/plugins/xgboost/xgboost.go | 72 | 78 | 92.31% |

Totals
Change from base Build 22081023611:	1.2%
Covered Lines:	1467
Relevant Lines:	2567

💛 - Coveralls

akshaychitneni · 2026-02-17T00:35:35Z

/lgtm
Thanks @Krishna-kg732

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

google-oss-prow · 2026-02-17T08:15:05Z

New changes are detected. LGTM label has been removed.

andreyvelich

Thank you for this work @Krishna-kg732!
Overall looks great, I left a few comments.
cc @kubeflow/kubeflow-trainer-team

pkg/runtime/framework/plugins/xgboost/xgboost_test.go

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

andreyvelich · 2026-02-18T02:06:58Z

pkg/apis/trainer/v1alpha1/trainingruntime_types.go

+// XGBoostMLPolicySource represents an XGBoost runtime configuration.
+// The number of workers per node is automatically derived from container GPU resources:
+//   - GPU training: 1 worker per GPU (from resourcesPerNode)
+//   - CPU training: 1 worker per node


Can you clarify that, with XGBoost, a single worker still consumes all CPU cores?
Ref: #3118 (comment)
cc @trivialfis

pkg/runtime/framework/plugins/xgboost/xgboost.go

andreyvelich · 2026-02-18T02:17:51Z

pkg/runtime/framework/plugins/xgboost/xgboost_test.go

+	}
+}
+
+func TestXGBoostValidate(t *testing.T) {


Please move this Test to the top of the file.

andreyvelich · 2026-02-18T02:21:21Z

pkg/runtime/framework/plugins/xgboost/xgboost_test.go

+	utiltesting "github.com/kubeflow/trainer/v2/pkg/util/testing"
+)
+
+func TestXGBoostEnforceMLPolicy(t *testing.T) {


@tenzen-y @astefanutti @kaisoz Shall we rename the JAX and Torch unit tests in the same way?
e.g. TestJAXEnforceMLPolicy: https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/pkg/runtime/framework/plugins/jax/jax_test.go#L40

andreyvelich · 2026-02-18T02:23:08Z

pkg/runtime/framework/plugins/xgboost/xgboost_test.go

@@ -0,0 +1,467 @@
+/*


Please also add:

Integration test. Check: https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/test/integration/controller/trainjob_controller_test.go#L1408

E2E tests. Check: https://github.com/Krishna-kg732/trainer/blob/dc135be8b1428ac8145102a4a255826c9490a4e9/test/e2e/e2e_test.go#L184

Example Notebook with XGBoost training.

pkg/runtime/framework/plugins/xgboost/xgboost.go

- Add test case for resources set in Runtime only
- Add test case for resources set in both Runtime and TrainJob
- Add xgboost_distributed.yaml ClusterTrainingRuntime manifest
- Update kustomization.yaml with xgboost-runtime image

Addresses feedback from PR kubeflow#3200 for complete resource resolution test coverage.

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

astefanutti · 2026-02-23T08:46:53Z

/retest

google-oss-prow · 2026-02-23T09:04:01Z

@Krishna-kg732: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest

astefanutti · 2026-02-23T09:08:02Z

pkg/runtime/framework/plugins/xgboost/xgboost.go

+	var allErrs field.ErrorList
+	if newObj.Spec.Trainer != nil {
+		specPath := field.NewPath("spec", "trainer", "env")
+		for i, env := range newObj.Spec.Trainer.Env {


Would we also want to validate those reserved environment variables are not set in PodTemplateOverrides?

Krishna-kg732 · Feb 25, 2026

Hey @astefanutti, since the ContainerOverride.Env docs already restrict setting envs for the node container via PodTemplateOverrides, this seems like an improvement for all runtimes rather than an XGBoost-specific one. Happy to address it as a follow-up if needed.

cc: @andreyvelich

andreyvelich · Feb 25, 2026

Yes, users cannot set the node container env via TemplateOverrides, since we apply the merge before we trigger the plugins and construct the final JobSet: https://github.com/Krishna-kg732/krishna-kg732-trainer/blob/a01c5bfabe422d613687e859c5ca231a012a8679/pkg/runtime/core/trainingruntime.go#L142

I think that after this change: #3199, we should significantly improve the validation for the TemplateOverride API. Right now, it’s not clear to users what behavior to expect or how their overrides are being applied.
cc @kaisoz @tenzen-y

astefanutti · 2026-02-23T09:10:17Z

@Krishna-kg732 can you run go lint please?

andreyvelich · 2026-02-24T21:17:09Z

@Krishna-kg732 You might need to rebase this PR to fix conflicts.

astefanutti · 2026-02-26T11:14:51Z

/retest

Copilot AI review requested due to automatic review settings February 12, 2026 04:24

google-oss-prow bot added the kind/feature label Feb 12, 2026

google-oss-prow bot requested review from akshaychitneni and kuizhiqing February 12, 2026 04:24

google-oss-prow bot added the size/M label Feb 12, 2026

Copilot started reviewing on behalf of Krishna-kg732 February 12, 2026 04:25

Krishna-kg732 changed the title ~~feat(runtime): Add XGBoost runtime plugin scaffold (KEP-2598)~~ feat(runtime): Add XGBoost runtime(KEP-2598) Feb 12, 2026

Krishna-kg732 changed the title ~~feat(runtime): Add XGBoost runtime(KEP-2598)~~ feat(runtimes): Add XGBoost runtime(KEP-2598) Feb 12, 2026

Copilot AI reviewed Feb 12, 2026

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 729c8be to 49c768a Compare February 12, 2026 04:33

google-oss-prow bot added size/L and removed size/M labels Feb 14, 2026

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch 2 times, most recently from 985eaf4 to e5c552e Compare February 14, 2026 05:10

google-oss-prow bot added size/XL and removed size/L labels Feb 16, 2026

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch 2 times, most recently from 7ec359f to 38e1f5a Compare February 16, 2026 13:29

google-oss-prow bot assigned akshaychitneni Feb 17, 2026

google-oss-prow bot added the lgtm label Feb 17, 2026

akshaychitneni reviewed Feb 17, 2026

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 38e1f5a to dc135be Compare February 17, 2026 08:15

google-oss-prow bot removed the lgtm label Feb 17, 2026

andreyvelich reviewed Feb 18, 2026

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 8e0bdd8 to abc5d1c Compare February 23, 2026 06:11

astefanutti reviewed Feb 23, 2026

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch 3 times, most recently from c1e2de7 to 5b24ec0 Compare February 23, 2026 09:27

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 5b24ec0 to 169a14d Compare February 24, 2026 03:45

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch 2 times, most recently from 34221a9 to 169a14d Compare February 25, 2026 05:17

Krishna-kg732 added 11 commits February 25, 2026 13:10

Add XGBoost runtime plugin scaffold and register in framework

637d177

Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in> Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

regenerated assets

08c28b5

Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in> Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

Add XGBoost plugin unit tests and update framework test registry

6504ac8

Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in> Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

added :Runtime image , unit tests for :(Resources are not set, Resour…

0de6c22

…ces are set in TrainJob) ,using the ContainerTrainerPort instead of default port Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

minor changes

8e9e5e6

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

added plainML fallback(JAX+XGBOOST)

4099e6b

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

added autogenerat3ed files

7c7124f

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

CI fixes

0ccf81d

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

chore: fix ci lint errors

3f5b2dc

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

Add XGBoost integration and E2E tests, fix Validate guard clause and …

a01c5bf

…nil safety Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch 2 times, most recently from 394007f to 5824b65 Compare February 26, 2026 09:35

trigger ci

b91021a

Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>

Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 5824b65 to b91021a Compare February 26, 2026 09:43
