
feat: Support K8s DRA Resources V1 APIs #596

Open
adityasingh0510 wants to merge 1 commit into NVIDIA:main from adityasingh0510:feature/k8s-v1-resource-api-support


@adityasingh0510 adityasingh0510 commented Dec 8, 2025

This PR updates dcgm-exporter to work with both the stable resource.k8s.io/v1 API and the v1beta1 API for Dynamic Resource Allocation (DRA). It stays compatible with Kubernetes 1.34+ clusters (using v1) and older clusters (using v1beta1), detecting the available version automatically and falling back gracefully.

Problem

When enabling DRA labels in dcgm-exporter on Kubernetes 1.34+ clusters, the following error occurs:

failed to list v1beta1.ResourceSlice as we have v1.ResourceSlice

This happens because:

  • Kubernetes 1.34+ promotes the ResourceSlice API from v1beta1 to stable v1
  • Clusters may only expose the v1 API, breaking code that only uses v1beta1
  • Older clusters (1.32-1.33) still use v1beta1, which was introduced in Kubernetes 1.32, so we need to support both

Changes

Files Modified

  • internal/pkg/transformation/dra.go:

    • Register both v1 and v1beta1 ResourceSlice informers
    • Implement separate event handlers for each API version:
      • onAddOrUpdateV1() / onAddOrUpdateV1beta1()
      • onDeleteV1() / onDeleteV1beta1()
    • Add cache checking in delete handlers to prevent premature device removal
    • Handle API structure differences:
      • v1beta1: dev.Basic.Attributes
      • v1: dev.Attributes (direct access, no Basic wrapper)
  • internal/pkg/transformation/types.go:

    • Add v1Informer and v1beta1Informer fields to DRAResourceSliceManager struct
  • go.mod / go.sum:

    • Upgrade k8s.io/api: v0.33.3 → v0.34.0 (adds support for resource/v1)
    • Upgrade k8s.io/client-go: v0.33.3 → v0.34.0 (ensures compatibility)
    • Upgrade k8s.io/apimachinery: v0.33.3 → v0.34.0

API Structure Changes

The v1 API has a different structure than v1beta1:

| API Version | Device Attribute Access |
|-------------|-------------------------|
| v1beta1     | dev.Basic.Attributes    |
| v1          | dev.Attributes (direct) |

The implementation handles both structures correctly.
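
To make the structural difference concrete, here is a minimal sketch of reading attributes from both shapes. The types are simplified stand-ins invented for illustration, not the real k8s.io/api resource types (which key attributes by qualified name and carry typed values); only the layout, a Basic wrapper in v1beta1 versus direct Attributes in v1, mirrors the table above. The helper names are hypothetical and are not taken from dra.go.

```go
package main

import "fmt"

// basicDevice is a stand-in for the v1beta1 wrapper (assumption: simplified
// to plain string attributes for illustration).
type basicDevice struct {
	Attributes map[string]string
}

// v1beta1Device nests its attributes under the Basic wrapper.
type v1beta1Device struct {
	Name  string
	Basic *basicDevice
}

// v1Device exposes its attributes directly, with no wrapper.
type v1Device struct {
	Name       string
	Attributes map[string]string
}

// attrsFromV1beta1 returns the attributes of a v1beta1-style device and
// tolerates a nil Basic wrapper instead of panicking.
func attrsFromV1beta1(dev v1beta1Device) map[string]string {
	if dev.Basic == nil {
		return nil
	}
	return dev.Basic.Attributes
}

// attrsFromV1 returns the attributes of a v1-style device.
func attrsFromV1(dev v1Device) map[string]string {
	return dev.Attributes
}

func main() {
	old := v1beta1Device{
		Name:  "gpu-0",
		Basic: &basicDevice{Attributes: map[string]string{"uuid": "GPU-aaaa"}},
	}
	cur := v1Device{
		Name:       "gpu-0",
		Attributes: map[string]string{"uuid": "GPU-aaaa"},
	}
	// Both shapes yield the same attribute value once unwrapped.
	fmt.Println(attrsFromV1beta1(old)["uuid"] == attrsFromV1(cur)["uuid"]) // prints "true"
}
```

Funnelling both versions through small accessors like these would let the rest of the label-mapping code stay version-agnostic.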

Behavior

Automatic API Detection

The code registers both informers and uses whichever is available:

// Both informers are registered
v1Informer := factory.Resource().V1().ResourceSlices().Informer()
v1beta1Informer := factory.Resource().V1beta1().ResourceSlices().Informer()

// At least one must sync successfully
v1Synced := cache.WaitForCacheSync(ctx.Done(), v1Informer.HasSynced)
v1beta1Synced := cache.WaitForCacheSync(ctx.Done(), v1beta1Informer.HasSynced)

Precedence Logic

When both APIs are available:

  • v1 takes precedence: v1beta1 only adds devices if v1 doesn't already have them
  • Delete protection: Before deleting, handlers check if the device exists in the other API's cache
  • No duplicate entries: Precedence logic ensures each device is only tracked once
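
The precedence and delete-protection rules can be sketched with a small in-memory tracker. This is an illustration under stated assumptions, not the code in dra.go: the PR describes handlers consulting the other informer's cache, whereas this sketch records per-device version membership in a single map, and every name in it (tracker, add, remove, apiVersion) is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

type apiVersion string

const (
	versionV1      apiVersion = "v1"
	versionV1beta1 apiVersion = "v1beta1"
)

// tracker records which API versions currently report each device.
// The mutex matters because handlers of separate informers can run concurrently.
type tracker struct {
	mu   sync.Mutex
	seen map[string]map[apiVersion]bool
}

func newTracker() *tracker {
	return &tracker{seen: map[string]map[apiVersion]bool{}}
}

// add records that version v reports the device and returns the version whose
// data should be used: v1 when present, otherwise v1beta1.
func (t *tracker) add(device string, v apiVersion) apiVersion {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.seen[device] == nil {
		t.seen[device] = map[apiVersion]bool{}
	}
	t.seen[device][v] = true
	if t.seen[device][versionV1] {
		return versionV1
	}
	return versionV1beta1
}

// remove drops version v for the device and reports whether the device should
// be removed entirely, which is only the case once no version reports it.
func (t *tracker) remove(device string, v apiVersion) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.seen[device], v) // deleting from a nil map is a no-op
	if len(t.seen[device]) == 0 {
		delete(t.seen, device)
		return true
	}
	return false
}

func main() {
	t := newTracker()
	fmt.Println(t.add("gpu-0", versionV1beta1))    // v1beta1
	fmt.Println(t.add("gpu-0", versionV1))         // v1
	fmt.Println(t.remove("gpu-0", versionV1))      // false
	fmt.Println(t.remove("gpu-0", versionV1beta1)) // true
}
```

Because each device name maps to exactly one entry, a device reported by both API versions is tracked once, and a delete event from one version leaves it in place while the other version still reports it.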

Testing

Verification

  • Code compiles successfully with both API versions
  • All tests pass: existing unit tests continue to work
  • No linter errors
  • v1 API support: verified with Kubernetes 1.34+ API structure
  • v1beta1 API support: verified with Kubernetes 1.32-1.33 API structure
  • Dual API handling: both informers work correctly when both are available
  • Precedence logic: v1 correctly takes precedence over v1beta1
  • Delete handling: race conditions prevented with cache checking

Test Scenarios

  • Kubernetes 1.34+ clusters (v1 API only)
  • Kubernetes 1.32-1.33 clusters (v1beta1 API only)
  • Clusters with both APIs available (migration periods)
  • MIG devices work with both API versions

Backward Compatibility

Fully backward compatible:

  • Existing deployments on Kubernetes 1.32-1.33 continue to work unchanged
  • No breaking changes for any supported Kubernetes version
  • No configuration changes required

Forward compatible:

  • Ready for Kubernetes 1.34+ clusters
  • Automatically uses the best available API version

Breaking Changes

None - This is a backward and forward compatibility enhancement. The change:

  • Works on older clusters (1.32-1.33) using v1beta1
  • Works on newer clusters (1.34+) using v1
  • Works during migration periods when both are available
  • Requires no configuration changes

Related Issues

@adityasingh0510 adityasingh0510 force-pushed the feature/k8s-v1-resource-api-support branch from c179153 to 2d09218 Compare December 8, 2025 09:40
@adityasingh0510 adityasingh0510 changed the title from "feat: Add support for Kubernetes v1.34 resource.k8s.io/v1 APIs" to "feat: Add dual API support for ResourceSlice (v1 and v1beta1)" Dec 8, 2025
@adityasingh0510 adityasingh0510 changed the title from "feat: Add dual API support for ResourceSlice (v1 and v1beta1)" to "feat: Support K8s DRA Resources V1 APIs" Dec 8, 2025

Development

Successfully merging this pull request may close these issues.

Add support for K8s v1.34 resource.k8s.io/v1 DRA APIs
