support enable share gpu for reclaimed #12

Open
luomingmeng wants to merge 52 commits into JustinChengLZ:dev/support-gpu-plugins from luomingmeng:dev/support-enable-share-gpu-for-reclaimed

Conversation

@luomingmeng

What type of PR is this?

Features

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from fafce64 to 198035f Compare November 24, 2025 06:31
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 9 times, most recently from 9c9800a to c84f8f7 Compare December 1, 2025 08:56
@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch 2 times, most recently from 5046660 to 6d4961a Compare December 9, 2025 21:53
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 2 times, most recently from 4dfcbf5 to 8176c42 Compare December 17, 2025 02:12
@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from 541bb02 to d5d22b1 Compare December 25, 2025 06:34
luomingmeng and others added 15 commits December 25, 2025 15:01
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding a new stub function type and a default implementation
- Extending the Stub struct with a new field
- Adding new methods for associated device operations
…icy structs

Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes
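For reviewers unfamiliar with the pattern, here is a minimal sketch of why embedding an "unimplemented" server type gives forward compatibility. The interface, its two methods, and the string-based signatures are hypothetical stand-ins, not the real pluginapi definitions; only the embedding technique is the point.

```go
package main

import "fmt"

// ResourcePluginServer is a hypothetical stand-in for a generated gRPC
// service interface. The real pluginapi interface has different methods.
type ResourcePluginServer interface {
	Allocate(req string) (string, error)
	GetAssociatedDeviceTopologyHints(req string) (string, error)
}

// UnimplementedResourcePluginServer mimics the generated fallback type:
// every method exists and returns a "not implemented" error.
type UnimplementedResourcePluginServer struct{}

func (UnimplementedResourcePluginServer) Allocate(string) (string, error) {
	return "", fmt.Errorf("method Allocate not implemented")
}

func (UnimplementedResourcePluginServer) GetAssociatedDeviceTopologyHints(string) (string, error) {
	return "", fmt.Errorf("method GetAssociatedDeviceTopologyHints not implemented")
}

// StaticPolicy embeds the fallback and overrides only what it supports.
// Methods added to the interface later are still satisfied through the
// embedded type, so the struct keeps compiling.
type StaticPolicy struct {
	UnimplementedResourcePluginServer
}

func (p *StaticPolicy) Allocate(req string) (string, error) {
	return "allocated:" + req, nil
}

func main() {
	var srv ResourcePluginServer = &StaticPolicy{}

	out, _ := srv.Allocate("gpu-0")
	fmt.Println(out) // allocated:gpu-0

	_, err := srv.GetAssociatedDeviceTopologyHints("gpu-0")
	fmt.Println(err) // method GetAssociatedDeviceTopologyHints not implemented
}
```

The trade-off is that a missing override surfaces as a runtime error rather than a compile error, so tests should cover every method a policy is expected to serve.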
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
… allocated memory

Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
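A rough sketch of the ordering described in this commit message, assuming the per-NUMA tracking is a map from node ID to allocated GPU memory (float64, as in the earlier type change). The function name preferredNUMANodes and the tie-break on node ID are illustrative assumptions, not taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// preferredNUMANodes returns NUMA node IDs ordered by allocated GPU memory,
// largest first. Ties are broken by the lower node ID so the result is
// deterministic (the tie-break rule is an assumption for this sketch).
func preferredNUMANodes(allocated map[int]float64) []int {
	nodes := make([]int, 0, len(allocated))
	for id := range allocated {
		nodes = append(nodes, id)
	}
	sort.Slice(nodes, func(i, j int) bool {
		a, b := allocated[nodes[i]], allocated[nodes[j]]
		if a != b {
			return a > b
		}
		return nodes[i] < nodes[j]
	})
	return nodes
}

func main() {
	allocated := map[int]float64{0: 8.0, 1: 24.5, 2: 8.0}
	fmt.Println(preferredNUMANodes(allocated)) // [1 0 2]
}
```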
chore: add unit tests

chore: add unit tests

chore: add unit tests

chore: add unit tests
…lugins

feat: introduce rdma state and allow states to share within gpu sub-plugins

feat: introduce rdma state and allow states to share within gpu sub-plugins
…ompany resource allocation

feat: implement rdma custom device plugin and implement logic for accompany resource allocation
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 6 times, most recently from e2365c2 to 21b097a Compare January 2, 2026 06:33
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 9 times, most recently from eb87c70 to 33b37df Compare January 7, 2026 07:10
Add CheckReclaimed condition to skip reclaimed containers when evaluating device share status
Add test case to verify reclaimed containers are ignored
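One possible reading of the CheckReclaimed condition, sketched under assumptions: a device counts as shareable unless a container on it asked for exclusive use, and reclaimed containers are skipped when the check is enabled. The container struct, its fields, and deviceShareable are hypothetical names for illustration only.

```go
package main

import "fmt"

// container is a hypothetical record of a container placed on a device.
type container struct {
	Name      string
	Exclusive bool // container asked for a non-shared device (assumption)
	Reclaimed bool // container runs as a reclaimed workload
}

// deviceShareable reports whether the device can still be shared.
// With checkReclaimed set, reclaimed containers are ignored entirely.
func deviceShareable(containers []container, checkReclaimed bool) bool {
	for _, c := range containers {
		if checkReclaimed && c.Reclaimed {
			continue // reclaimed containers do not affect share status
		}
		if c.Exclusive {
			return false
		}
	}
	return true
}

func main() {
	onDevice := []container{
		{Name: "batch", Exclusive: true, Reclaimed: true},
		{Name: "online", Exclusive: false, Reclaimed: false},
	}
	fmt.Println(deviceShareable(onDevice, true))  // true
	fmt.Println(deviceShareable(onDevice, false)) // false
}
```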
Move aggregation of allocatable and capacity quantities after health check to ensure accurate totals for unhealthy or non-shared devices
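To illustrate why the ordering matters, here is a small sketch in which the health check adjusts a device's quantities before they are summed. The device struct and the rule that an unhealthy device keeps its capacity but reports zero allocatable are assumptions made for the example; the PR may apply a different adjustment.

```go
package main

import "fmt"

// device is a hypothetical record; the field names are assumptions.
type device struct {
	ID          string
	Healthy     bool
	Capacity    float64
	Allocatable float64
}

// aggregate applies the health check to each device first and only then
// adds the (possibly adjusted) quantities to the totals.
func aggregate(devices []device) (capacity, allocatable float64) {
	for _, d := range devices {
		if !d.Healthy {
			// Assumption: an unhealthy device keeps its capacity but
			// offers nothing to allocate.
			d.Allocatable = 0
		}
		capacity += d.Capacity
		allocatable += d.Allocatable
	}
	return capacity, allocatable
}

func main() {
	devices := []device{
		{ID: "gpu-0", Healthy: true, Capacity: 80, Allocatable: 80},
		{ID: "gpu-1", Healthy: false, Capacity: 80, Allocatable: 80},
	}
	c, a := aggregate(devices)
	fmt.Println(c, a) // 160 80
}
```

Summing before the health check would have reported 160 allocatable here, which overstates what can actually be scheduled.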
Add constant thresholdMetToleranceDurationForGPU to set a fixed 15-second tolerance duration for GPU resource eviction, replacing the dynamic configuration value.
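A minimal sketch of how a fixed tolerance might be applied. The constant name and the 15-second value come from the commit message; its type (time.Duration here) and the helper toleranceExceeded are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// thresholdMetToleranceDurationForGPU is the fixed tolerance named in the
// commit message. Declaring it as a time.Duration is an assumption.
const thresholdMetToleranceDurationForGPU = 15 * time.Second

// toleranceExceeded reports whether the eviction threshold has been met
// continuously for at least the fixed tolerance duration.
func toleranceExceeded(thresholdMetFor time.Duration) bool {
	return thresholdMetFor >= thresholdMetToleranceDurationForGPU
}

func main() {
	fmt.Println(toleranceExceeded(10 * time.Second)) // false
	fmt.Println(toleranceExceeded(15 * time.Second)) // true
}
```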
@luomingmeng luomingmeng force-pushed the dev/support-enable-share-gpu-for-reclaimed branch from 83e65e3 to b13e4d2 Compare January 14, 2026 12:08
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch from b3d9e26 to b1649fb Compare January 20, 2026 07:58
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 8 times, most recently from 438b7c0 to 26a3d4b Compare February 16, 2026 06:31
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch from 26a3d4b to b704e12 Compare February 24, 2026 03:51