fix(mem): correct GPU memory accounting (host vs container) and memory limits accordingly #153
loiht2 wants to merge 3 commits into Project-HAMi:main from
Conversation
…its accordingly Signed-off-by: Hoang Thanh Loi <loi.hoangthanh.24@gmail.com>
…its accordingly (updated) Signed-off-by: Hoang Thanh Loi <loi.hoangthanh.24@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: loiht2. The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing
…memory tracking

Add a dedicated memory_monitor_watcher thread that always runs (regardless of SM limit configuration) to populate monitorused[] with per-process NVML memory. Previously, monitorused[] was only updated inside utilization_watcher(), which only starts when SM limits are configured (0 < sm_limit < 100). The new thread queries nvmlDeviceGetComputeRunningProcesses every 1 second and writes usedGpuMemory into procs[].monitorused[dev]. This allows the external monitor to read real GPU memory (the nvidia-smi equivalent) directly from shared memory, without needing nvidia-smi or NVML bindings.
Thanks for your pull request. Before we can look at it, you'll need to add a DCO signoff to your commits. 📝 Please follow the instructions in the contributing guide to update your commits with the DCO. Full details of the Developer Certificate of Origin can be found at developercertificate.org. The list of commits missing DCO signoff:
Details: instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
GPU Memory Usage

In container

Command: nvidia-smi
(output screenshot omitted)

Command: nvidia-smi -a
(output screenshot omitted)

On host

Command: nvidia-smi
(output screenshot omitted)

Command: nvidia-smi -a
(output screenshot omitted)

GPU Memory Limit Enforcement (OOM scenario)
I ran the same pod, which requires ~2588 MiB of GPU memory. In this test, however, the ResourceClaim requests only 2 GiB (2048 MiB) of GPU memory, so the pod hits GPU OOM.
Pod log showing the OOM:
