fix(mem): correct GPU memory accounting (host vs container) and memory limits accordingly #153
loiht2 wants to merge 3 commits into Project-HAMi:main from
Conversation
…its accordingly Signed-off-by: Hoang Thanh Loi <loi.hoangthanh.24@gmail.com>
…its accordingly (updated) Signed-off-by: Hoang Thanh Loi <loi.hoangthanh.24@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: loiht2. The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing
…memory tracking

Add a dedicated memory_monitor_watcher thread that always runs (regardless of SM limit configuration) to populate monitorused[] with per-process NVML memory. Previously, monitorused[] was only updated inside utilization_watcher(), which only starts when SM limits are configured (0 < sm_limit < 100). The new thread queries nvmlDeviceGetComputeRunningProcesses every 1 second and writes usedGpuMemory into procs[].monitorused[dev]. This allows the external monitor to read real GPU memory (the nvidia-smi equivalent) directly from shared memory, without needing nvidia-smi or NVML bindings.
Thanks for your pull request. Before we can look at it, you'll need to add a DCO signoff to your commits. 📝 Please follow the instructions in the contributing guide to update your commits with the DCO. Full details of the Developer Certificate of Origin can be found at developercertificate.org. The list of commits missing DCO signoff:
Details: instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
GPU Memory Usage

In container

Command: nvidia-smi
(output screenshot omitted)

Command: nvidia-smi -a
(output screenshot omitted)

On host

Command: nvidia-smi
(output screenshot omitted)

Command: nvidia-smi -a
(output screenshot omitted)

GPU Memory Limit Enforcement (OOM scenario)
I ran the same pod, which requires ~2588 MiB of GPU memory. In this test, however, the ResourceClaim requests only 2 GiB (2048 MiB) of GPU memory, so the pod hits GPU OOM.
Pod log showing the OOM:
