Open
Description
How is this issue impacting you?
Application crash
NVSHMEM Version
devel @ 9cc869b
Your platform details
p5en.48xlarge (H200)
ubuntu24
driver: 575.57.08
cuda: 12.8.1
Application: Any collective perftest
Error Message & Behavior
NVSHMEM's error handling logic is broken: when there is an NVLink state issue on the instance, NVSHMEM attempts to allocate the default amount of memory for the symmetric heap, 256 GB (274877906944 bytes).
Turning off NVLS (NVSHMEM_DISABLE_NVLS=1) makes the issue go away.
NVSHMEM Error Logs:
/home/szegel/Nvshmem/src/host/team/team_internal_nvls.cpp:254: non-zero status: 401 cuMemMap failed to map 274877906944 bytes handle at address: 0x1320000000 offset 0 on device 7
/home/szegel/Nvshmem/src/host/mem/mem_heap.cpp:1573: non-zero status: 7 Mapping mem size 274877906944 to MC group 101602722380032 failed
/home/szegel/Nvshmem/src/host/team/team_internal.cpp:786: non-zero status: 7 Mapping multicast groups for UC heap failed for pe 7 team ID 1
dmesg on broken instance:
[1891044.581319] NVRM: nvCheckOkFailedNoLog: Check failed: NVLink fabric state cached by the driver is out of sync [NV_ERR_FABRIC_STATE_OUT_OF_SYNC] (0x00000087) returned from status @ mem_multicast_fabric.c:2917
I traced the state issue to upgrading libc6 from libc6=2.39-0ubuntu8.5 to libc6=2.39-0ubuntu8.6, and have opened a bug with NVIDIA. This bug report is about fixing the error handling logic in NVSHMEM: it should fail with an explicit error pointing to NVLink issues instead of trying to allocate more memory than the GPU has.
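To illustrate the kind of message I have in mind, here is a rough sketch, not NVSHMEM's actual code. The function name, constants, and wording are all hypothetical. It assumes that status 401 from the multicast mapping (CUDA_ERROR_ILLEGAL_STATE in the CUDA driver API, mirrored as a plain constant so the snippet builds without cuda.h) can be treated as a hint of fabric trouble, which the maintainers would need to confirm.

```cpp
#include <cstdint>
#include <string>

// Hypothetical constants: 401 is the status seen in the log above.
// In the CUDA driver API that value is CUDA_ERROR_ILLEGAL_STATE.
constexpr int kCuSuccess = 0;
constexpr int kCuErrorIllegalState = 401;

// Hypothetical helper: build an explicit diagnostic for a failed
// multicast mapping instead of only reporting the raw status code.
std::string describe_map_failure(int cu_status,
                                 std::uint64_t requested_bytes,
                                 std::uint64_t device_total_bytes) {
    if (cu_status == kCuSuccess) {
        return "";  // nothing to report
    }
    std::string msg = "multicast mapping of " +
                      std::to_string(requested_bytes) +
                      " bytes failed with status " +
                      std::to_string(cu_status);
    // Assumption: an illegal-state status here suggests an NVLink
    // fabric problem rather than a genuine out-of-memory condition.
    if (cu_status == kCuErrorIllegalState) {
        msg += "; possible NVLink fabric state problem, check dmesg for"
               " NVRM errors or retry with NVSHMEM_DISABLE_NVLS=1";
    }
    // Flag requests larger than the device so the size is not mistaken
    // for the root cause.
    if (requested_bytes > device_total_bytes) {
        msg += "; note: requested size exceeds device memory (" +
               std::to_string(device_total_bytes) + " bytes)";
    }
    return msg;
}
```

Called with the values from the log (status 401, 274877906944 bytes) and the device's total memory, this would produce a single line that names NVLink and points at the NVSHMEM_DISABLE_NVLS workaround, rather than three generic "non-zero status" lines.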