[Issue]: NVSHMEM Attempts To Allocate 256GB of Memory When NVLINK Driver Fabric State Is Out Of Sync #40

@a-szegel

How is this issue impacting you?

Application crash

NVSHMEM Version

devel @ 9cc869b

Your platform details

p5en.48xlarge (H200)
ubuntu24
driver: 575.57.08
cuda: 12.8.1
Application: Any collective perftest

Error Message & Behavior

NVSHMEM's error handling logic is broken: when there is an NVLink state issue on the instance, NVSHMEM attempts to allocate the default size of the symmetric heap, 256 GB (274877906944 bytes in the logs below).

Disabling NVLS with NVSHMEM_DISABLE_NVLS=1 makes the issue go away.

NVSHMEM Error Logs:

/home/szegel/Nvshmem/src/host/team/team_internal_nvls.cpp:254: non-zero status: 401 cuMemMap failed to map 274877906944 bytes handle at address: 0x1320000000 offset 0 on device 7
/home/szegel/Nvshmem/src/host/mem/mem_heap.cpp:1573: non-zero status: 7 Mapping mem size 274877906944 to MC group 101602722380032 failed 
/home/szegel/Nvshmem/src/host/team/team_internal.cpp:786: non-zero status: 7 Mapping multicast groups for UC heap failed for pe 7 team ID 1
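
As a quick sanity check on the size reported in the log lines above, the snippet below converts the byte count to GiB. It is only an arithmetic sketch; `bytes_to_gib` and the constant names are made up for illustration and are not part of NVSHMEM.

```cpp
#include <cstdint>

// Hypothetical helper names, used only for this illustration.
constexpr std::uint64_t kGiB = 1ULL << 30;

// Integer conversion from bytes to whole GiB (truncates any remainder).
constexpr std::uint64_t bytes_to_gib(std::uint64_t bytes) {
    return bytes / kGiB;
}

// Size taken verbatim from the cuMemMap failure in the log above.
constexpr std::uint64_t kRequestedBytes = 274877906944ULL;

// 274877906944 bytes is exactly 2^38 bytes, i.e. 256 GiB.
static_assert(kRequestedBytes == (1ULL << 38), "request is 2^38 bytes");
static_assert(bytes_to_gib(kRequestedBytes) == 256, "request is 256 GiB");
```

So the mapping request on device 7 is for the full 256 GiB, which matches the default symmetric heap size described above.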

dmesg on broken instance:

[1891044.581319] NVRM: nvCheckOkFailedNoLog: Check failed: NVLink fabric state cached by the driver is out of sync [NV_ERR_FABRIC_STATE_OUT_OF_SYNC] (0x00000087) returned from status @ mem_multicast_fabric.c:2917

I traced the state issue to upgrading libc6 from libc6=2.39-0ubuntu8.5 to libc6=2.39-0ubuntu8.6, and have opened a bug with NVIDIA. This bug report is about fixing the error handling logic in NVSHMEM: it should fail with an explicit error pointing to the NVLink issue instead of trying to allocate more memory than the GPU has.
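
To make the requested behavior concrete, here is a minimal sketch of the kind of pre-check I have in mind. It is not NVSHMEM code: `MapRequest`, `MapCheck`, `check_map_request`, and `describe` are hypothetical names, and the fabric-state flag and device memory size are passed in as plain values (in a real implementation they would presumably come from the driver, which I have not looked into).

```cpp
#include <cstdint>
#include <string>

// Hypothetical result codes; NVSHMEM's real status codes are not modeled here.
enum class MapCheck { Ok, FabricOutOfSync, ExceedsDeviceMemory };

// Hypothetical description of a multicast mapping request.
struct MapRequest {
    std::uint64_t requested_bytes;     // size about to be mapped
    std::uint64_t device_total_bytes;  // assumed to be queried from the device
    bool fabric_state_ok;              // assumed to reflect the NVLink fabric state
};

// Report the fabric problem first, since it is the root cause in this issue;
// only then reject a request that cannot fit on the device.
inline MapCheck check_map_request(const MapRequest& req) {
    if (!req.fabric_state_ok) return MapCheck::FabricOutOfSync;
    if (req.requested_bytes > req.device_total_bytes) return MapCheck::ExceedsDeviceMemory;
    return MapCheck::Ok;
}

// Turn the result into an explicit, user-facing message.
inline std::string describe(MapCheck c) {
    switch (c) {
        case MapCheck::FabricOutOfSync:
            return "NVLink fabric state is out of sync; cannot map multicast group";
        case MapCheck::ExceedsDeviceMemory:
            return "requested mapping is larger than device memory";
        case MapCheck::Ok:
            return "ok";
    }
    return "unknown";
}
```

With a check like this, the failure on the broken instance would surface as an NVLink fabric error rather than as a failed 274877906944-byte cuMemMap.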
