[Issue]: NVSHMEM Attempts To Allocate 256GB of Memory When NVLINK Driver Fabric State Is Out Of Sync #40

### How is this issue impacting you?

Application crash

### NVSHMEM Version

devel @ 9cc869b

### Your platform details

p5en.48xlarge (H200)
ubuntu24
driver: 575.57.08
cuda: 12.8.1
Application: Any collective perftest

### Error Message & Behavior

NVSHMEM's error handling logic is broken: when there is an NVLink state issue on the instance, NVSHMEM attempts to allocate the default amount of memory for the symmetric heap, 256 GB (274877906944 bytes in the logs).

Disabling NVLS with `NVSHMEM_DISABLE_NVLS=1` makes the issue go away.

NVSHMEM Error Logs:
```
/home/szegel/Nvshmem/src/host/team/team_internal_nvls.cpp:254: non-zero status: 401 cuMemMap failed to map 274877906944 bytes handle at address: 0x1320000000 offset 0 on device 7
/home/szegel/Nvshmem/src/host/mem/mem_heap.cpp:1573: non-zero status: 7 Mapping mem size 274877906944 to MC group 101602722380032 failed 
/home/szegel/Nvshmem/src/host/team/team_internal.cpp:786: non-zero status: 7 Mapping multicast groups for UC heap failed for pe 7 team ID 1
```
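For reference, the byte count in these logs is exactly the 256 GB (strictly, 256 GiB) default heap size. A small sketch of the conversion, where `to_gib` is a hypothetical helper and not part of NVSHMEM:

```cpp
#include <cstdint>

// Bytes per GiB (2^30).
constexpr std::uint64_t kBytesPerGiB = 1ULL << 30;

// Hypothetical helper: convert a byte count to whole GiB, truncating any remainder.
constexpr std::uint64_t to_gib(std::uint64_t bytes) { return bytes / kBytesPerGiB; }

// The size reported by the cuMemMap failure in the log is exactly 256 GiB.
static_assert(274877906944ULL % kBytesPerGiB == 0, "size is a whole number of GiB");
static_assert(to_gib(274877906944ULL) == 256, "size from the log is 256 GiB");
```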

dmesg on broken instance:
```
[1891044.581319] NVRM: nvCheckOkFailedNoLog: Check failed: NVLink fabric state cached by the driver is out of sync [NV_ERR_FABRIC_STATE_OUT_OF_SYNC] (0x00000087) returned from status @ mem_multicast_fabric.c:2917
```

I traced the fabric state issue to upgrading libc6 from `libc6=2.39-0ubuntu8.5` to `libc6=2.39-0ubuntu8.6`, and have opened a bug with NVIDIA. This bug report is about fixing the error handling logic in NVSHMEM: it should fail with an explicit error pointing to NVLink issues instead of trying to allocate more memory than the GPU has.
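A rough sketch of the kind of message I have in mind, not NVSHMEM's actual code. `describe_map_failure` is a hypothetical helper. It assumes the `401` in the log is the raw `CUresult` from `cuMemMap`, which in `cuda.h` is `CUDA_ERROR_ILLEGAL_STATE`; the value is hard-coded so the snippet compiles without CUDA headers. Whether that status always indicates a fabric problem is also an assumption, so the message only suggests it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumption: 401 is the raw CUresult from cuMemMap (CUDA_ERROR_ILLEGAL_STATE in cuda.h).
constexpr int kCudaErrorIllegalState = 401;

// Hypothetical helper: build an actionable message for a failed multicast mapping.
std::string describe_map_failure(int cu_status, std::uint64_t map_bytes) {
    assert(cu_status != 0 && "only call this on failure");

    std::string msg = "cuMemMap failed with status " + std::to_string(cu_status) +
                      " while mapping " + std::to_string(map_bytes) + " bytes";

    // Point the user at the driver log instead of leaving only a size and an address.
    if (cu_status == kCudaErrorIllegalState) {
        msg += "; possible NVLink fabric state problem, check dmesg for "
               "NV_ERR_FABRIC_STATE_OUT_OF_SYNC";
    }
    return msg;
}
```

With the values from the log, `describe_map_failure(401, 274877906944ULL)` would mention the NVLink fabric state, while other status codes would get only the generic message.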
