[Issue]: Failed to run reduction_latency across nodes using ibrc #16

@kwu130

How is this issue impacting you?

Application hang

Share Your Debug Logs

The run log is attached:
log.txt

Steps to Reproduce the Issue

Minimal Steps:

cd nvshmem-3.4.5-0
cmake -DCUDA_ARCHITECTURES="75" -B build -S .
cd build
make -j8
cd build-install/bin/perftest/device/coll
nvshmrun -n 4 -hosts node1,node2 ./reduction_latency -w 0 -n 1 -c 1 -t 32 -b 8 -e 16
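Because the reported impact is a hang, it can help to bound each run and keep its output. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the launch node; the helper name `run_with_timeout` and the log file name are hypothetical and not part of NVSHMEM.

```shell
# Hypothetical helper: run a command with a time limit and save its output.
# Usage: run_with_timeout SECONDS LOGFILE COMMAND [ARGS...]
run_with_timeout() {
    secs=$1
    log=$2
    shift 2
    # timeout(1) exits with status 124 when it had to stop the command.
    timeout "$secs" "$@" >"$log" 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${secs}s (possible hang); see $log"
    else
        echo "exited with status $rc; see $log"
    fi
    return "$rc"
}

# Example with the command from the steps above (not executed here):
# run_with_timeout 120 run.log nvshmrun -n 4 -hosts node1,node2 \
#     ./reduction_latency -w 0 -n 1 -c 1 -t 32 -b 8 -e 16
```

A status of 124 distinguishes a run that had to be stopped from one that exited on its own with an error, which may be useful when attaching logs.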

Environment Details:

NVSHMEM_IBGDA_SUPPORT=1
NVSHMEM_DISABLE_MNNVL=true
NVSHMEM_BUILD_EXAMPLES=1
NVSHMEM_MPI_SUPPORT=0
NVSHMEM_IBGDA_NIC_HANDLER=gpu
NVSHMEM_IB_DISABLE_DMABUF=1
NVSHMEM_USE_GDRCOPY=0
NVSHMEM_NVTX=0
NVSHMEM_DISABLE_P2P=1
NVSHMEM_BUILD_TESTS=1
NVSHMEM_IB_ENABLE_IBGDA=0
NVSHMEM_DEBUG_SUBSYS=ALL
NVSHMEM_DEVICELIB_CUDA_HOME=/usr/local/cuda
NVSHMEM_PREFIX=/opt/nvshmem-3.4.5-0/build-install
NVSHMEM_IBRC_SUPPORT=1
NVSHMEM_DISABLE_CUDA_VMM=true
NVSHMEM_DEBUG=INFO
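The list above appears to mix build-time options (for example `NVSHMEM_IBRC_SUPPORT` and `NVSHMEM_BUILD_TESTS`) with runtime settings (for example `NVSHMEM_DEBUG` and `NVSHMEM_DEBUG_SUBSYS`). To record what a shell on a given node actually sees, a small sketch like the following could be run on each host and the outputs compared; the helper name `dump_nvshmem_env` is hypothetical.

```shell
# Hypothetical helper: print every NVSHMEM_* variable visible to this shell,
# sorted so the output can be diffed between nodes.
dump_nvshmem_env() {
    env | grep '^NVSHMEM_' | sort
}

# Demo using two values taken from the list above.
export NVSHMEM_DEBUG=INFO
export NVSHMEM_DEBUG_SUBSYS=ALL
dump_nvshmem_env
```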

Intermittency: every time
Previous Success: no, it also failed with nvshmem-3.2.5

NVSHMEM Version

nvshmem-3.4.5-0 + CUDA 12.6

Your platform details

(two images attached)

Error Message & Behavior

[nvshmem-3.4.5-0/perftest/device/coll/reduction_latency.cu:253] cuda failed with an illegal memory access was encountered
[nvshmem-3.4.5-0/perftest/device/coll/reduction_latency.cu:253] cuda failed with an illegal memory access was encountered
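When the full log is long, it can be useful to count how many processes report a CUDA failure and at which source location. The following is a minimal sketch, assuming a `grep` that supports `-o`; the helper name `summarize_cuda_failures` is hypothetical, and the sample file only reproduces the two lines quoted above.

```shell
# Hypothetical helper: count "cuda failed with" messages per source location.
# Usage: summarize_cuda_failures LOGFILE
summarize_cuda_failures() {
    # Keep only the "[file:line] cuda failed with ..." part of each line,
    # then group identical messages and count them.
    grep -o '\[[^]]*:[0-9][0-9]*] cuda failed with .*' "$1" | sort | uniq -c
}

# Demo on the two lines quoted above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[nvshmem-3.4.5-0/perftest/device/coll/reduction_latency.cu:253] cuda failed with an illegal memory access was encountered
[nvshmem-3.4.5-0/perftest/device/coll/reduction_latency.cu:253] cuda failed with an illegal memory access was encountered
EOF
summarize_cuda_failures "$sample"
```

Both messages name the same source line. CUDA reports kernel errors asynchronously, so this line is where the error was detected and not necessarily where the faulting access happened. A tool such as `compute-sanitizer` may help locate the access itself, though whether it works under `nvshmrun` in this setup has not been confirmed here.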

Labels: bug
