
[Issue]: make error with v1.1.0-2 #2

@billcsm

Description

Problem Description

We followed the instructions to install the ANP plugin v1.1.0-2 from GitHub. The build fails during the "make" step with the following output:

$ make RCCL_BUILD=$RCCL_BUILD MPI_INCLUDE=$MPI_INCLUDE MPI_LIB_PATH=$MPI_LIB_PATH
/opt/rocm/bin/hipcc -fPIC -g -O2 -O3 -DNDEBUG -Werror -MMD -MP -DTARGET_PLUGIN -DNCCL_BUILD_RDMA_CORE -UANP_TELEMETRY_ENABLED -D__HIP_PLATFORM_AMD__ -Iinclude -I/opt/rocm/include -I/usr/include -I/home/ubuntu/github/rccl/build/include -I/home/ubuntu/github/rccl/build/hipify/src -I/home/ubuntu/github/rccl/build/hipify/src/include -I/opt/ompi/include -c src/net_ib.cc -o build/src/net_ib.o
src/net_ib.cc:46:9: error: 'NCCL_NET_OPTIONAL_RECV_COMPLETION' macro redefined [-Werror,-Wmacro-redefined]
   46 | #define NCCL_NET_OPTIONAL_RECV_COMPLETION    (void *)0x1
      |         ^
/home/ubuntu/github/rccl/build/hipify/src/include/nccl_net.h:18:9: note: previous definition is here
   18 | #define NCCL_NET_OPTIONAL_RECV_COMPLETION 0x1
      |         ^
src/net_ib.cc:56:9: error: 'NCCL_NET_PLUGIN_SYMBOL' macro redefined [-Werror,-Wmacro-redefined]
   56 | #define NCCL_NET_PLUGIN_SYMBOL ncclNetPlugin_v8
      |         ^
/home/ubuntu/github/rccl/build/hipify/src/include/nccl_net.h:120:9: note: previous definition is here
  120 | #define NCCL_NET_PLUGIN_SYMBOL ncclNetPlugin_v9
      |         ^
src/net_ib.cc:2732:14: error: cannot initialize a member subobject of type 'ncclResult_t (*)(void *, void *, size_t, int, void *, void **)' (aka 'ncclResult_t (*)(void *, void *, unsigned long, int, void *, void **)') with an lvalue of type 'ncclResult_t (void *, void *, int, int, void *, void **)': type mismatch at 3rd parameter ('size_t' (aka 'unsigned long') vs 'int')
 2732 |     .isend = anpNetIsend,
      |              ^~~~~~~~~~~
src/net_ib.cc:2733:14: error: cannot initialize a member subobject of type 'ncclResult_t (*)(void *, int, void **, size_t *, int *, void **, void **)' (aka 'ncclResult_t (*)(void *, int, void **, unsigned long *, int *, void **, void **)') with an lvalue of type 'ncclResult_t (void *, int, void **, int *, int *, void **, void **)': type mismatch at 4th parameter ('size_t *' (aka 'unsigned long *') vs 'int *')
 2733 |     .irecv = anpNetIrecv,
      |              ^~~~~~~~~~~
4 errors generated when compiling for gfx942.
failed to execute:/opt/rocm-6.4.0/lib/llvm/bin/clang++  --offload-arch=gfx942 --offload-arch=gfx942 --offload-arch=gfx942 --offload-arch=gfx942 --offload-arch=gfx942 --offload-arch=gfx942 --offload-arch=gfx942 --offload-arch=gfx942  -fPIC -g -O2 -O3 -DNDEBUG -Werror -MMD -MP -DTARGET_PLUGIN -DNCCL_BUILD_RDMA_CORE -UANP_TELEMETRY_ENABLED -D__HIP_PLATFORM_AMD__ -Iinclude -I/opt/rocm/include -I/usr/include -I/home/ubuntu/github/rccl/build/include -I/home/ubuntu/github/rccl/build/hipify/src -I/home/ubuntu/github/rccl/build/hipify/src/include -I/opt/ompi/include -c -x hip src/net_ib.cc -o "build/src/net_ib.o"
make: *** [Makefile:77: build/src/net_ib.o] Error 1

Could you please investigate this build failure for the AMD-ANP setup?
Thank you.

Operating System

Ubuntu 22.04.5 LTS

CPU

AMD EPYC 9965 192-Core Processor

GPU

AMD Instinct MI325X

ROCm Version

ROCm 6.4.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response