horovodrun command reports an error and cannot run the examples #26

@FyuNaru

I set up the environment by following the steps in INSTALLING.md exactly. When I ran the command from TRAINING.md to test it, the following error occurred:


horovodrun -np 2 -H 192.168.31.6:2 --verbose python examples/torch/pytorch_mnist.py


Output:

Filtering local host names.
Remote host found:
All hosts are local, finding the interfaces with address 127.0.0.1
Local interface found lo
mpirun --allow-run-as-root --tag-output -np 2 -H 192.168.31.6:2 -bind-to none -map-by slot -mca btl_tcp_if_include lo -x NCCL_SOCKET_IFNAME=lo -x ADDR2LINE -x AR -x AS -x BROWSER -x CC -x CFLAGS -x CMAKE_PREFIX_PATH -x COLORTERM -x CONDA_BACKUP_ADDR2LINE -x CONDA_BACKUP_AR -x CONDA_BACKUP_AS -x CONDA_BACKUP_CC -x CONDA_BACKUP_CFLAGS -x CONDA_BACKUP_CMAKE_PREFIX_PATH -x CONDA_BACKUP_CONDA_BUILD_SYSROOT -x CONDA_BACKUP_CPP -x CONDA_BACKUP_CPPFLAGS -x CONDA_BACKUP_CXX -x CONDA_BACKUP_CXXFILT -x CONDA_BACKUP_CXXFLAGS -x CONDA_BACKUP_DEBUG_CFLAGS -x CONDA_BACKUP_DEBUG_CPPFLAGS -x CONDA_BACKUP_DEBUG_CXXFLAGS -x CONDA_BACKUP_ELFEDIT -x CONDA_BACKUP_GCC -x CONDA_BACKUP_GCC_AR -x CONDA_BACKUP_GCC_NM -x CONDA_BACKUP_GCC_RANLIB -x CONDA_BACKUP_GPROF -x CONDA_BACKUP_GXX -x CONDA_BACKUP_HOST -x CONDA_BACKUP_LD -x CONDA_BACKUP_LDFLAGS -x CONDA_BACKUP_LD_GOLD -x CONDA_BACKUP_NM -x CONDA_BACKUP_OBJCOPY -x CONDA_BACKUP_OBJDUMP -x CONDA_BACKUP_RANLIB -x CONDA_BACKUP_READELF -x CONDA_BACKUP_SIZE -x CONDA_BACKUP_STRINGS -x CONDA_BACKUP_STRIP -x CONDA_BACKUP__CONDA_PYTHON_SYSCONFIGDATA_NAME -x CONDA_BUILD_SYSROOT -x CONDA_CUPY_CUDA_PATH -x CONDA_DEFAULT_ENV -x CONDA_EXE -x CONDA_PREFIX -x CONDA_PREFIX_1 -x CONDA_PREFIX_2 -x CONDA_PREFIX_3 -x CONDA_PREFIX_4 -x CONDA_PREFIX_5 -x CONDA_PREFIX_6 -x CONDA_PREFIX_7 -x CONDA_PROMPT_MODIFIER -x CONDA_PYTHON_EXE -x CONDA_SHLVL -x CPP -x CPPFLAGS -x CUDA_PATH -x CXX -x CXXFILT -x CXXFLAGS -x DBUS_SESSION_BUS_ADDRESS -x DEBUG_CFLAGS -x DEBUG_CPPFLAGS -x DEBUG_CXXFLAGS -x ELFEDIT -x GCC -x GCC_AR -x GCC_NM -x GCC_RANLIB -x GIT_ASKPASS -x GPROF -x GXX -x HOME -x HOROVOD_CCL_BGT_AFFINITY -x HOROVOD_GLOO_TIMEOUT_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOST -x LANG -x LANGUAGE -x LD -x LDFLAGS -x LD_GOLD -x LESSCLOSE -x LESSOPEN -x LOGNAME -x LS_COLORS -x MOTD_SHOWN -x NCCL_SOCKET_IFNAME -x NM -x OBJCOPY -x OBJDUMP -x PATH -x PWD -x RANLIB -x READELF -x SHELL -x SHLVL -x SIZE -x 
SSH_CLIENT -x SSH_CONNECTION -x STRINGS -x STRIP -x TERM -x TERM_PROGRAM -x TERM_PROGRAM_VERSION -x USER -x VSCODE_GIT_ASKPASS_EXTRA_ARGS -x VSCODE_GIT_ASKPASS_MAIN -x VSCODE_GIT_ASKPASS_NODE -x VSCODE_GIT_IPC_HANDLE -x VSCODE_IPC_HOOK_CLI -x XDG_DATA_DIRS -x XDG_RUNTIME_DIR -x XDG_SESSION_CLASS -x XDG_SESSION_ID -x XDG_SESSION_TYPE -x _ -x _CE_CONDA -x _CE_M -x _CONDA_PYTHON_SYSCONFIGDATA_NAME python examples/torch/pytorch_mnist.py
[mpiexec@gpu-server-1] match_arg (lib/utils/args.c:166): unrecognized argument allow-run-as-root
[mpiexec@gpu-server-1] HYDU_parse_array (lib/utils/args.c:181): argument matching returned error
[mpiexec@gpu-server-1] parse_args (mpiexec/get_parameters.c:315): error parsing input array
[mpiexec@gpu-server-1] HYD_uii_mpx_get_parameters (mpiexec/get_parameters.c:47): unable to parse user arguments
[mpiexec@gpu-server-1] main (mpiexec/mpiexec.c:49): error parsing parameters


I could not find a solution to this error online. I suspect that the installed MPI version is too new: the mpirun --version command reports version 4.1.1. I don't know how to fix this. I have tried several things, including switching to an older server with a completely different configuration, but the same error still occurs.
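
One thing that may be worth checking, given the log above: the [mpiexec@gpu-server-1] HYDU_parse_array lines look like output from the Hydra launcher that MPICH uses, while --allow-run-as-root is an Open MPI option, so the mpirun found on PATH might not be the implementation that horovodrun expects. The sketch below is only a diagnostic aid under that assumption, not a confirmed fix. The helper names classify_mpi and detect_mpirun are hypothetical, and the matched strings are based on the banners that Open MPI, MPICH (Hydra), and Intel MPI typically print for --version.

```python
import shutil
import subprocess


def classify_mpi(version_text: str) -> str:
    """Guess the MPI implementation from the text printed by `mpirun --version`.

    The substrings checked here are assumptions about typical banners:
    Open MPI prints "mpirun (Open MPI) x.y.z", MPICH's Hydra prints
    "HYDRA build details:", Intel MPI prints "Intel(R) MPI Library ...".
    """
    text = version_text.lower()
    if "intel(r) mpi" in text:
        return "intelmpi"
    if "open mpi" in text:
        return "openmpi"
    if "hydra" in text or "mpich" in text:
        return "mpich"
    return "unknown"


def detect_mpirun(executable: str = "mpirun"):
    """Return (path, implementation) for the first `executable` on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None, "not found"
    # Some launchers write the banner to stderr, so both streams are read.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return path, classify_mpi(result.stdout + result.stderr)


if __name__ == "__main__":
    path, implementation = detect_mpirun()
    print(f"mpirun path: {path}")
    print(f"implementation guess: {implementation}")
```

If this reports mpich (or the path points into a conda environment that ships MPICH), then the version number 4.1.1 alone would not explain the error, since both Open MPI and MPICH have a 4.1.1 release.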

I would appreciate any help. Thank you.
