This repository was archived by the owner on Nov 19, 2025. It is now read-only.

PPOTrainer can't be imported when the slurm job has more than one node #541

@mrm-196

I ran a minimal Python script that only imports PPOTrainer, in a Slurm job with more than one node. Full script:

from nemo_aligner.algorithms.ppo import PPOTrainer

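However, the import crashes inside PPOTrainer's downstream dependencies: the backtrace shows a segmentation fault during MPI initialization, triggered while mpi4py is being imported. Error logs: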

pynemo/0 [NeMo I 2025-04-30 22:15:36 nemo_logging:393] Importing NeMo-Aligner sets DISABLE_TORCH_DEVICE_SET=1 to disable device reassignment within TensorRT-LLM
pynemo/0 [gcp5-sdc-2:1952597] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952600] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952606] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952618] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952630] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [NeMo W 2025-04-30 22:15:49 nemo_logging:405] Please use the EncDecSpeakerLabelModel instead of this model. EncDecClassificationModel model is kept for backward compatibility with older models.
pynemo/0 [gcp5-sdc-3:1950659] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952629] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952614] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952588] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950638] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950653] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950647] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950630] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950625] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950641] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950617] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [1746051351.793592] [gcp5-sdc-2:1952597:0]            sock.c:334  UCX  ERROR   connect(fd=168, dest_addr=169.254.4.6:56945) failed: Connection refused
pynemo/0 [gcp5-sdc-2:1952597] pml_ucx.c:424  Error: ucp_ep_create(proc=8) failed: Destination is unreachable
pynemo/0 [gcp5-sdc-2:1952597] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 8
pynemo/0 [gcp5-sdc-2:1952597:0:1952597] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x300073726fd2)
pynemo/0 ==== backtrace (tid:1952597) ====
pynemo/0  0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7e8e61508654]
pynemo/0  1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7e8e6150884c]
pynemo/0  2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7e8e61508a88]
pynemo/0  3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7e8e6cb65330]
pynemo/0  4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48) [0x7e8e6c1e59e8]
pynemo/0  5  /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x39ac) [0x7e87d24069ac]
pynemo/0  6  /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x3c) [0x7e8e044a6a5c]
pynemo/0  7  /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7e7) [0x7e8e044a7657]
pynemo/0  8  /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x3cb1) [0x7e87d2406cb1]
pynemo/0  9  /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x6f) [0x7e87d240883f]
pynemo/0 10  /opt/hpcx/ompi/lib/libmpi.so.40(+0x85e4c) [0x7e8e6c217e4c]
pynemo/0 11  /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x76) [0x7e8e6c218446]
pynemo/0 12  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0xee3) [0x7e8e6c265613]
pynemo/0 13  /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Init_thread+0x81) [0x7e8e6c208d71]
pynemo/0 14  /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33651) [0x7e87d4618651]
pynemo/0 15  /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33cef) [0x7e87d4618cef]
pynemo/0 16  python(PyModule_ExecDef+0x17f) [0x582c5f]
pynemo/0 17  python() [0x5fd904]
pynemo/0 18  python() [0x581fa2]
pynemo/0 19  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 20  python() [0x549f57]
pynemo/0 21  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 22  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 23  python() [0x5d37c4]
pynemo/0 24  python() [0x581f0d]
pynemo/0 25  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 26  python() [0x549f57]
pynemo/0 27  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 28  python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 29  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 30  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 31  python() [0x5d347c]
pynemo/0 32  python() [0x581f0d]
pynemo/0 33  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 34  python() [0x549f57]
pynemo/0 35  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 36  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 37  python() [0x5d37c4]
pynemo/0 38  python() [0x581f0d]
pynemo/0 39  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 40  python() [0x549f57]
pynemo/0 41  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 42  python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 43  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 44  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 45  python() [0x5d347c]
pynemo/0 46  python() [0x581f0d]
pynemo/0 47  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 48  python() [0x549f57]
pynemo/0 49  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 50  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 51  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 52  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 53  python() [0x5d347c]
pynemo/0 54  python() [0x581f0d]
pynemo/0 55  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 56  python() [0x549f57]
pynemo/0 57  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 58  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 59  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 60  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 61  python() [0x5d347c]
pynemo/0 =================================
pynemo/0 [1746051351.812265] [gcp5-sdc-3:1950659:0]            sock.c:334  UCX  ERROR   connect(fd=168, dest_addr=169.254.4.6:52595) failed: Connection refused
pynemo/0 [gcp5-sdc-3:1950659] pml_ucx.c:424  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
pynemo/0 [gcp5-sdc-3:1950659] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 0
pynemo/0 [gcp5-sdc-3:1950659:0:1950659] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x300073726fd2)
pynemo/0 ==== backtrace (tid:1950659) ====
pynemo/0  0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7dbfbfb08654]
pynemo/0  1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7dbfbfb0884c]
pynemo/0  2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7dbfbfb08a88]
pynemo/0  3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7dbfcb180330]
pynemo/0  4  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48) [0x7dbfca8009e8]
pynemo/0  5  /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x39ac) [0x7db930a0f9ac]
pynemo/0  6  /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x3c) [0x7dbfc997da5c]
pynemo/0  7  /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7e7) [0x7dbfc997e657]
pynemo/0  8  /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x3cb1) [0x7db930a0fcb1]
pynemo/0  9  /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x6f) [0x7db930a1183f]
pynemo/0 10  /opt/hpcx/ompi/lib/libmpi.so.40(+0x85e4c) [0x7dbfca832e4c]
pynemo/0 11  /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x76) [0x7dbfca833446]
pynemo/0 12  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0xee3) [0x7dbfca880613]
pynemo/0 13  /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Init_thread+0x81) [0x7dbfca823d71]
pynemo/0 14  /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33651) [0x7db932c3a651]
pynemo/0 15  /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33cef) [0x7db932c3acef]
pynemo/0 16  python(PyModule_ExecDef+0x17f) [0x582c5f]
pynemo/0 17  python() [0x5fd904]
pynemo/0 18  python() [0x581fa2]
pynemo/0 19  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 20  python() [0x549f57]
pynemo/0 21  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 22  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 23  python() [0x5d37c4]
pynemo/0 24  python() [0x581f0d]
pynemo/0 25  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 26  python() [0x549f57]
pynemo/0 27  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 28  python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 29  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 30  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 31  python() [0x5d347c]
pynemo/0 32  python() [0x581f0d]
pynemo/0 33  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 34  python() [0x549f57]
pynemo/0 35  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 36  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 37  python() [0x5d37c4]
pynemo/0 38  python() [0x581f0d]
pynemo/0 39  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 40  python() [0x549f57]
pynemo/0 41  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 42  python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 43  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 44  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 45  python() [0x5d347c]
pynemo/0 46  python() [0x581f0d]
pynemo/0 47  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 48  python() [0x549f57]
pynemo/0 49  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 50  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 51  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 52  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 53  python() [0x5d347c]
pynemo/0 54  python() [0x581f0d]
pynemo/0 55  python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 56  python() [0x549f57]
pynemo/0 57  python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 58  python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 59  python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 60  python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 61  python() [0x5d347c]
pynemo/0 =================================
pynemo/0 /nemo_run/scripts/nemo.sh: line 2: 1952597 Segmentation fault      (core dumped) python dummy.py
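
The log above is long and repetitive, so a small summarizer can make the pattern easier to see. The following is a minimal sketch, not part of NeMo-Aligner: the function name summarize and the regular expressions are my own assumptions, written against the line shapes in this particular log. It counts PMIX errors per host, lists the UCX connect failures, and collects the named non-Python frames from the backtraces. It also flags whether a destination address is link-local. 169.254.0.0/16 is the IPv4 link-local range, so a flagged address may hint that UCX picked an interface that is not routable between nodes, but that interpretation is an assumption the issue does not confirm.

```python
import ipaddress
import re
import sys
from collections import Counter

# Patterns are assumptions fitted to the log format shown in this issue.
PMIX_RE = re.compile(r"\[(?P<host>[\w.-]+):\d+\] PMIX ERROR")
UCX_RE = re.compile(
    r"\[(?P<host>[\w.-]+):\d+:\d+\]\s+\S+\s+UCX\s+ERROR\s+"
    r"connect\(fd=\d+, dest_addr=(?P<addr>[\d.]+):\d+\) failed: (?P<reason>.+)"
)
FRAME_RE = re.compile(
    r"^\S+\s+\d+\s+(?P<lib>\S+?)\((?P<sym>[A-Za-z_]\w*)\+0x[0-9a-f]+\)"
)


def summarize(lines):
    """Summarize PMIX errors, UCX connect failures, and named native frames."""
    pmix = Counter()
    ucx = []
    frames = []
    for line in lines:
        if m := PMIX_RE.search(line):
            pmix[m["host"]] += 1
        elif m := UCX_RE.search(line):
            addr = ipaddress.ip_address(m["addr"])
            ucx.append((m["host"], str(addr), addr.is_link_local, m["reason"].strip()))
        elif m := FRAME_RE.match(line):
            # Skip interpreter frames; keep each native symbol once, in order.
            if m["lib"] != "python" and m["sym"] not in frames:
                frames.append(m["sym"])
    return {"pmix_errors": dict(pmix), "ucx_failures": ucx, "native_frames": frames}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_log.py job.log  (file name is hypothetical)
    with open(sys.argv[1]) as fh:
        print(summarize(fh))
```

Run against the full log, the native_frames list reads bottom-up as the call chain from the mpi4py import into ompi_mpi_init and the UCC collective setup, which is where the crash is reported.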

Note that DPOTrainer does not have this issue: it works as expected on multiple nodes. PPOTrainer also works fine on a single node; the crash only appears when the job uses more than one node.
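
Since frames 12 to 15 of the backtrace show PMPI_Init_thread being called from mpi4py's MPI extension module during import, one way to narrow this down is to check whether MPI initialization alone reproduces the crash, independent of NeMo-Aligner. The following is a minimal sketch under that assumption. The function name probe_mpi and the --init flag are hypothetical; mpi4py.rc.initialize is mpi4py's documented switch for disabling automatic MPI initialization at import time.

```python
import sys


def probe_mpi(init: bool = False) -> dict:
    """Import mpi4py without auto-initializing MPI, then optionally initialize it."""
    try:
        import mpi4py
    except ImportError:
        return {"mpi4py_available": False}

    # Must be set before mpi4py.MPI is first imported, otherwise it has no effect.
    mpi4py.rc.initialize = False
    try:
        from mpi4py import MPI
    except ImportError as exc:
        return {"mpi4py_available": False, "error": str(exc)}

    info = {
        "mpi4py_available": True,
        "initialized_after_import": MPI.Is_initialized(),
    }
    if init and not MPI.Is_initialized():
        # This is the step the backtrace points at (MPI_Init_thread).
        MPI.Init_thread()
        info["rank"] = MPI.COMM_WORLD.Get_rank()
        info["size"] = MPI.COMM_WORLD.Get_size()
        MPI.Finalize()
    return info


if __name__ == "__main__":
    # Pass --init to trigger the explicit MPI initialization step.
    print(probe_mpi(init="--init" in sys.argv), flush=True)
```

If the multi-node job survives the import but crashes only with --init, that would suggest the problem sits in the MPI/UCX setup of the job rather than in PPOTrainer itself. This is a hypothesis to test, not a confirmed diagnosis.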

Environment overview

  • Environment location: Docker
  • Method of install: docker pull nvcr.io/nvidia/nemo:25.04.rc1

Metadata

Labels: bug (Something isn't working)
