This repository was archived by the owner on Nov 19, 2025. It is now read-only.
PPOTrainer can't be imported when the slurm job has more than one node #541
Open
Labels
bug (Something isn't working)
Description
I ran a minimal Python script that only imports PPOTrainer, in a Slurm job with more than one node. Full script:
from nemo_aligner.algorithms.ppo import PPOTrainer
However, downstream dependencies of PPOTrainer fail during this import. Error logs:
pynemo/0 [NeMo I 2025-04-30 22:15:36 nemo_logging:393] Importing NeMo-Aligner sets DISABLE_TORCH_DEVICE_SET=1 to disable device reassignment within TensorRT-LLM
pynemo/0 [gcp5-sdc-2:1952597] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952600] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952606] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952618] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952630] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [NeMo W 2025-04-30 22:15:49 nemo_logging:405] Please use the EncDecSpeakerLabelModel instead of this model. EncDecClassificationModel model is kept for backward compatibility with older models.
pynemo/0 [gcp5-sdc-3:1950659] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952629] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952614] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-2:1952588] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950638] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950653] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950647] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950630] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950625] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950641] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [gcp5-sdc-3:1950617] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
pynemo/0 [1746051351.793592] [gcp5-sdc-2:1952597:0] sock.c:334 UCX ERROR connect(fd=168, dest_addr=169.254.4.6:56945) failed: Connection refused
pynemo/0 [gcp5-sdc-2:1952597] pml_ucx.c:424 Error: ucp_ep_create(proc=8) failed: Destination is unreachable
pynemo/0 [gcp5-sdc-2:1952597] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 8
pynemo/0 [gcp5-sdc-2:1952597:0:1952597] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x300073726fd2)
pynemo/0 ==== backtrace (tid:1952597) ====
pynemo/0 0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7e8e61508654]
pynemo/0 1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7e8e6150884c]
pynemo/0 2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7e8e61508a88]
pynemo/0 3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7e8e6cb65330]
pynemo/0 4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48) [0x7e8e6c1e59e8]
pynemo/0 5 /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x39ac) [0x7e87d24069ac]
pynemo/0 6 /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x3c) [0x7e8e044a6a5c]
pynemo/0 7 /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7e7) [0x7e8e044a7657]
pynemo/0 8 /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x3cb1) [0x7e87d2406cb1]
pynemo/0 9 /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x6f) [0x7e87d240883f]
pynemo/0 10 /opt/hpcx/ompi/lib/libmpi.so.40(+0x85e4c) [0x7e8e6c217e4c]
pynemo/0 11 /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x76) [0x7e8e6c218446]
pynemo/0 12 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0xee3) [0x7e8e6c265613]
pynemo/0 13 /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Init_thread+0x81) [0x7e8e6c208d71]
pynemo/0 14 /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33651) [0x7e87d4618651]
pynemo/0 15 /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33cef) [0x7e87d4618cef]
pynemo/0 16 python(PyModule_ExecDef+0x17f) [0x582c5f]
pynemo/0 17 python() [0x5fd904]
pynemo/0 18 python() [0x581fa2]
pynemo/0 19 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 20 python() [0x549f57]
pynemo/0 21 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 22 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 23 python() [0x5d37c4]
pynemo/0 24 python() [0x581f0d]
pynemo/0 25 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 26 python() [0x549f57]
pynemo/0 27 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 28 python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 29 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 30 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 31 python() [0x5d347c]
pynemo/0 32 python() [0x581f0d]
pynemo/0 33 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 34 python() [0x549f57]
pynemo/0 35 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 36 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 37 python() [0x5d37c4]
pynemo/0 38 python() [0x581f0d]
pynemo/0 39 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 40 python() [0x549f57]
pynemo/0 41 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 42 python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 43 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 44 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 45 python() [0x5d347c]
pynemo/0 46 python() [0x581f0d]
pynemo/0 47 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 48 python() [0x549f57]
pynemo/0 49 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 50 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 51 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 52 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 53 python() [0x5d347c]
pynemo/0 54 python() [0x581f0d]
pynemo/0 55 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 56 python() [0x549f57]
pynemo/0 57 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 58 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 59 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 60 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 61 python() [0x5d347c]
pynemo/0 =================================
pynemo/0 [1746051351.812265] [gcp5-sdc-3:1950659:0] sock.c:334 UCX ERROR connect(fd=168, dest_addr=169.254.4.6:52595) failed: Connection refused
pynemo/0 [gcp5-sdc-3:1950659] pml_ucx.c:424 Error: ucp_ep_create(proc=0) failed: Destination is unreachable
pynemo/0 [gcp5-sdc-3:1950659] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 0
pynemo/0 [gcp5-sdc-3:1950659:0:1950659] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x300073726fd2)
pynemo/0 ==== backtrace (tid:1950659) ====
pynemo/0 0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7dbfbfb08654]
pynemo/0 1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7dbfbfb0884c]
pynemo/0 2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7dbfbfb08a88]
pynemo/0 3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7dbfcb180330]
pynemo/0 4 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48) [0x7dbfca8009e8]
pynemo/0 5 /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x39ac) [0x7db930a0f9ac]
pynemo/0 6 /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x3c) [0x7dbfc997da5c]
pynemo/0 7 /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7e7) [0x7dbfc997e657]
pynemo/0 8 /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x3cb1) [0x7db930a0fcb1]
pynemo/0 9 /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x6f) [0x7db930a1183f]
pynemo/0 10 /opt/hpcx/ompi/lib/libmpi.so.40(+0x85e4c) [0x7dbfca832e4c]
pynemo/0 11 /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x76) [0x7dbfca833446]
pynemo/0 12 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0xee3) [0x7dbfca880613]
pynemo/0 13 /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Init_thread+0x81) [0x7dbfca823d71]
pynemo/0 14 /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33651) [0x7db932c3a651]
pynemo/0 15 /usr/local/lib/python3.12/dist-packages/mpi4py/MPI.cpython-312-x86_64-linux-gnu.so(+0x33cef) [0x7db932c3acef]
pynemo/0 16 python(PyModule_ExecDef+0x17f) [0x582c5f]
pynemo/0 17 python() [0x5fd904]
pynemo/0 18 python() [0x581fa2]
pynemo/0 19 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 20 python() [0x549f57]
pynemo/0 21 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 22 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 23 python() [0x5d37c4]
pynemo/0 24 python() [0x581f0d]
pynemo/0 25 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 26 python() [0x549f57]
pynemo/0 27 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 28 python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 29 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 30 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 31 python() [0x5d347c]
pynemo/0 32 python() [0x581f0d]
pynemo/0 33 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 34 python() [0x549f57]
pynemo/0 35 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 36 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 37 python() [0x5d37c4]
pynemo/0 38 python() [0x581f0d]
pynemo/0 39 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 40 python() [0x549f57]
pynemo/0 41 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 42 python(PyImport_ImportModuleLevelObject+0x5eb) [0x5fdfcb]
pynemo/0 43 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 44 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 45 python() [0x5d347c]
pynemo/0 46 python() [0x581f0d]
pynemo/0 47 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 48 python() [0x549f57]
pynemo/0 49 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 50 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 51 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 52 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 53 python() [0x5d347c]
pynemo/0 54 python() [0x581f0d]
pynemo/0 55 python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
pynemo/0 56 python() [0x549f57]
pynemo/0 57 python(PyObject_CallMethodObjArgs+0xe3) [0x54b7e3]
pynemo/0 58 python(PyImport_ImportModuleLevelObject+0x395) [0x5fdd75]
pynemo/0 59 python(_PyEval_EvalFrameDefault+0x5b9c) [0x5dc4dc]
pynemo/0 60 python(PyEval_EvalCode+0x15b) [0x5d58eb]
pynemo/0 61 python() [0x5d347c]
pynemo/0 =================================
pynemo/0 /nemo_run/scripts/nemo.sh: line 2: 1952597 Segmentation fault (core dumped) python dummy.py
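To make the log above easier to scan, here is a small sketch that counts the PMIX errors per host and lists the UCX connect failures. The regular expressions are assumptions derived only from the line shapes in this report, and `summarize` is a hypothetical helper, not part of NeMo-Aligner, UCX, or Open MPI. It also flags whether each destination address is IPv4 link-local (169.254.0.0/16), which is the case for the 169.254.4.6 address in both failures. Whether that is related to the crash is not established here.

```python
import ipaddress
import re
from collections import Counter

# Patterns inferred from the log lines in this report (assumption, not a stable format).
PMIX_ERR = re.compile(r"\[(?P<host>[\w.-]+):\d+\] PMIX ERROR")
UCX_CONNECT = re.compile(
    r"\[(?P<host>[\w.-]+):\d+:\d+\]\s+sock\.c:\d+\s+UCX ERROR "
    r"connect\(fd=\d+, dest_addr=(?P<addr>[\d.]+):(?P<port>\d+)\) "
    r"failed: (?P<reason>.+)$"
)


def summarize(lines):
    """Return (PMIX error count per host, list of UCX connect failures).

    Each failure is a tuple: (host, destination IP, is_link_local, reason).
    """
    pmix_per_host = Counter()
    failures = []
    for line in lines:
        line = line.rstrip("\n")
        match = PMIX_ERR.search(line)
        if match:
            pmix_per_host[match["host"]] += 1
            continue
        match = UCX_CONNECT.search(line)
        if match:
            ip = ipaddress.ip_address(match["addr"])
            failures.append(
                (match["host"], str(ip), ip.is_link_local, match["reason"].strip())
            )
    return pmix_per_host, failures
```

Calling `summarize(open(path))` on a saved copy of the job output (the path is whatever file the log was written to) returns the two summaries. A `True` in the third field of a failure means the peer address was link-local.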
It's also worth mentioning that DPOTrainer doesn't have this issue: it works as expected on multiple nodes. PPOTrainer also works fine on a single node; the failure only appears when the job uses more than one node.
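Both backtraces above pass through `PMPI_Init_thread`, called from mpi4py's `MPI` extension module while it is being imported. One way to narrow down which import triggers the crash is to import candidate modules one at a time, each in a fresh interpreter, and look at the exit status. The sketch below does that with the standard library only. `try_import` and `describe` are hypothetical helper names, and the module names to probe are passed on the command line.

```python
import subprocess
import sys


def try_import(module: str, timeout: float = 300.0) -> int:
    """Import `module` in a fresh interpreter and return the child's exit code.

    On POSIX, a negative value -N means the child was killed by signal N.
    """
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


def describe(code: int) -> str:
    """Turn an exit code into a short human-readable status."""
    if code == 0:
        return "ok"
    if code < 0:
        return f"killed by signal {-code}"
    return f"exit code {code}"


if __name__ == "__main__":
    # Module names come from the command line, e.g.
    #   mpi4py.MPI nemo_aligner.algorithms.ppo
    for name in sys.argv[1:]:
        print(f"{name}: {describe(try_import(name))}")
```

Run inside the same multi-node Slurm job, once with `mpi4py.MPI` and once with `nemo_aligner.algorithms.ppo` as arguments, this would show whether the bare mpi4py import already fails there. That is a diagnostic suggestion under the assumption that the crash reproduces in a child process, not a confirmed root cause.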
Environment overview (please complete the following information)
- Environment location: Docker
- Method of install: docker pull nvcr.io/nvidia/nemo:25.04.rc1