
Conversation

@MaoZiming
Member

MaoZiming commented Dec 23, 2025

Description

Please include a summary of the changes and the related issue.

  • Added post_atomic_operations_native_rdma, which posts RDMA atomics directly on the NIC. On its own (with the atomic buffer still coming from cuda/hipHostMalloc) this does not solve the problem; see the sketch after this list.
  • Use cudaMalloc for atomic_buffer_ptr_.
  • Added an optional SOFTWARE_ORDERING flag to test reordering on non-EFA platforms. Currently disabled, since reordering turned out not to be the problem.
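
For reference, a minimal sketch (assuming libibverbs and an already-connected RC queue pair; names are illustrative and not the actual UCCL code) of what posting a NIC-native fetch-and-add looks like:

```cpp
// Sketch only: post a NIC-native 64-bit fetch-and-add with libibverbs.
// Assumes an RC queue pair already in RTS and an 8-byte local landing
// buffer registered in `local_mr` (receives the pre-add remote value).
#include <infiniband/verbs.h>
#include <cstdint>

int post_fetch_add(ibv_qp* qp, ibv_mr* local_mr, uint64_t* local_buf,
                   uint64_t remote_addr, uint32_t rkey, uint64_t add_val) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uint64_t>(local_buf);
  sge.length = sizeof(uint64_t);
  sge.lkey = local_mr->lkey;

  ibv_send_wr wr{};
  wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;  // NIC-side atomic, no CPU emulation
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.atomic.remote_addr = remote_addr;   // must be 8-byte aligned
  wr.wr.atomic.rkey = rkey;
  wr.wr.atomic.compare_add = add_val;       // value to add for fetch-and-add

  ibv_send_wr* bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr);
}
```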

Fixes # (issue)

  • Hang on the 4-node MI325X test bed when running with --pressure-test-mode

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@MaoZiming
Member Author

cc @YangZhou1997 @zhenhuang12
The stability issue on AMD+CX7 seems to be solved by just posting RDMA atomics and using cuda/hipMalloc for the atomic buffer. Previously the issue could be reproduced on 4 nodes (not 2 nodes) with the --pressure-test-mode flag for test_internode.py.
For context, we were using CPU-emulated atomics plus cuda/hipHostMalloc because EFA NICs do not support atomics.
My suspicion is that cuda/hipHostMalloc somehow interferes with the atomic semantics. I am not sure whether this problem is also present on our H200+EFA testbeds; we should double-check once we get the servers back.
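
As a rough illustration of the cuda/hipMalloc change (a sketch, not the buffer code in this PR), the atomic landing buffer can live in device memory and be registered for remote atomics, assuming a GPUDirect/peer-memory-capable verbs stack:

```cpp
// Sketch only: allocate the atomic landing buffer with hipMalloc (device
// memory) instead of hipHostMalloc (pinned host memory) and register it
// so NIC-side atomics land directly in HBM. Assumes the verbs stack can
// register GPU memory (peer-memory / GPUDirect-style support).
#include <hip/hip_runtime.h>
#include <infiniband/verbs.h>

struct AtomicBuffer {
  void* ptr = nullptr;
  ibv_mr* mr = nullptr;
};

AtomicBuffer alloc_atomic_buffer(ibv_pd* pd, size_t bytes) {
  AtomicBuffer buf;
  if (hipMalloc(&buf.ptr, bytes) != hipSuccess) return buf;
  hipMemset(buf.ptr, 0, bytes);  // counters start at zero
  buf.mr = ibv_reg_mr(pd, buf.ptr, bytes,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_ATOMIC);
  return buf;
}
```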

@zhenhuang12
Collaborator

zhenhuang12 commented Dec 30, 2025

> cc @YangZhou1997 @zhenhuang12 I think the stability issue on AMD+CX7 seems to be solved with just posting RDMA atomic + using cuda/hipMalloc for atomic buffer. Previously the issue can be replicated on 4 nodes (not 2 nodes) and using the --pressure-test-mode flag for test_internode.py. For context, we were using CPU-emulated atomic + cuda/hipHostMalloc as EFA NICs do not support atomics. My suspicion is that cuda/hipHostMalloc is somehow messing up with the atomic semantics. I am not sure if this problem is present on our H200+EFA testbeds, we might need to double check (when we got the servers back).

Hi @MaoZiming, thanks for your pull request! I've noticed two situations:

  • I've run the pressure test 3 times on CX7 with MI300X, but it got stuck at seed 694 every time. The following is the error log:
[rank2]:[E1229 08:59:52.703390097 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 2] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. 
[rank2]:[E1229 08:59:52.703515470 ProcessGroupNCCL.cpp:1611] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1, only active collectives: 0
[wait_until_cmd_consumed nvl:[wait_until_cmd_consumed nvl:0[wait_until_cmd_consumed nvl:3 cmd:[wait_until_cmd_consumed nvl:4 cmd:2 label: cmd:-144] waiting slot=4 label:[wait_until_cmd_consumed nvl: cmd:57774891-1 label:45
[wait_until_cmd_consumed nvl:-1] waiting slot= label: cmd:] waiting slot=14-15777591157773334 label:] waiting slot= cmd:
-1
577763664] waiting slot=
 label:57775819-1
] waiting slot=[wait_until_cmd_consumed nvl:57774384
6 cmd:4 label:-1] waiting slot=57773361
[wait_until_cmd_consumed nvl:7 cmd:4 label:-1] waiting slot=57774661
  • Combine performance improves a lot:
    Type        Combine #EP   Bottleneck bandwidth
    Internode   16            45 GB/s -> 67 GB/s (RDMA)
    Internode   32            50 GB/s -> 60 GB/s (RDMA)

cc @YangZhou1997

@MaoZiming
Member Author

@zhenhuang12 Thank you! I tried it today.
I did see the receiver timeout, and it also shows an illegal memory access:

DeepEP dispatch NVL receiver timeout, channel: DeepEP dispatch NVL receiver timeout, channel: DeepEP dispatch NVL receiver timeout, channel: 111, RDMA: , RDMA: , RDMA: 000, nvl: , nvl: , nvl: 304, src NVL: , src NVL: 7DeepEP dispatch NVL receiver timeout, channel: , src NVL: 7, head: DeepEP dispatch NVL receiver timeout, channel: , head: DeepEP dispatch NVL receiver timeout, channel: 1067, tail: , head: 188DeepEP dispatch NVL receiver timeout, channel: 1110689DeepEP dispatch NVL receiver timeout, channel: , tail: , RDMA: , RDMA: 1, RDMA: , num_tokens_to_recv_original: , tail: 88891000, RDMA: 137, num_tokens_to_recv_original: , num_tokens_to_recv_original: 113, RDMA: , nvl: , nvl: , nvl: 0, last_recv_token_idx: 108, last_recv_token_idx: 0257, nvl: 28841, last_recv_token_idx: 2879428951, nvl: , src NVL: , src NVL: 1, src NVL: 6, next_expected_token_idx: , next_expected_token_idx: , next_expected_token_idx: 77, src NVL: 7, src NVL: 288422879528952
, head: , head: 7, head: 789

10583, head: , head: , tail: , tail: , tail: 8589102, tail: 10583, num_tokens_to_recv_original: , tail: 85, num_tokens_to_recv_original: , num_tokens_to_recv_original: 109102, num_tokens_to_recv_original: 131107, last_recv_token_idx: , num_tokens_to_recv_original: 113, last_recv_token_idx: , last_recv_token_idx: 28883127, last_recv_token_idx: 2878428721, next_expected_token_idx: , last_recv_token_idx: 29026, next_expected_token_idx: , next_expected_token_idx: 2888428704, next_expected_token_idx: 2878528722
, next_expected_token_idx: 29027

28705

/home/yangzhou/miniconda3/envs/ziming/lib/python3.12/site-packages/torch/utils/_device.py:109: UserWarning: HIP warning: an illegal memory access was encountered (Triggered internally at /pytorch/aten/src/ATen/hip/impl/HIPGuardImplMasqueradingAsCUDA.h:83.)
  return func(*args, **kwargs)
[rank7]: Traceback (most recent call last):
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/test_internode.py", line 610, in <module>
[rank7]:     test_loop(local_rank, num_local_ranks, num_nodes, args)
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/test_internode.py", line 536, in test_loop
[rank7]:     current_hash += test_main(
[rank7]:                     ^^^^^^^^^^
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/test_internode.py", line 233, in test_main
[rank7]:     hash_value += hash_tensor(recv_x)
[rank7]:                   ^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/utils.py", line 684, in hash_tensor
[rank7]:     return t.view(torch.int).sum().item()
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/home/yangzhou/miniconda3/envs/ziming/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
[rank7]:     return func(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^
[rank7]: torch.AcceleratorError: HIP error: an illegal memory access was encountered
[rank7]: Search for `hipErrorIllegalAddress' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
[rank7]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank7]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank7]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

@MaoZiming
Member Author

More specifically, I saw

Kernel Name: _ZN4uccl9internode8dispatchILb0ELi4ELb0ELi16384ELi7ELi4EEEvP15HIP_vector_typeIiLj4EEPfPlS5_PNS0_10SourceMetaEPKS3_PKfPKlSC_PiSF_SF_SF_PKiSH_SH_SH_PKbiiiiiiiPviiPSK_iiiiPKmiSK_
VGPU=0x44287190 SWq=0x7f545cf9c000, HWq=0x7f5400200000, id=5
	Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=0
	grid=[65536, 1, 1], workgroup=[1024, 1, 1]
	private_seg_size=0, group_seg_size=268
	kernel_obj=0x7f47426243c0, kernarg_address=0x0x7f535ad00d00
	completion_signal=0x0, correlation_id=0
	rptr=5492, wptr=5495
:0:rocdevice.cpp            :3676: 5731423507945 us:  Callback: Queue 0x7f5400200000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

@YangZhou1997
Member

YangZhou1997 commented Jan 6, 2026

@zhenhuang12 @MaoZiming, I made a few edits to the RDMA config at https://github.com/uccl-project/uccl/pull/619/files. Now I find that the run always crashes with "transport retry counter exceeded" on our Vultr AMD+Broadcom cluster, as also observed in #617. This error usually means too many packets are being lost in the network and the NIC cannot recover from them. So I suspect the root cause is the Broadcom NIC or the network's lossless configuration. Can you give my branch ep-debug-amd-mem-consistency-yang a try on your AMD+Broadcom cluster?
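
For context, "transport retry counter exceeded" fires when an RC QP exhausts its retransmission budget. A hedged sketch of the knobs involved when moving a QP to RTS (values here are illustrative, not the ones changed in #619):

```cpp
// Sketch only: the QP attributes that bound how long the NIC retries
// before reporting "transport retry counter exceeded".
#include <infiniband/verbs.h>
#include <cstdint>

int move_qp_to_rts(ibv_qp* qp, uint32_t sq_psn) {
  ibv_qp_attr attr{};
  attr.qp_state = IBV_QPS_RTS;
  attr.timeout = 18;       // ACK timeout ~1 s (4.096 us * 2^18)
  attr.retry_cnt = 7;      // transport retries before the QP goes to error
  attr.rnr_retry = 7;      // 7 means retry indefinitely on RNR NAKs
  attr.sq_psn = sq_psn;
  attr.max_rd_atomic = 16; // outstanding RDMA reads/atomics
  return ibv_modify_qp(qp, &attr,
                       IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                           IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                           IBV_QP_MAX_QP_RD_ATOMIC);
}
```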

@zhenhuang12
Collaborator

> @zhenhuang12 @MaoZiming , I tried to make a few edits on RDMA config at https://github.com/uccl-project/uccl/pull/619/files. Now I find that the run always crashes on "transport retry counter exceeded" in our Vultr AMD+Broadcom cluster, as also observed in #617. This error usually means too many packets get lost in the network, and the NIC cannot handle them well. So I guess the root cause might be the Broadcom NIC or the network's lossless configuration. Can you give my branch ep-debug-amd-mem-consistency-yang a try on your AMD+Broadcom?

Thanks! I will give it a try on MI325X + Broadcom soon.

@zhenhuang12
Collaborator

> @zhenhuang12 @MaoZiming , I tried to make a few edits on RDMA config at https://github.com/uccl-project/uccl/pull/619/files. Now I find that the run always crashes on "transport retry counter exceeded" in our Vultr AMD+Broadcom cluster, as also observed in #617. This error usually means too many packets get lost in the network, and the NIC cannot handle them well. So I guess the root cause might be the Broadcom NIC or the network's lossless configuration. Can you give my branch ep-debug-amd-mem-consistency-yang a try on your AMD+Broadcom?

Hi @YangZhou1997, I ran the 4-node pressure test, but it failed at seed 0 with a DeepEP dispatch forwarder timeout (RDMA meta) error.

[testing] Running with BF16, with top-k (async=True, previous=True) ...
[testing] Running with FP8, without top-k (async=True, previous=True) ...
[testing] Running with FP8, with top-k (async=True, previous=True) ...


[config] num_tokens=4096, hidden=7168, num_topk_groups=4, num_topk=8
[layout] Kernel performance: 0.077 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ...
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 19, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 31, RDMA: 1, nvl: 5, src RDMA lane: 3, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 19, RDMA: 1, nvl: 6, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 1, src RDMA: 3, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 15, RDMA: 1, nvl: 2, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 19, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 3, start: 0, end: 0

cc @MaoZiming

@zhenhuang12
Collaborator

Hi @YangZhou1997 @MaoZiming, I added cudaMemPrefetchAsync back on rocm7.1, and it seems to work well. Please give it a try.
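
For anyone reproducing this, a minimal sketch of the kind of prefetch call being described (assuming a managed allocation; names and sizes are illustrative, not the exact change):

```cpp
// Sketch only: prefetch a managed allocation to the GPU before kernels
// touch it, i.e. the hip/cudaMemPrefetchAsync call being discussed.
#include <hip/hip_runtime.h>

void* alloc_and_prefetch(size_t bytes, int device, hipStream_t stream) {
  void* ptr = nullptr;
  if (hipMallocManaged(&ptr, bytes) != hipSuccess) return nullptr;
  // Requires managed-memory prefetch support (reported working on rocm7.1).
  hipMemPrefetchAsync(ptr, bytes, device, stream);
  return ptr;
}
```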

RUN python${PY_VER} -m pip install --no-cache-dir build auditwheel pybind11

RUN python${PY_VER} -m pip install --no-cache-dir --pre torch torchvision \
--index-url https://download.pytorch.org/whl/nightly/rocm7.0
Member Author

Does rocm7.0 vs. rocm7.1 matter?
