
Conversation

@MaoZiming
Member

MaoZiming commented Dec 23, 2025

Description

Please include a summary of the changes and the related issue.

  • Added post_atomic_operations_native_rdma, which posts RDMA atomics directly on the NIC. On its own (with the atomic buffer still coming from cuda/hipHostMalloc) this does not solve the problem; see the sketch after this list.
  • Use cudaMalloc for atomic_buffer_ptr_.
  • Added an optional SOFTWARE_ORDERING flag to test reordering on non-EFA platforms. Currently disabled, since reordering turned out not to be the problem.
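
For reference, a minimal sketch (assuming libibverbs and an already-connected RC queue pair; names are illustrative and not the actual UCCL code) of what posting a NIC-native fetch-and-add looks like:

```cpp
// Sketch only: post a NIC-native 64-bit fetch-and-add with libibverbs.
// Assumes an RC queue pair already in RTS and an 8-byte local landing
// buffer registered in `local_mr` (receives the pre-add remote value).
#include <infiniband/verbs.h>
#include <cstdint>

int post_fetch_add(ibv_qp* qp, ibv_mr* local_mr, uint64_t* local_buf,
                   uint64_t remote_addr, uint32_t rkey, uint64_t add_val) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uint64_t>(local_buf);
  sge.length = sizeof(uint64_t);
  sge.lkey = local_mr->lkey;

  ibv_send_wr wr{};
  wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;  // NIC-side atomic, no CPU emulation
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.atomic.remote_addr = remote_addr;   // must be 8-byte aligned
  wr.wr.atomic.rkey = rkey;
  wr.wr.atomic.compare_add = add_val;       // value to add for fetch-and-add

  ibv_send_wr* bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr);
}
```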

Fixes # (issue)

  • Hang on the 4-node MI325X test bed when running with --pressure-test-mode

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@MaoZiming
Member Author

cc @YangZhou1997 @zhenhuang12
The stability issue on AMD+CX7 seems to be solved by just posting RDMA atomics and using cuda/hipMalloc for the atomic buffer. Previously the issue could be reproduced on 4 nodes (not 2 nodes) with the --pressure-test-mode flag for test_internode.py.
For context, we were using CPU-emulated atomics plus cuda/hipHostMalloc because EFA NICs do not support atomics.
My suspicion is that cuda/hipHostMalloc somehow interferes with the atomic semantics. I am not sure whether this problem is also present on our H200+EFA testbeds; we should double-check once we get the servers back.
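
As a rough illustration of the cuda/hipMalloc change (a sketch, not the buffer code in this PR), the atomic landing buffer can live in device memory and be registered for remote atomics, assuming a GPUDirect/peer-memory-capable verbs stack:

```cpp
// Sketch only: allocate the atomic landing buffer with hipMalloc (device
// memory) instead of hipHostMalloc (pinned host memory) and register it
// so NIC-side atomics land directly in HBM. Assumes the verbs stack can
// register GPU memory (peer-memory / GPUDirect-style support).
#include <hip/hip_runtime.h>
#include <infiniband/verbs.h>

struct AtomicBuffer {
  void* ptr = nullptr;
  ibv_mr* mr = nullptr;
};

AtomicBuffer alloc_atomic_buffer(ibv_pd* pd, size_t bytes) {
  AtomicBuffer buf;
  if (hipMalloc(&buf.ptr, bytes) != hipSuccess) return buf;
  hipMemset(buf.ptr, 0, bytes);  // counters start at zero
  buf.mr = ibv_reg_mr(pd, buf.ptr, bytes,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_ATOMIC);
  return buf;
}
```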

@zhenhuang12
Collaborator

zhenhuang12 commented Dec 30, 2025

> cc @YangZhou1997 @zhenhuang12 I think the stability issue on AMD+CX7 seems to be solved with just posting RDMA atomic + using cuda/hipMalloc for atomic buffer. Previously the issue can be replicated on 4 nodes (not 2 nodes) and using the --pressure-test-mode flag for test_internode.py. For context, we were using CPU-emulated atomic + cuda/hipHostMalloc as EFA NICs do not support atomics. My suspicion is that cuda/hipHostMalloc is somehow messing up with the atomic semantics. I am not sure if this problem is present on our H200+EFA testbeds, we might need to double check (when we got the servers back).

Hi @MaoZiming, thanks for your pull request! I've noticed two situations:

  • I've run the pressure test 3 times on CX7 with MI300X, but it got stuck at seed 694 every time. The following is the error log:
[rank2]:[E1229 08:59:52.703390097 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 2] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. 
[rank2]:[E1229 08:59:52.703515470 ProcessGroupNCCL.cpp:1611] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1, only active collectives: 0
[wait_until_cmd_consumed nvl:[wait_until_cmd_consumed nvl:0[wait_until_cmd_consumed nvl:3 cmd:[wait_until_cmd_consumed nvl:4 cmd:2 label: cmd:-144] waiting slot=4 label:[wait_until_cmd_consumed nvl: cmd:57774891-1 label:45
[wait_until_cmd_consumed nvl:-1] waiting slot= label: cmd:] waiting slot=14-15777591157773334 label:] waiting slot= cmd:
-1
577763664] waiting slot=
 label:57775819-1
] waiting slot=[wait_until_cmd_consumed nvl:57774384
6 cmd:4 label:-1] waiting slot=57773361
[wait_until_cmd_consumed nvl:7 cmd:4 label:-1] waiting slot=57774661
  • Combine performance improves a lot:
    Type        Combine #EP   Bottleneck bandwidth
    Internode   16            45 GB/s -> 67 GB/s (RDMA)
    Internode   32            50 GB/s -> 60 GB/s (RDMA)

cc @YangZhou1997

@MaoZiming
Member Author

@zhenhuang12 Thank you! I tried it today.
I did see the receiver timeout, and it also shows an illegal memory access:

DeepEP dispatch NVL receiver timeout, channel: DeepEP dispatch NVL receiver timeout, channel: DeepEP dispatch NVL receiver timeout, channel: 111, RDMA: , RDMA: , RDMA: 000, nvl: , nvl: , nvl: 304, src NVL: , src NVL: 7DeepEP dispatch NVL receiver timeout, channel: , src NVL: 7, head: DeepEP dispatch NVL receiver timeout, channel: , head: DeepEP dispatch NVL receiver timeout, channel: 1067, tail: , head: 188DeepEP dispatch NVL receiver timeout, channel: 1110689DeepEP dispatch NVL receiver timeout, channel: , tail: , RDMA: , RDMA: 1, RDMA: , num_tokens_to_recv_original: , tail: 88891000, RDMA: 137, num_tokens_to_recv_original: , num_tokens_to_recv_original: 113, RDMA: , nvl: , nvl: , nvl: 0, last_recv_token_idx: 108, last_recv_token_idx: 0257, nvl: 28841, last_recv_token_idx: 2879428951, nvl: , src NVL: , src NVL: 1, src NVL: 6, next_expected_token_idx: , next_expected_token_idx: , next_expected_token_idx: 77, src NVL: 7, src NVL: 288422879528952
, head: , head: 7, head: 789

10583, head: , head: , tail: , tail: , tail: 8589102, tail: 10583, num_tokens_to_recv_original: , tail: 85, num_tokens_to_recv_original: , num_tokens_to_recv_original: 109102, num_tokens_to_recv_original: 131107, last_recv_token_idx: , num_tokens_to_recv_original: 113, last_recv_token_idx: , last_recv_token_idx: 28883127, last_recv_token_idx: 2878428721, next_expected_token_idx: , last_recv_token_idx: 29026, next_expected_token_idx: , next_expected_token_idx: 2888428704, next_expected_token_idx: 2878528722
, next_expected_token_idx: 29027

28705

/home/yangzhou/miniconda3/envs/ziming/lib/python3.12/site-packages/torch/utils/_device.py:109: UserWarning: HIP warning: an illegal memory access was encountered (Triggered internally at /pytorch/aten/src/ATen/hip/impl/HIPGuardImplMasqueradingAsCUDA.h:83.)
  return func(*args, **kwargs)
[rank7]: Traceback (most recent call last):
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/test_internode.py", line 610, in <module>
[rank7]:     test_loop(local_rank, num_local_ranks, num_nodes, args)
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/test_internode.py", line 536, in test_loop
[rank7]:     current_hash += test_main(
[rank7]:                     ^^^^^^^^^^
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/test_internode.py", line 233, in test_main
[rank7]:     hash_value += hash_tensor(recv_x)
[rank7]:                   ^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/home/yangzhou/ziming/uccl/ep/bench/utils.py", line 684, in hash_tensor
[rank7]:     return t.view(torch.int).sum().item()
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]:   File "/home/yangzhou/miniconda3/envs/ziming/lib/python3.12/site-packages/torch/utils/_device.py", line 109, in __torch_function__
[rank7]:     return func(*args, **kwargs)
[rank7]:            ^^^^^^^^^^^^^^^^^^^^^
[rank7]: torch.AcceleratorError: HIP error: an illegal memory access was encountered
[rank7]: Search for `hipErrorIllegalAddress' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
[rank7]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank7]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank7]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

@MaoZiming
Member Author

More specifically, I saw

Kernel Name: _ZN4uccl9internode8dispatchILb0ELi4ELb0ELi16384ELi7ELi4EEEvP15HIP_vector_typeIiLj4EEPfPlS5_PNS0_10SourceMetaEPKS3_PKfPKlSC_PiSF_SF_SF_PKiSH_SH_SH_PKbiiiiiiiPviiPSK_iiiiPKmiSK_
VGPU=0x44287190 SWq=0x7f545cf9c000, HWq=0x7f5400200000, id=5
	Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=0
	grid=[65536, 1, 1], workgroup=[1024, 1, 1]
	private_seg_size=0, group_seg_size=268
	kernel_obj=0x7f47426243c0, kernarg_address=0x0x7f535ad00d00
	completion_signal=0x0, correlation_id=0
	rptr=5492, wptr=5495
:0:rocdevice.cpp            :3676: 5731423507945 us:  Callback: Queue 0x7f5400200000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

@YangZhou1997
Member

YangZhou1997 commented Jan 6, 2026

@zhenhuang12 @MaoZiming, I made a few edits to the RDMA config at https://github.com/uccl-project/uccl/pull/619/files. Now I find that the run always crashes with "transport retry counter exceeded" on our Vultr AMD+Broadcom cluster, as also observed in #617. This error usually means too many packets are being lost in the network and the NIC cannot recover from them. So I suspect the root cause is the Broadcom NIC or the network's lossless configuration. Can you give my branch ep-debug-amd-mem-consistency-yang a try on your AMD+Broadcom cluster?
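
For context, "transport retry counter exceeded" fires when an RC QP exhausts its retransmission budget. A hedged sketch of the knobs involved when moving a QP to RTS (values here are illustrative, not the ones changed in #619):

```cpp
// Sketch only: the QP attributes that bound how long the NIC retries
// before reporting "transport retry counter exceeded".
#include <infiniband/verbs.h>
#include <cstdint>

int move_qp_to_rts(ibv_qp* qp, uint32_t sq_psn) {
  ibv_qp_attr attr{};
  attr.qp_state = IBV_QPS_RTS;
  attr.timeout = 18;       // ACK timeout ~1 s (4.096 us * 2^18)
  attr.retry_cnt = 7;      // transport retries before the QP goes to error
  attr.rnr_retry = 7;      // 7 means retry indefinitely on RNR NAKs
  attr.sq_psn = sq_psn;
  attr.max_rd_atomic = 16; // outstanding RDMA reads/atomics
  return ibv_modify_qp(qp, &attr,
                       IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                           IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                           IBV_QP_MAX_QP_RD_ATOMIC);
}
```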

@zhenhuang12
Collaborator

> @zhenhuang12 @MaoZiming , I tried to make a few edits on RDMA config at https://github.com/uccl-project/uccl/pull/619/files. Now I find that the run always crashes on "transport retry counter exceeded" in our Vultr AMD+Broadcom cluster, as also observed in #617. This error usually means too many packets get lost in the network, and the NIC cannot handle them well. So I guess the root cause might be the Broadcom NIC or the network's lossless configuration. Can you give my branch ep-debug-amd-mem-consistency-yang a try on your AMD+Broadcom?

Thanks! I will give it a try on MI325X + Broadcom soon.

@zhenhuang12
Collaborator

> @zhenhuang12 @MaoZiming , I tried to make a few edits on RDMA config at https://github.com/uccl-project/uccl/pull/619/files. Now I find that the run always crashes on "transport retry counter exceeded" in our Vultr AMD+Broadcom cluster, as also observed in #617. This error usually means too many packets get lost in the network, and the NIC cannot handle them well. So I guess the root cause might be the Broadcom NIC or the network's lossless configuration. Can you give my branch ep-debug-amd-mem-consistency-yang a try on your AMD+Broadcom?

Hi @YangZhou1997, I ran the 4-node pressure test, but it failed at seed 0 with a DeepEP dispatch forwarder timeout (RDMA meta) error.

[testing] Running with BF16, with top-k (async=True, previous=True) ...
[testing] Running with FP8, without top-k (async=True, previous=True) ...
[testing] Running with FP8, with top-k (async=True, previous=True) ...


[config] num_tokens=4096, hidden=7168, num_topk_groups=4, num_topk=8
[layout] Kernel performance: 0.077 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ...
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 1, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 19, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 6, start: 0, end: 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 31, RDMA: 1, nvl: 5, src RDMA lane: 3, dst NVL: 6, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 7, meta: 0, 0, 0, 0
DeepEP dispatch forwarder timeout (RDMA meta), channel: 0, RDMA: 1, nvl: 7, src RDMA lane: 3, dst NVL: 0, meta: 0, 0, 0, 0
DeepEP dispatch NVL receiver timeout, channel: 19, RDMA: 1, nvl: 6, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 4, RDMA: 1, nvl: 6, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 8, RDMA: 1, nvl: 1, src RDMA: 3, src nvl: 5, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 0, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 6, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 15, RDMA: 1, nvl: 2, src RDMA: 3, src nvl: 0, start: 0, end: 0
DeepEP dispatch NVL receiver timeout, channel: 19, RDMA: 1, nvl: 7, src RDMA: 3, src nvl: 3, start: 0, end: 0

cc @MaoZiming

@zhenhuang12
Collaborator

Hi @YangZhou1997 @MaoZiming, I added cudaMemPrefetchAsync back on rocm7.1, and it seems to work well. Please give it a try.
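
For anyone reproducing this, a minimal sketch of the kind of prefetch call being described (assuming a managed allocation; names and sizes are illustrative, not the exact change):

```cpp
// Sketch only: prefetch a managed allocation to the GPU before kernels
// touch it, i.e. the hip/cudaMemPrefetchAsync call being discussed.
#include <hip/hip_runtime.h>

void* alloc_and_prefetch(size_t bytes, int device, hipStream_t stream) {
  void* ptr = nullptr;
  if (hipMallocManaged(&ptr, bytes) != hipSuccess) return nullptr;
  // Requires managed-memory prefetch support (reported working on rocm7.1).
  hipMemPrefetchAsync(ptr, bytes, device, stream);
  return ptr;
}
```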

RUN python${PY_VER} -m pip install --no-cache-dir build auditwheel pybind11

RUN python${PY_VER} -m pip install --no-cache-dir --pre torch torchvision \
--index-url https://download.pytorch.org/whl/nightly/rocm7.0
Member Author

Does rocm7.0 vs. rocm7.1 matter?
