[EP] debug AMD mem consistency issues #600
base: main
cc @YangZhou1997 @zhenhuang12
Hi @MaoZiming, thanks for your pull request! I've noticed two situations:
@zhenhuang12 Thank you! I tried it today.
More specifically, I saw
… into ep-efa-stability
[EP] EFA stability test
@zhenhuang12 @MaoZiming, I made a few edits to the RDMA config at https://github.com/uccl-project/uccl/pull/619/files. With them, the run always crashes with "transport retry counter exceeded" on our Vultr AMD+Broadcom cluster, as also observed in #617. This error usually means that too many packets are being lost in the network and the NIC cannot recover from the loss. So I suspect the root cause is either the Broadcom NIC or the network's lossless configuration. Can you give my branch
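For context on the "transport retry counter exceeded" error mentioned above: on a reliable-connection queue pair, the local ACK timeout field encodes roughly 4.096 µs × 2^timeout, and the sender gives up after `retry_cnt` retransmissions. The sketch below estimates the resulting time budget. The values 14 and 7 (and 18 and 7) are illustrative assumptions, not the settings used in this PR or in #619, and real NICs may round the timeout up or enforce vendor-specific minimums, so treat the numbers as a lower-bound estimate only.

```python
# Rough estimate of how long an RC queue pair keeps retrying before it
# reports "transport retry counter exceeded" (IBV_WC_RETRY_EXC_ERR).
# Assumption: the timeout / retry_cnt values used below are illustrative only.

ACK_TIMEOUT_BASE_US = 4.096  # base unit of the local ACK timeout field


def ack_timeout_seconds(timeout: int) -> float:
    """Nominal local ACK timeout for the 5-bit `timeout` field (0 = infinite)."""
    if not 0 <= timeout <= 31:
        raise ValueError("timeout is a 5-bit field (0..31)")
    if timeout == 0:
        return float("inf")
    return ACK_TIMEOUT_BASE_US * (2 ** timeout) / 1e6


def retry_budget_seconds(timeout: int, retry_cnt: int) -> float:
    """Approximate time until the retry counter is exhausted."""
    if not 0 <= retry_cnt <= 7:
        raise ValueError("retry_cnt is a 3-bit field (0..7)")
    # One original transmission plus retry_cnt retransmissions.
    return (retry_cnt + 1) * ack_timeout_seconds(timeout)


if __name__ == "__main__":
    for timeout, retry_cnt in [(14, 7), (18, 7)]:
        budget = retry_budget_seconds(timeout, retry_cnt)
        print(f"timeout={timeout} retry_cnt={retry_cnt}: ~{budget:.3f} s")
```

If the estimated budget is much larger than any plausible congestion stall, raising the timeout further is unlikely to help, which would be consistent with the suspicion that packets are actually being dropped.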
Thanks! I will try it on MI325X + Broadcom soon.
Hi @YangZhou1997, I ran the 4N pressure test, but it failed at seed 0 with cc @MaoZiming
Hi @YangZhou1997 @MaoZiming, I add
RUN python${PY_VER} -m pip install --no-cache-dir build auditwheel pybind11

RUN python${PY_VER} -m pip install --no-cache-dir --pre torch torchvision \
    --index-url https://download.pytorch.org/whl/nightly/rocm7.0
Does rocm7.0 vs. rocm7.1 matter?
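One way to check whether the wheel's ROCm tag lines up with what is installed is to compare the tag in the index URL against `torch.version.hip`, which ROCm builds of PyTorch expose (it is `None` on other builds). The helper names below are hypothetical and the comparison only looks at major.minor; whether a 7.0 vs. 7.1 mismatch actually matters for this PR is not something this sketch can answer.

```python
import re


def rocm_major_minor(text: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like 'rocm7.0' or '7.0.51831-abc'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def wheel_matches_runtime(index_url: str, hip_version: str) -> bool:
    """Compare the ROCm tag at the end of the index URL with a HIP version."""
    tag = index_url.rstrip("/").rsplit("/", 1)[-1]  # e.g. 'rocm7.0'
    return rocm_major_minor(tag) == rocm_major_minor(hip_version)


if __name__ == "__main__":
    url = "https://download.pytorch.org/whl/nightly/rocm7.0"
    try:
        import torch  # only needed for the live check
    except ImportError:
        print("torch is not installed; skipping live check")
    else:
        hip = torch.version.hip  # None on non-ROCm builds
        if hip is None:
            print("this torch build was not compiled for ROCm")
        else:
            print(f"torch HIP version {hip}, matches index tag: "
                  f"{wheel_matches_runtime(url, hip)}")
```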
Description
Please include a summary of the changes and the related issue.
- post_atomic_operations_native_rdma, which directly posts RDMA atomics. On its own, posting to cuda/hipHostMalloc memory won't solve the problem.
- cudaMalloc for atomic_buffer_ptr_
- SOFTWARE_ORDERING to test reordering on non-EFA platforms. Currently disabled, as I found this is not the problem.

Fixes # (issue)

- --pressure-test-mode

Type of Change
How Has This Been Tested?
Include any tests here.
Checklist
- format.sh.
- build_and_install.sh to verify compilation.