ucx: meet at the pmix fence before disconnecting to avoid an infinite loop #13519
base: main
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: b633c68: ucx: meet at barrier before disconnecting, not aft...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
@jeking3 Thanks for the PR! Can you add the sign-off message? (See https://docs.open-mpi.org/en/v5.0.x/contributing.html#open-source-contributions for details.)
Call pmix_fence while we still have connectivity, because after we disconnect we may never get to being fenced.
Signed-off-by: Jim King <jimk@nvidia.com>
That was done, by the way.
bosilca left a comment
I understand this PR seems to address an MPI_Finalize issue, but I don't think this is really the case. Instead, it is hiding away the real root cause.
Let me delve a little into the logic here. The main reason for the PMIx fence was to give time (provided by a fence via an external communication framework, instead of a barrier) for all processes to close and clean up their connections/endpoints before returning from the call. To state this clearly: once any process returns from opal_common_ucx_del_procs, we have a guarantee that all processes have destroyed all their connections and joined the fence, which is a strong guarantee for the rest of the OMPI teardown.
With this change we are now at the complete opposite: processes synchronize before starting to tear down their connections, but then a process returning from the call provides no global guarantee about the others. In this particular scenario there is no need for a fence at all; a simple barrier on the communicator being destroyed (MPI_COMM_WORLD, I think) would provide the same synchronization. Clearly this is the opposite of how this entire stage is expected to work.
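To make the two orderings concrete, here is a minimal sketch. The function names opal_common_ucx_del_procs and opal_common_ucx_mca_pmix_fence come from the discussion above; the prototypes, argument lists, and error handling are simplified stand-ins, not the actual Open MPI code.

```c
/* Minimal sketch of the two orderings under discussion.
 * The prototypes below are simplified stand-ins: the real Open MPI
 * functions take more arguments (procs, worker, etc.). */
#define OPAL_SUCCESS 0                      /* assumed value */
int opal_common_ucx_del_procs(void);        /* assumed stub */
int opal_common_ucx_mca_pmix_fence(void);   /* assumed stub */

/* Original ordering: disconnect first, then fence.  A process returning
 * from this sequence knows that every process has already torn down its
 * endpoints, a strong guarantee for the rest of the teardown, but the
 * fence itself must make progress after UCX connectivity is gone. */
int teardown_original(void)
{
    int rc = opal_common_ucx_del_procs();
    if (OPAL_SUCCESS != rc) {
        return rc;
    }
    return opal_common_ucx_mca_pmix_fence();
}

/* Ordering proposed in this PR: fence first, then disconnect.  Processes
 * synchronize while connectivity is still up, but returning from this
 * sequence says nothing about whether the other processes have finished
 * their own teardown. */
int teardown_proposed(void)
{
    int rc = opal_common_ucx_mca_pmix_fence();
    if (OPAL_SUCCESS != rc) {
        return rc;
    }
    return opal_common_ucx_del_procs();
}
```

Under the proposed ordering, as noted above, an MPI_Barrier on the communicator being destroyed would provide equivalent synchronization, which is why a PMIx fence at that point is being questioned.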
Thanks for the explanation. This change has resolved the shutdown issue across hundreds of runs, whereas previously it failed almost every time, going into an infinite loop. Perhaps the second call to …
Adding more fences was never a solid way to fix conceptual bugs. We care about scale, and adding more fences on the lowest-performance network is not desirable, especially at scale. I know a lot of people internally (including myself) who run similar jobs regularly and have never had any issues with this. We might need to dig a little more to understand the root cause.
While moving a job from a small number of GPU nodes to a larger number of CPU nodes, I was able to reliably reproduce #11087 in my environment. In the debugger, I found that opal_common_ucx_mca_pmix_fence was spinning forever, waiting to become fenced. Calls down to the UCX layer showed that it had no pending operations, no active endpoints, and no outstanding flushes. Given that UCX is the transport that allows the processes to synchronize in this case, it doesn't make sense to fence after disconnecting. Reversing the order of operations resolved the shutdown hang.

This fixes #11087.
This might fix openucx/ucx#8738
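For context, the spin described above looks roughly like the sketch below: the rank issues a non-blocking out-of-band (PMIx) fence and then drives the UCX worker until a completion callback fires. The structure and the issue_nonblocking_pmix_fence helper are illustrative assumptions, not the actual opal_common_ucx_mca_pmix_fence implementation.

```c
#include <ucp/api/ucp.h>

/* Illustrative sketch of the spin observed in the debugger; not the
 * actual Open MPI code. */

/* Hypothetical stand-in for the non-blocking PMIx fence request. */
extern int issue_nonblocking_pmix_fence(void (*cb)(void *arg), void *arg);

static void fence_complete_cb(void *arg)
{
    *(volatile int *)arg = 1;   /* mark the fence as completed */
}

static int wait_for_fence(ucp_worker_h worker)
{
    volatile int fenced = 0;

    int rc = issue_nonblocking_pmix_fence(fence_complete_cb, (void *)&fenced);
    if (0 != rc) {
        return rc;
    }

    /* If some peers have already disconnected and never reach the fence,
     * this loop never exits: the local worker has no pending operations,
     * endpoints, or flushes left, so progressing it cannot complete
     * anything that would let the fence finish. */
    while (!fenced) {
        ucp_worker_progress(worker);
    }
    return 0;
}
```

With the ordering proposed in the PR, every rank still has its connections up when it enters the fence, which is the condition the reporter found missing when the hang occurred.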