ucx: meet at the pmix fence before disconnecting to avoid an infinite loop #13519
base: main
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: b633c68: ucx: meet at barrier before disconnecting, not aft...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
@jeking3 Thanks for the PR! Can you add the sign-off message? (See https://docs.open-mpi.org/en/v5.0.x/contributing.html#open-source-contributions for details.)
Call pmix_fence while we still have connectivity, because after we disconnect we may never get to being fenced.
Signed-off-by: Jim King <jimk@nvidia.com>
That was done, by the way.
bosilca left a comment
I understand this PR seems to address an MPI_Finalize issue, but I don't think this is really the case. Instead, it is hiding away the real root cause.
Let me delve a little into the logic here. The main reason for the PMIx fence was to give time (provided by a fence via an external communication framework, instead of a barrier) for all processes to close and clean up their connections/endpoints before returning from the call. To state this clearly: once any process returns from opal_common_ucx_del_procs, we have a guarantee that all processes have destroyed all their connections and joined the fence, which is a strong guarantee for the rest of the OMPI teardown.
With this change we are now at the complete opposite: processes synchronize before starting to tear down their connections, but then a process returning from the call provides no global guarantee about the others. In this particular scenario there is no need for a fence at all; a simple barrier on the communicator being destroyed (MPI_COMM_WORLD, I think) would provide the same synchronization. Clearly this is the opposite of how this entire stage is expected to work.
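To make the two orderings concrete, here is a minimal sketch. The function names opal_common_ucx_del_procs and opal_common_ucx_mca_pmix_fence come from the discussion above; the prototypes, argument lists, and error handling are simplified stand-ins, not the actual Open MPI code.

```c
/* Minimal sketch of the two orderings under discussion.
 * The prototypes below are simplified stand-ins: the real Open MPI
 * functions take more arguments (procs, worker, etc.). */
#define OPAL_SUCCESS 0                      /* assumed value */
int opal_common_ucx_del_procs(void);        /* assumed stub */
int opal_common_ucx_mca_pmix_fence(void);   /* assumed stub */

/* Original ordering: disconnect first, then fence.  A process returning
 * from this sequence knows that every process has already torn down its
 * endpoints, a strong guarantee for the rest of the teardown, but the
 * fence itself must make progress after UCX connectivity is gone. */
int teardown_original(void)
{
    int rc = opal_common_ucx_del_procs();
    if (OPAL_SUCCESS != rc) {
        return rc;
    }
    return opal_common_ucx_mca_pmix_fence();
}

/* Ordering proposed in this PR: fence first, then disconnect.  Processes
 * synchronize while connectivity is still up, but returning from this
 * sequence says nothing about whether the other processes have finished
 * their own teardown. */
int teardown_proposed(void)
{
    int rc = opal_common_ucx_mca_pmix_fence();
    if (OPAL_SUCCESS != rc) {
        return rc;
    }
    return opal_common_ucx_del_procs();
}
```

Under the proposed ordering, as noted above, an MPI_Barrier on the communicator being destroyed would provide equivalent synchronization, which is why a PMIx fence at that point is being questioned.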
Thanks for the explanation. This change has resolved the shutdown issue across hundreds of runs, whereas previously it failed almost every time, going into an infinite loop. Perhaps the second call to …
Adding more fences was never a solid way to fix conceptual bugs. We care about scale, and adding more fences on the lowest-performance network is not desirable, especially at scale. I know a lot of people internally (including myself) who run similar jobs regularly and have never had any issues with this. We might need to dig a little more to understand the root cause.
While moving a job from a small number of GPU nodes to a larger number of CPU nodes, I was able to reliably reproduce #11087 in my environment. In the debugger, I found that opal_common_ucx_mca_pmix_fence was spinning forever, waiting to become fenced. Calls down to the UCX layer showed that it had no pending operations, no active endpoints, and no outstanding flushes. Given that UCX is the transport that allows the processes to synchronize in this case, it doesn't make sense to fence after disconnecting. Reversing the order of operations resolved the shutdown hang.

This fixes #11087.
This might fix openucx/ucx#8738
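For context, the spin described above looks roughly like the sketch below: the rank issues a non-blocking out-of-band (PMIx) fence and then drives the UCX worker until a completion callback fires. The structure and the issue_nonblocking_pmix_fence helper are illustrative assumptions, not the actual opal_common_ucx_mca_pmix_fence implementation.

```c
#include <ucp/api/ucp.h>

/* Illustrative sketch of the spin observed in the debugger; not the
 * actual Open MPI code. */

/* Hypothetical stand-in for the non-blocking PMIx fence request. */
extern int issue_nonblocking_pmix_fence(void (*cb)(void *arg), void *arg);

static void fence_complete_cb(void *arg)
{
    *(volatile int *)arg = 1;   /* mark the fence as completed */
}

static int wait_for_fence(ucp_worker_h worker)
{
    volatile int fenced = 0;

    int rc = issue_nonblocking_pmix_fence(fence_complete_cb, (void *)&fenced);
    if (0 != rc) {
        return rc;
    }

    /* If some peers have already disconnected and never reach the fence,
     * this loop never exits: the local worker has no pending operations,
     * endpoints, or flushes left, so progressing it cannot complete
     * anything that would let the fence finish. */
    while (!fenced) {
        ucp_worker_progress(worker);
    }
    return 0;
}
```

With the ordering proposed in the PR, every rank still has its connections up when it enters the fence, which is the condition the reporter found missing when the hang occurred.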