Add Slurm support to rrun with PMIx-based coordination#775
pentschev wants to merge 65 commits into rapidsai:main
Conversation
nirandaperera left a comment:

Did a partial review
```cpp
std::optional<std::string> getenv_optional(std::string_view name);

/**
 * @brief Parse integer from environment variable.
 *
 * Retrieves an environment variable and parses it as an integer.
 *
 * @param name Name of the environment variable to retrieve.
 * @return Parsed integer value, or std::nullopt if not set.
 * @throws std::runtime_error if the variable is set but cannot be parsed as an integer.
 */
std::optional<int> getenv_int(std::string_view name);
```
Question: can't we use rapidsmpf Options here?
No, we can't use most of the rest of RapidsMPF due to direct or indirect CUDA dependencies, and we can't have any CUDA dependencies in the bootstrapping. In the .cpp files we have comments stating that:

```cpp
// NOTE: Do not use RAPIDSMPF_EXPECTS or RAPIDSMPF_FAIL in this file.
// Using these macros introduces a CUDA dependency via rapidsmpf/error.hpp.
// Prefer throwing standard exceptions instead.
```
Again, all the rrun stuff can't include CUDA headers. So we could refactor things, but it is fiddly.
cpp/src/bootstrap/bootstrap.cpp (outdated)

```cpp
/**
 * @brief Detect backend from environment variables.
 */
Backend detect_backend() {
```
Is this order the same as what's mentioned here https://github.com/rapidsai/rapidsmpf/pull/775/changes#diff-6b0f67c1b74a322dd090ffd62247887e46f1e57e97029778c34db6e341f0e5aeR25-R27 ? 🤔
Good catch! This has changed many times during development; fixed it now.
cpp/src/bootstrap/bootstrap.cpp (outdated)

```
@@ -108,6 +102,41 @@ Context init(Backend backend) {
    }
    break;
```
Can we move these to some detail/ util methods outside the switch-case scope? Then we can switch like:

```cpp
Context init(Backend backend) {
    if (backend == Backend::AUTO) {
        backend = detect_backend();
    }
    // Get rank and nranks based on backend
    switch (backend) {
        case Backend::FILE:
            return file_backend_init();
        // ... one case per backend ...
    }
}
```
cpp/src/bootstrap/bootstrap.cpp (outdated)

Use RAPIDSMPF_EXPECTS?

Can't without refactoring to make rapidsmpf/error.hpp not depend on cuda_runtime_api.h.
cpp/src/bootstrap/bootstrap.cpp (outdated)

```cpp
    throw std::runtime_error(
        "Could not determine rank for Slurm backend. "
        "Ensure you're running with 'srun --mpi=pmix'."
    );
}

try {
    ctx.nranks = get_nranks();
} catch (const std::runtime_error& e) {
    throw std::runtime_error(
        "Could not determine nranks for Slurm backend. "
        "Ensure you're running with 'srun --mpi=pmix'."
    );
}

if (!(ctx.rank >= 0 && ctx.rank < ctx.nranks)) {
    throw std::runtime_error(
        "Invalid rank: " + std::to_string(ctx.rank) + " must be in range [0, "
        + std::to_string(ctx.nranks) + ")"
    );
```
No, see again my previous comment, and also the note at the top of this file: https://github.com/rapidsai/rapidsmpf/pull/775/files#diff-d80df374c64a4d18659a70bf48420afd18a77b6cd650861db904aff4cb867765R19-R21.
cpp/src/bootstrap/bootstrap.cpp (outdated)

```cpp
#ifdef RAPIDSMPF_HAVE_SLURM
    case Backend::SLURM:
        {
            detail::SlurmBackend backend{ctx};
```
Should we create a SlurmBackend for every operation here? Or should we attach it to the ctx with a Backend interface class and a shared_ptr?
I agree, this wasn't ideal and I had plans to address it in a future PR. Since you pointed it out, I'll address it here.
```cpp
void check_pmix_status(pmix_status_t status, std::string const& operation) {
    if (status != PMIX_SUCCESS) {
        throw std::runtime_error(operation + " failed: " + pmix_error_string(status));
    }
}
```
Maybe let's use a helper function like RAPIDSMPF_MPI?
This is the helper function; I'm not sure exactly what you're proposing. Is it just the name not following the same RAPIDSMPF_... pattern, or something else? If it's just the naming pattern I can adjust it, but I tried to avoid using the same pattern, given the CUDA restrictions, to prevent confusion with the "regular" RapidsMPF code.
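For illustration, a macro-style helper in the spirit of RAPIDSMPF_MPI might look like the sketch below; the BOOTSTRAP_PMIX name is hypothetical, and the sketch assumes the PMIx headers and the pmix_error_string helper quoted above are available:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical wrapper macro; captures the failing call's text in the error.
#define BOOTSTRAP_PMIX(call)                              \
    do {                                                  \
        pmix_status_t status_ = (call);                   \
        if (status_ != PMIX_SUCCESS) {                    \
            throw std::runtime_error(                     \
                std::string(#call) + " failed: "          \
                + pmix_error_string(status_)              \
            );                                            \
        }                                                 \
    } while (0)

// Usage: BOOTSTRAP_PMIX(PMIx_Commit());
```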
cpp/src/bootstrap/ucxx.cpp (outdated)

```cpp
std::string result;
result.reserve(input.size() * 2);
for (char ch : input) {
    auto c = static_cast<unsigned char>(ch);
    result.push_back(hex_chars[c >> 4]);
    result.push_back(hex_chars[c & 0x0F]);
}
```
Should we use stringstream?
We could use a stringstream if there's a strong preference for that; however, the existing implementation should have considerably better performance. Given it's not overly complicated, I'd prefer to keep the current one.
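For comparison, a stringstream version would look roughly like the sketch below (to_hex_via_stream is a made-up name); each character goes through the stream's formatting machinery, which is why the table-lookup loop above should be noticeably faster:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Sketch of the stringstream alternative discussed above.
std::string to_hex_via_stream(std::string const& input) {
    std::ostringstream oss;
    oss << std::hex << std::setfill('0');
    for (char ch : input) {
        // setw() must be reapplied per insertion; hex/fill persist.
        oss << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(ch));
    }
    return oss.str();
}
```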
```cpp
/**
 * @brief Apply topology-based bindings for a specific GPU.
 *
 * This function sets CPU affinity, NUMA memory binding, and network device
 * environment variables based on the topology information for the given GPU.
 *
 * @param cfg Configuration containing topology information.
 * @param gpu_id GPU ID to apply bindings for.
 * @param verbose Print warnings on failure.
 */
void apply_topology_bindings(Config const& cfg, int gpu_id, bool verbose) {
    if (!cfg.topology.has_value() || gpu_id < 0) {
        return;
    }

    auto it = cfg.gpu_topology_map.find(gpu_id);
    if (it == cfg.gpu_topology_map.end()) {
        if (verbose) {
            std::cerr << "[rrun] Warning: No topology information for GPU " << gpu_id
                      << std::endl;
        }
        return;
    }

    auto const& gpu_info = *it->second;

    if (cfg.bind_cpu && !gpu_info.cpu_affinity_list.empty()) {
        if (!set_cpu_affinity(gpu_info.cpu_affinity_list)) {
            if (verbose) {
                std::cerr << "[rrun] Warning: Failed to set CPU affinity for GPU "
                          << gpu_id << std::endl;
            }
        }
    }

    if (cfg.bind_memory && !gpu_info.memory_binding.empty()) {
        if (!set_numa_memory_binding(gpu_info.memory_binding)) {
#if RAPIDSMPF_HAVE_NUMA
            if (verbose) {
                std::cerr << "[rrun] Warning: Failed to set NUMA memory binding for GPU "
                          << gpu_id << std::endl;
            }
#endif
        }
    }

    if (cfg.bind_network && !gpu_info.network_devices.empty()) {
        std::string ucx_net_devices;
        for (size_t i = 0; i < gpu_info.network_devices.size(); ++i) {
            if (i > 0) {
                ucx_net_devices += ",";
            }
            ucx_net_devices += gpu_info.network_devices[i];
        }
        setenv("UCX_NET_DEVICES", ucx_net_devices.c_str(), 1);
    }
}
```
This was just moved to a new function for better organization.
```cpp
} else if (arg == "--") {
    // Everything after -- is the application and its arguments
    if (i + 1 < argc) {
        cfg.app_binary = argv[i + 1];
        for (int j = i + 2; j < argc; ++j) {
            cfg.app_args.push_back(argv[j]);
        }
    }
    break;
```
Optional -- separator for disambiguating rrun arguments from the application and its arguments. Not specifying one is still supported via the else condition below.
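For example (hypothetical invocations, mirroring the benchmark used elsewhere in this PR):

```sh
# With the separator: -C ucxx is passed to the application, not parsed by rrun.
rrun -n 4 -- ./benchmarks/bench_shuffle -C ucxx

# Without the separator: still supported; parsing falls through to the else
# condition mentioned above.
rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
```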
```yaml
build-pmix:
  common:
    - output_types: [conda, pyproject, requirements]
      packages:
        - libpmix-devel >=5.0,<6.0
```
pmix isn't available on PyPI (and I don't think we should try to make it so either), so I think this should only go to the conda output?
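A sketch of what that suggestion would look like, assuming the dependencies.yaml schema quoted above stays otherwise unchanged:

```yaml
build-pmix:
  common:
    - output_types: [conda]
      packages:
        - libpmix-devel >=5.0,<6.0
```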
```cpp
 *
 * Different backends have different visibility semantics for put() operations:
 * - Slurm/PMIx: Requires explicit fence (PMIx_Fence) to make data visible across nodes.
 * - FILE: put() operations are immediately visible via atomic filesystem operations.
```
Aside (we don't have to do anything here): many shared filesystems actually don't promise POSIX-style atomicity for rename/fclose, etc.
This is a good point. We should definitely handle that better in the future.
```cpp
 * For FileBackend, this is a no-op since put() operations use atomic
 * file writes that are immediately visible to all processes via the
 * shared filesystem.
 */
```
Suggested change:

```cpp
 */
```

This is an implementation detail that could easily become out of date.
It indeed is. However, as we explain what happens for FileBackend, we do the same for SlurmBackend. Would you prefer this be moved to a comment in the implementation? I think it's important to document it somewhere, but I don't have a preference whether it's here or in the implementation as a comment.
I see now you have more questions/suggestions about this. I just moved all the implementation details from docstrings to comments in the implementation, and also improved the interface documentation to be generally suitable for all backends in 7d1c478.
```cpp
 * # Hybrid mode: one task per node, 4 GPUs per task, two nodes.
 * srun \
 *   --mpi=pmix \
 *   --nodes=2 \
 *   --ntasks-per-node=1 \
 *   --cpus-per-task=144 \
 *   --gpus-per-task=4 \
 *   --gres=gpu:4 \
 *   rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
```
Question: can you explain the reason for wanting to support both passthrough (makes sense to me, all the information is configured using Slurm) and "hybrid", where we do only partial launching with srun?
Yes. In passthrough mode the user essentially has to determine the ideal topology (number of tasks, GPUs, CPUs, etc.), and in that case rrun only coordinates the bootstrapping and nothing more. In hybrid mode the user just needs to specify the number of processes (which currently should match the number of GPUs) and rrun will ensure the topology is properly set up, just like running baremetal, and will also (eventually) allow the user to specify a custom topology setup (i.e., provide a JSON file with the exact desired setup, which may be useful for new HW bringup and experimentation, e.g. when the number of CPUs should be partitioned unevenly). IOW, the hybrid mode is both a convenience and insurance in case Slurm doesn't necessarily do everything properly.

Do you think I should document what I wrote above in the code? My plan is to add a better document (like a README file) with more details about bootstrap in general in a follow-up PR, so I'm not sure if we want all that information in the code here as well.
One more thing about this: I've been thinking of alternative ways to support launching for both single-node and multi-node non-Slurm setups, and we'll need some sort of KVS for that, so one thing I have in mind is to use PMIx itself. I know we would need some infrastructure to leverage PMIx (since Slurm is not present in those cases where PMIx is already there), but the hybrid implementation would likely be reused for that, although I have nothing more concrete yet.
> IOW, the hybrid mode is both a convenience and insurance in case Slurm doesn't necessarily do everything properly.

Is this a concern today?

I think it would be easier to review this PR if we brought the two modes in in two pieces rather than all at once.
Fair enough, let me break this down into two PRs. I'll have the current PR implement most of the infrastructure and the passthrough mode, and move the hybrid mode into a new PR.
```cpp
 * The key-value pair is committed immediately and made visible to other
 * ranks via a fence operation.
```
Suggested change:

```cpp
 * The key-value pair is committed immediately and made visible to other
 * ranks after a collective `sync()`.
```

None of the methods mention the word "fence".
```cpp
 * All ranks must call this before any rank proceeds. The fence also
 * ensures all committed key-value pairs are visible to all ranks.
 *
```
I think we should remove the part about committing key-value pairs. In the abstract backend, sync() makes things visible; barrier() is just a barrier.
```cpp
 * @brief Ensure all previous put() operations are globally visible.
 *
 * For Slurm/PMIx backend, this executes PMIx_Fence to make all committed
 * key-value pairs visible across all nodes. This is required because
 * PMIx_Put + PMIx_Commit only makes data locally visible; PMIx_Fence
 * performs the global synchronization and data exchange.
 *
 * @throws std::runtime_error if PMIx_Fence fails.
```
Again, I am not sure the docstring should talk about the implementation.
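A backend-neutral wording might read roughly like the sketch below (illustrative only, not the PR's final text):

```cpp
/**
 * @brief Ensure all previous put() operations are globally visible.
 *
 * Collective operation: all ranks must call this. On return, every
 * key-value pair committed by any rank before the call is visible to
 * all ranks via get().
 *
 * @throws std::runtime_error if the underlying synchronization fails.
 */
virtual void sync() = 0;
```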
cpp/tools/rrun.cpp (outdated)

```cpp
 *
 * @throws std::runtime_error on timeout or launch failure.
 */
std::string launch_rank0_and_get_address(
```
Question: it seems like a lot of the complexity in launching here is around hybrid mode, but I am still not entirely clear why we need it. Can you explain?
Please see my response to your previous question: https://github.com/rapidsai/rapidsmpf/pull/775/files#r2767917764.
cpp/tools/rrun.cpp (outdated)

```cpp
// Barrier to ensure data exchange
backend.barrier();
```
My understanding of the API contract is that this should be backend.sync(). As it happens, barrier and sync for the PMIx backend both do the same thing, but semantically we're documented as requiring sync here.
Indeed, this precedes the time sync() was introduced, so it's outdated. Fixed now in acc85fc.
gforsyth left a comment:

The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes.
gforsyth left a comment:

Oh, ok, I'm not part of the CMake codeowners here (correctly) -- approving packaging changes.
Thanks Gil, appreciate it. It would indeed be nice to have Kyle review the CMake as well; thanks for tagging him.
This PR adds Slurm support for rrun, enabling RapidsMPF to run without MPI. This is achieved by adding a SlurmBackend class that wraps PMIx for process coordination, implementing the bootstrap operations (put/get/barrier/sync) using PMIx primitives.

The new execution mode delivers a passthrough approach with multiple tasks per node, one task per GPU. This is similar to the way MPI applications launch in Slurm, but unlike mpirun, which should not be part of the application execution, rrun must act as the launcher for the application. If rrun is omitted, Slurm will automatically fall back to MPI (if available).

Usage example:

```sh
srun \
  --mpi=pmix \
  --nodes=2 \
  --ntasks-per-node=4 \
  --cpus-per-task=36 \
  --gpus-per-task=1 \
  --gres=gpu:4 \
  rrun ./benchmarks/bench_shuffle -C ucxx
```