
Add Slurm support to rrun with PMIx-based coordination#775

Open
pentschev wants to merge 65 commits into rapidsai:main from pentschev:rrun-slurm

Conversation


@pentschev pentschev commented Jan 11, 2026

This PR adds Slurm support for rrun, enabling RapidsMPF to run without MPI. This is achieved by adding a SlurmBackend class that wraps PMIx for process coordination and implements the bootstrap operations (put/get/barrier/sync) using PMIx primitives.

The new execution mode uses a passthrough approach with multiple tasks per node, one task per GPU. This is similar to how MPI applications launch under Slurm, but unlike mpirun, which is not part of the application execution, rrun must act as the launcher for the application. If rrun is omitted, Slurm automatically falls back to MPI (if available).

Usage example:

  srun \
      --mpi=pmix \
      --nodes=2 \
      --ntasks-per-node=4 \
      --cpus-per-task=36 \
      --gpus-per-task=1 \
      --gres=gpu:4 \
      rrun ./benchmarks/bench_shuffle -C ucxx

@pentschev pentschev self-assigned this Jan 11, 2026
@pentschev pentschev added the "feature request" (New feature or request) and "non-breaking" (Introduces a non-breaking change) labels Jan 11, 2026

copy-pr-bot bot commented Jan 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@pentschev
Member Author

/ok to test

(posted seven times over the course of development to trigger CI runs)

@pentschev pentschev changed the title from "Slurm support for rrun" to "Add Slurm support to rrun with PMIx-based coordination" Feb 4, 2026
@pentschev pentschev marked this pull request as ready for review February 4, 2026 22:37
@pentschev pentschev requested review from a team as code owners February 4, 2026 22:37
@pentschev pentschev requested a review from gforsyth February 4, 2026 22:37
Contributor

@nirandaperera nirandaperera left a comment

Did a partial review

Comment on lines +26 to +37
std::optional<std::string> getenv_optional(std::string_view name);

/**
* @brief Parse integer from environment variable.
*
* Retrieves an environment variable and parses it as an integer.
*
* @param name Name of the environment variable to retrieve.
* @return Parsed integer value, or std::nullopt if not set.
* @throws std::runtime_error if the variable is set but cannot be parsed as an integer.
*/
std::optional<int> getenv_int(std::string_view name);
Contributor

Question: can't we use rapidsmpf Options here?

Member Author

No, we can't use most of the rest of RapidsMPF due to direct or indirect CUDA dependencies, and we can't have any CUDA dependencies in the bootstrapping. In the cpp files we have comments stating that:

// NOTE: Do not use RAPIDSMPF_EXPECTS or RAPIDSMPF_FAIL in this file.
// Using these macros introduces a CUDA dependency via rapidsmpf/error.hpp.
// Prefer throwing standard exceptions instead.

Contributor

Again, all the rrun stuff can't include cuda headers. So we could refactor things, but it is fiddly.
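Since these rrun helpers must stay CUDA-free, a minimal sketch of how `getenv_optional` and `getenv_int` could be implemented with only the standard library (an assumed implementation, not the PR's actual code) might look like:

```cpp
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical sketch: CUDA-free environment helpers using only the
// standard library.
std::optional<std::string> getenv_optional(std::string_view name) {
    // std::getenv needs a null-terminated string, so copy the view first.
    char const* value = std::getenv(std::string{name}.c_str());
    if (value == nullptr) {
        return std::nullopt;
    }
    return std::string{value};
}

std::optional<int> getenv_int(std::string_view name) {
    auto value = getenv_optional(name);
    if (!value.has_value()) {
        return std::nullopt;
    }
    try {
        std::size_t pos = 0;
        int parsed = std::stoi(*value, &pos);
        if (pos != value->size()) {
            throw std::invalid_argument{"trailing characters"};
        }
        return parsed;
    } catch (std::exception const&) {
        // Plain std::runtime_error on purpose: RAPIDSMPF_FAIL would pull
        // in a CUDA dependency via rapidsmpf/error.hpp.
        throw std::runtime_error{
            "Environment variable '" + std::string{name}
            + "' is set but cannot be parsed as an integer"
        };
    }
}
```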

/**
* @brief Detect backend from environment variables.
*/
Backend detect_backend() {
Contributor

Member Author

Good catch! This has changed many times during development, fixed it now.

@@ -108,6 +102,41 @@ Context init(Backend backend) {
}
break;
Contributor

Can we move these to some detail/util methods outside the switch-case scope?

Then we can switch like:

Context init(Backend backend) {
    if (backend == Backend::AUTO) {
       backend = detect_backend();
    }

    // Get rank and nranks based on backend
    switch (backend) {
          case Backend::FILE:
                  return file_backend_init(); 
          ...
...

Member Author

Good idea, done.
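The refactor suggested above could be sketched as follows; the helper names and stub bodies are illustrative, not the PR's actual code:

```cpp
#include <stdexcept>

// Sketch of the suggested refactor: per-backend init helpers live outside
// the switch, and init() just dispatches.
enum class Backend { AUTO, FILE, SLURM };

struct Context {
    int rank = -1;
    int nranks = 0;
};

// In the real code this would inspect environment variables (e.g. PMIx/Slurm
// variables); stubbed here for illustration.
Backend detect_backend() { return Backend::FILE; }

Context file_backend_init() { return Context{0, 1}; }   // stub
Context slurm_backend_init() { return Context{0, 1}; }  // stub

Context init(Backend backend) {
    if (backend == Backend::AUTO) {
        backend = detect_backend();
    }
    // Each case now only delegates; no backend-specific logic inline.
    switch (backend) {
        case Backend::FILE:
            return file_backend_init();
        case Backend::SLURM:
            return slurm_backend_init();
        default:
            throw std::runtime_error{"Unknown bootstrap backend"};
    }
}
```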

Comment on lines 72 to 91
Contributor

Use RAPIDSMPF_EXPECTS?

Contributor

Can't without refactoring to make rapidsmpf/error.hpp not depend on cuda_runtime_api.h.

Comment on lines 111 to 130
throw std::runtime_error(
"Could not determine rank for Slurm backend. "
"Ensure you're running with 'srun --mpi=pmix'."
);
}

try {
ctx.nranks = get_nranks();
} catch (const std::runtime_error& e) {
throw std::runtime_error(
"Could not determine nranks for Slurm backend. "
"Ensure you're running with 'srun --mpi=pmix'."
);
}

if (!(ctx.rank >= 0 && ctx.rank < ctx.nranks)) {
throw std::runtime_error(
"Invalid rank: " + std::to_string(ctx.rank) + " must be in range [0, "
+ std::to_string(ctx.nranks) + ")"
);
Contributor

RAPIDSMPF_FAIL?

Member Author

#ifdef RAPIDSMPF_HAVE_SLURM
case Backend::SLURM:
{
detail::SlurmBackend backend{ctx};
Contributor

Should we create SlurmBackend every operation here? Or should we attach it to the ctx with a Backend interface class and shared_ptr?

Member Author

I agree, this wasn't ideal and I had planned to address it in a future PR. Since you pointed it out, I'll address it here.

Member Author

Done now in 82441ef .

Comment on lines +98 to +102
void check_pmix_status(pmix_status_t status, std::string const& operation) {
if (status != PMIX_SUCCESS) {
throw std::runtime_error(operation + " failed: " + pmix_error_string(status));
}
}
Contributor

Maybe, let's use a helper function like RAPIDSMPF_MPI

Member Author

This is the helper function; I'm not sure exactly what you're proposing. Is it just that the name doesn't follow the RAPIDSMPF_... pattern, or something else? If it's just the naming pattern I can adjust it, but I deliberately avoided that pattern: given the CUDA restrictions we have, reusing it could create confusion with the "regular" RapidsMPF code.

Comment on lines 32 to 38
std::string result;
result.reserve(input.size() * 2);
for (char ch : input) {
auto c = static_cast<unsigned char>(ch);
result.push_back(hex_chars[c >> 4]);
result.push_back(hex_chars[c & 0x0F]);
}
Contributor

should we use stringstream?

Member Author

We could use a stringstream if there's a strong preference for that; however, the existing implementation should have considerably better performance. Given that it's not overly complicated, I'd prefer to keep the current one.
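For comparison, a stringstream-based variant would look roughly like this sketch. It is functionally equivalent to the loop in the diff above, but routes every byte through the stream's formatting machinery, which is the performance concern mentioned here:

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical stringstream variant of the hex encoder, for comparison.
std::string to_hex_stream(std::string_view input) {
    std::ostringstream oss;
    oss << std::hex << std::setfill('0');
    for (char ch : input) {
        // setw is consumed by each insertion, so reapply it per byte.
        oss << std::setw(2)
            << static_cast<unsigned int>(static_cast<unsigned char>(ch));
    }
    return oss.str();
}
```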

Comment on lines +422 to +478
/**
* @brief Apply topology-based bindings for a specific GPU.
*
* This function sets CPU affinity, NUMA memory binding, and network device
* environment variables based on the topology information for the given GPU.
*
* @param cfg Configuration containing topology information.
* @param gpu_id GPU ID to apply bindings for.
* @param verbose Print warnings on failure.
*/
void apply_topology_bindings(Config const& cfg, int gpu_id, bool verbose) {
if (!cfg.topology.has_value() || gpu_id < 0) {
return;
}

auto it = cfg.gpu_topology_map.find(gpu_id);
if (it == cfg.gpu_topology_map.end()) {
if (verbose) {
std::cerr << "[rrun] Warning: No topology information for GPU " << gpu_id
<< std::endl;
}
return;
}

auto const& gpu_info = *it->second;

if (cfg.bind_cpu && !gpu_info.cpu_affinity_list.empty()) {
if (!set_cpu_affinity(gpu_info.cpu_affinity_list)) {
if (verbose) {
std::cerr << "[rrun] Warning: Failed to set CPU affinity for GPU "
<< gpu_id << std::endl;
}
}
}

if (cfg.bind_memory && !gpu_info.memory_binding.empty()) {
if (!set_numa_memory_binding(gpu_info.memory_binding)) {
#if RAPIDSMPF_HAVE_NUMA
if (verbose) {
std::cerr << "[rrun] Warning: Failed to set NUMA memory binding for GPU "
<< gpu_id << std::endl;
}
#endif
}
}

if (cfg.bind_network && !gpu_info.network_devices.empty()) {
std::string ucx_net_devices;
for (size_t i = 0; i < gpu_info.network_devices.size(); ++i) {
if (i > 0) {
ucx_net_devices += ",";
}
ucx_net_devices += gpu_info.network_devices[i];
}
setenv("UCX_NET_DEVICES", ucx_net_devices.c_str(), 1);
}
}
Member Author

This was just moved to a new function for better organization.

Comment on lines +607 to +615
} else if (arg == "--") {
// Everything after -- is the application and its arguments
if (i + 1 < argc) {
cfg.app_binary = argv[i + 1];
for (int j = i + 2; j < argc; ++j) {
cfg.app_args.push_back(argv[j]);
}
}
break;
Member Author

An optional -- separator disambiguates rrun arguments from the application and its arguments. Omitting it is still supported via the else condition below.
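A minimal, self-contained sketch of the -- handling quoted above; the Config fields mirror the snippet, while the function name is made up for illustration:

```cpp
#include <string>
#include <vector>

// Illustrative-only parser: everything after "--" becomes the application
// binary and its arguments, mirroring the quoted snippet.
struct Config {
    std::string app_binary;
    std::vector<std::string> app_args;
};

Config parse_after_separator(int argc, char const* const* argv) {
    Config cfg;
    for (int i = 1; i < argc; ++i) {
        if (std::string{argv[i]} == "--") {
            if (i + 1 < argc) {
                cfg.app_binary = argv[i + 1];
                for (int j = i + 2; j < argc; ++j) {
                    cfg.app_args.emplace_back(argv[j]);
                }
            }
            break;  // rrun stops interpreting arguments at "--"
        }
    }
    return cfg;
}
```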


Contributor

@wence- wence- left a comment

Partial

Comment on lines 222 to 226
build-pmix:
common:
- output_types: [conda, pyproject, requirements]
packages:
- libpmix-devel >=5.0,<6.0
Contributor

pmix isn't available on pypi (I don't think we should try and make it so either). So I think this should only go to the conda output?

Member Author

Good catch, fixed.

*
* Different backends have different visibility semantics for put() operations:
* - Slurm/PMIx: Requires explicit fence (PMIx_Fence) to make data visible across nodes.
* - FILE: put() operations are immediately visible via atomic filesystem operations.
Contributor

Aside (we don't have to do anything here). Many shared filesystems actually don't promise posix-style atomicity for rename/fclose etc...

Member Author

This is a good point. We should definitely handle that better in the future.

* For FileBackend, this is a no-op since put() operations use atomic
* file writes that are immediately visible to all processes via the
* shared filesystem.
*/
Contributor

Suggested change
*/

This is an implementation detail that could easily become out of date.

Member Author

It indeed is. However, just as we explain what happens for FileBackend, we do the same for SlurmBackend. Would you prefer this be moved to a comment in the implementation? I think it's important to document it somewhere, but I don't have a preference between here and a comment in the implementation.

Member Author

I see now you have more questions/suggestions about this. I just moved all the implementation details from the docstrings into comments in the implementation, and also improved the interface documentation to be generally suitable for all backends in 7d1c478.

Comment on lines 42 to 50
* # Hybrid mode: one task per node, 4 GPUs per task, two nodes.
* srun \
* --mpi=pmix \
* --nodes=2 \
* --ntasks-per-node=1 \
* --cpus-per-task=144 \
* --gpus-per-task=4 \
* --gres=gpu:4 \
* rrun -n 4 ./benchmarks/bench_shuffle -C ucxx
Contributor

question: Can you explain the reason for wanting to support both passthrough (makes sense to me; all the information is configured using Slurm) and "hybrid", where we only do partial launching with srun?

Member Author

Yes. Passthrough mode essentially requires the user to determine the ideal topology (number of tasks, GPUs, CPUs, etc.), and in that case rrun only coordinates the bootstrapping and nothing more. In hybrid mode the user just needs to specify the number of processes (which currently should match the number of GPUs) and rrun ensures the topology is properly set up, just as if you were running bare-metal. It also (will) allow the user to specify a custom topology setup, i.e., provide a JSON file with the exact setup desired, which may be useful for new HW bring-up and experimentation, for example when the number of CPUs should be partitioned unevenly. IOW, hybrid mode is both a convenience and insurance in case Slurm doesn't do everything properly.

Do you think I should document what I wrote above in the code? My plan is to add a better document (like a README file) with more details about bootstrap in general in a follow-up PR, so not sure if we want all that information in the code here as well.

Member Author

One more thing about this: I've been thinking of alternative ways to support launching for both single-node and multi-node non-Slurm setups, and we'll need some sort of KVS for that. One thing I have in mind is to use PMIx itself. I know we would need some infrastructure to leverage PMIx (since in those cases Slurm is not present, while PMIx is already there), and the hybrid implementation would likely be reused for that, although I have nothing more concrete yet.

Contributor

IOW, the hybrid mode is both a convenience and insurance in case Slurm doesn't necessarily does everything properly.

Is this a concern today?

I think it would be easier to review this PR if we brought the two modes in in two pieces rather than all at once.

Member Author

Fair enough, let me break this down into two PRs. I'll have the current PR implementing most of the infrastructure and the passthrough mode, and move the hybrid mode into a new PR.

Contributor

Thanks!

Member Author

@pentschev pentschev Feb 5, 2026

I've removed it now in 5b9da5c (about 500 LOC shorter now) and started a new draft #844 to be reviewed/merged after this goes in.

Comment on lines 80 to 81
* The key-value pair is committed immediately and made visible to other
* ranks via a fence operation.
Contributor

Suggested change
* The key-value pair is committed immediately and made visible to other
* ranks via a fence operation.
* The key-value pair is committed immediately and made visible to other
* ranks after a collective `sync()`.

None of the methods mention the word "fence".

Member Author

Done in 7d1c478

Comment on lines 107 to 109
* All ranks must call this before any rank proceeds. The fence also
* ensures all committed key-value pairs are visible to all ranks.
*
Contributor

I think we should remove the committing key-value pairs part. In the abstract backend sync makes things visible, barrier is just a barrier.

Member Author

Done in 7d1c478

Comment on lines 115 to 122
* @brief Ensure all previous put() operations are globally visible.
*
* For Slurm/PMIx backend, this executes PMIx_Fence to make all committed
* key-value pairs visible across all nodes. This is required because
* PMIx_Put + PMIx_Commit only makes data locally visible; PMIx_Fence
* performs the global synchronization and data exchange.
*
* @throws std::runtime_error if PMIx_Fence fails.
Contributor

Again, I am not sure the docstring should talk about the implementation.

Member Author

Done in 7d1c478


*
* @throws std::runtime_error on timeout or launch failure.
*/
std::string launch_rank0_and_get_address(
Contributor

question: It seems like a lot of the complexity in launching here is around hybrid mode. But I am still not entirely clear why we need it, can you explain?

Member Author

Please see my response to your previous question https://github.com/rapidsai/rapidsmpf/pull/775/files#r2767917764.

Comment on lines 1176 to 1177
// Barrier to ensure data exchange
backend.barrier();
Contributor

My understanding of the API contract is that this should be backend.sync(). As it happens barrier and sync for the pmix backend both do the same thing. But semantically, we're documented as requiring sync here.

Member Author

Indeed, this precedes the time sync() was introduced so it's outdated. Fixed now in acc85fc.
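The API contract discussed above (put() stages data, sync() publishes it, barrier() only synchronizes control flow) can be illustrated with a hypothetical single-process, in-memory mock; this is not the PR's backend, just a sketch of the semantics:

```cpp
#include <map>
#include <optional>
#include <string>

// In-memory mock of the bootstrap backend contract (illustrative only).
class MockBackend {
  public:
    void put(std::string key, std::string value) {
        // Staged locally only, like PMIx_Put before the fence.
        staged_[std::move(key)] = std::move(value);
    }
    void sync() {
        // Analogous to PMIx_Commit + PMIx_Fence: publish staged entries.
        visible_.insert(staged_.begin(), staged_.end());
        staged_.clear();
    }
    void barrier() {
        // Control-flow rendezvous only; publishes nothing.
    }
    std::optional<std::string> get(std::string const& key) const {
        auto it = visible_.find(key);
        if (it == visible_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

  private:
    std::map<std::string, std::string> staged_;
    std::map<std::string, std::string> visible_;
};
```

This makes the review point concrete: after put() plus barrier() the data is still invisible; only sync() publishes it, even if a particular backend happens to implement both with the same primitive.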

@pentschev pentschev requested a review from a team as a code owner February 5, 2026 09:41
Contributor

@gforsyth gforsyth left a comment

The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes

Contributor

@gforsyth gforsyth left a comment

Oh, ok, I'm not part of the cmake codeowners here (correctly) -- approving packaging changes.

@pentschev
Member Author

The updates to the conda recipes look good to me, but I'd like @KyleFromNVIDIA to take a look at the CMake changes

Oh, ok, I'm not part of the cmake codeowners here (correctly) -- approving packaging changes.

Thanks Gil, appreciate it. Would indeed be nice to have Kyle review CMake as well, thanks for tagging him.
