
Slurm support #123

Open

t-ramz wants to merge 4 commits into TRIQS:unstable from t-ramz:slurm-support

Conversation

t-ramz commented Dec 17, 2025

I have a user at ORNL's OLCF who requested to use solid_dmft on one of our resources, and I have had trouble making it run under the Slurm workload manager. I made a few changes here that help alleviate the problem and make it "work," but one issue remains: submitted sub-jobs trigger an MPI error that ends in an MPI_ABORT, which may be beyond the scope of this software.

Overall, I think that if you want to include Slurm-only support out of the box, there may need to be a switch or feature in place to divvy up tasks from the host MPI process and allocate them to VASP/dft_exe accordingly, e.g. acquire all ranks up front for solid_dmft and, when running dft_exe, provide explicit rank mappings in some way.
Take this with a grain of salt; I don't know your codebase.
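
For illustration only, a minimal mpi4py sketch of what splitting ranks off the host communicator could look like; the variable n_dft_ranks and the overall structure are assumptions for this sketch, not solid_dmft code:

    # Hypothetical sketch, not part of this PR: split MPI_COMM_WORLD so a subset of
    # ranks is reserved for dft_exe while the remaining ranks keep running the DMFT side.
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    n_dft_ranks = 4  # assumed number of ranks to hand to dft_exe
    color = 0 if world.rank < n_dft_ranks else 1
    sub_comm = world.Split(color=color, key=world.rank)

    if color == 0:
        pass  # these ranks would drive VASP/dft_exe with an explicit rank mapping
    else:
        pass  # the remaining ranks stay with solid_dmft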

Let me know if you have thoughts or questions and I'll do my best to get back to you!

t-ramz marked this pull request as ready for review December 17, 2025 23:59
the-hampel (Member)

Hi @t-ramz,
thank you for your addition! We can of course add such an option. See the two questions/comments I have below.

Regarding the MPI_ABORT: if you have specific ideas, let me know. We often have trouble with this anyway; for example, if one of the ranks gets a signal 9 / 15 or similar, the other ranks often don't die.

Best,
Alex

t-ramz (Author) commented Dec 18, 2025

Hi Alex,

I believe I ran into the MPI_ABORT on my end because Slurm tried to allocate the same node and re-instantiate the base MPI communicator, which I speculate caused a fault in the parent process.

One way around this, which I discussed with a colleague, may be to create a secondary MPI communicator and use that communicator for the dft_exe process.
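
A rough sketch of that idea using mpi4py's dynamic process management; the executable name and maxprocs are placeholders, and this assumes the MPI/Slurm setup supports MPI_Comm_spawn:

    # Hypothetical sketch only: launch dft_exe on its own intercommunicator
    # instead of forking a shell subprocess from the parent MPI process.
    from mpi4py import MPI

    dft_comm = MPI.COMM_SELF.Spawn('vasp_std', args=[], maxprocs=8)  # placeholder command and size
    # ... wait for the DFT step to finish ...
    dft_comm.Disconnect()  # tear down the intercommunicator when done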

Thanks,
Anthony

the-hampel (Member)

Yes, forking and spawning a subprocess is not super elegant and a bit problematic for some HPC configs. It may be worth exploring spawning another communicator, though maybe not in the scope of this PR? If you can comment on the above-mentioned questions, we can merge this soon.

Best,
Alex


    hostnames = mpi.world.gather(socket.gethostname(), root=0)
    if cluster_name == 'slurm':
        slurm_hostnames = [hostname.split('.')[0] for hostname in hostnames]  # TODO: please find a better solution

the-hampel (Member):

can you enlighten me why this extra part is necessary in this case?
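
One plausible but unconfirmed explanation (an assumption, not an answer from the thread): socket.gethostname() can return a fully qualified domain name while Slurm's node lists typically use short hostnames, so the split keeps only the short part:

    # Illustrative values only, not taken from the PR:
    hostname = 'node0123.example.cluster'   # what socket.gethostname() might return
    short_name = hostname.split('.')[0]     # 'node0123', the form Slurm typically reports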


    if mpi_profile == 'slurm':
        return [
            mpi_exe, '-n', str(number_cores), '--export=PATH',

the-hampel (Member):

should we enforce here that mpi_exe is 'srun'?
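
Purely as an illustration of what such a guard could look like (not part of the PR; mpi_profile and mpi_exe are taken from the quoted context):

    import os

    if mpi_profile == 'slurm' and os.path.basename(mpi_exe) != 'srun':
        raise ValueError(f"mpi_profile 'slurm' expects mpi_exe to be 'srun', got '{mpi_exe}'")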


the-hampel commented Feb 5, 2026

Sorry, I just remembered that this PR was still open. Did you see my last two comments?

1)

    if cluster_name == 'slurm':
        slurm_hostnames = [hostname.split('.')[0] for hostname in hostnames]  # TODO: please find a better solution

can you enlighten me why this extra part is necessary in this case?

2)

    if mpi_profile == 'slurm':
        return [
            mpi_exe, '-n', str(number_cores), '--export=PATH',

should we enforce here that mpi_exe is 'srun'?

If we discuss these two, then we can merge this.
