Problems with SLURM scheduling when using multiple processors per job #182
Unanswered
lauraengelhardt asked this question in Q&A
Replies: 1 comment, 1 reply
If I remember correctly, I have experienced the same behaviour at our cluster.
As already discussed offline with @gilrei and @maxdinkel, I experience problems when trying to run simulations using multiple processors per job on a cluster.
Example input
Let me start with an example input:
Problem
Running this on our LNM cluster, though, does not work stably whenever num_procs ≠ 1 (e.g., num_procs=4, num_jobs=16, or num_procs=8, num_jobs=8). Simulations randomly fail with the following error message (example):
We suspect (correct me if I’m wrong @gilrei @maxdinkel) that something goes wrong when processors from multiple nodes are allocated for a job.
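One way to check this suspicion is to look at how many nodes SLURM actually allocated to the failing jobs. A sketch using standard SLURM commands (the job ID is a placeholder):

```shell
# Show node count and node list for a finished (or failed) job;
# NNodes > 1 would support the multi-node suspicion. Job ID is a placeholder.
sacct -j 123456 --format=JobID,NNodes,NodeList,State

# For currently running jobs: job ID, number of nodes, node list.
squeue -u $USER -o "%.12i %.4D %N"
```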
Other observations:
Working fixes
The only stable working solution so far is setting the --exclusive option for the SLURM job submissions (so just one job per node). This is of course very inefficient when using fewer processors than are available per node.
Ideas what to test/check
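One idea worth testing: instead of the wasteful --exclusive allocation, explicitly pin each job to a single node while requesting only the processors it needs. A hypothetical job script sketch (the SBATCH flags are standard SLURM; the solver call is a placeholder):

```shell
#!/bin/bash
# Hypothetical submission script. Requesting only --ntasks lets SLURM
# scatter the tasks across nodes; pinning to one node rules that out
# without blocking the rest of the node like --exclusive does.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

srun ./my_simulation   # placeholder for the actual solver call
```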