Problems with SLURM scheduling when using multiple processors per job #182
Unanswered
lauraengelhardt asked this question in Q&A
Replies: 1 comment, 1 reply
If I remember correctly, I have experienced the same behaviour at our cluster.
As already discussed offline with @gilrei and @maxdinkel, I experience problems when trying to run simulations using multiple processors per job on a cluster.
Example input
Let me start with an example input:
Problem
Running this on our LNM cluster, though, does not work stably whenever num_procs ≠ 1 (e.g., num_procs=4, num_jobs=16, or num_procs=8, num_jobs=8). Simulations randomly fail with the following error message (example):
We suspect (correct me if I’m wrong @gilrei @maxdinkel) that something goes wrong when processors from multiple nodes are allocated for a job.
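One way to check this suspicion is to look at how many nodes SLURM actually allocated to the failing jobs. A sketch using standard SLURM commands (the job ID is a placeholder):

```shell
# Show node count and node list for a finished (or failed) job;
# NNodes > 1 would support the multi-node suspicion. Job ID is a placeholder.
sacct -j 123456 --format=JobID,NNodes,NodeList,State

# For currently running jobs: job ID, number of nodes, node list.
squeue -u $USER -o "%.12i %.4D %N"
```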
Other observations:
Working fixes
The only stable working solution so far is setting the --exclusive option for the SLURM job submissions (so just one job per node). This is of course very inefficient when using fewer processors than are available per node.
Ideas what to test/check
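One idea worth testing: instead of the wasteful --exclusive allocation, explicitly pin each job to a single node while requesting only the processors it needs. A hypothetical job script sketch (the SBATCH flags are standard SLURM; the solver call is a placeholder):

```shell
#!/bin/bash
# Hypothetical submission script. Requesting only --ntasks lets SLURM
# scatter the tasks across nodes; pinning to one node rules that out
# without blocking the rest of the node like --exclusive does.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

srun ./my_simulation   # placeholder for the actual solver call
```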