AWS ParallelCluster allows running a Slurm cluster on AWS. Here are some things that do not work well with this profile, documented both for other users trying this and as input for possibly making the profile work out of the box on default cluster installations.
Some resources:
- working profile for reference https://github.com/cbrueffer/snakemake-aws-parallelcluster-slurm
- blog post with some more ParallelCluster notes: https://www.brueffer.io/post/snakemake-aws-parallelcluster/
Tested with Snakemake version 7.8.5.
Issues:
- ParallelCluster Slurm cannot be used with `sbatch --mem`. Using this option sends nodes straight into the DRAINED state; see also https://blog.ronin.cloud/slurm-parallelcluster-troubleshooting/
- By default ParallelCluster does not come with accounting, so `sacct` does not work. While the job status script supports querying with `scontrol`, this also led to issues in my case (to get this far in the first place I removed `mem`/`mem-per-cpu` from `RESOURCE_MAPPING` in `slurm-submit.py` so jobs would run, see above; a sketch of that edit follows the log below):
```
127.0.0.1 - - [19/Jul/2022 14:17:45] "POST /job/register/11557 HTTP/1.1" 200 -
Submitted job 3661 with external jobid '11557'.
[Tue Jul 19 14:17:45 2022]
rule foo:
    input: results/xxx.vcf.gz
    output: results/xxx.pdf
    jobid: 3568
    reason: Missing output files: results/xxx.pdf
    wildcards: sample=xxx
    resources: mem_mb=1000, disk_mb=100000, tmpdir=/scratch, runtime=1000, partition=compute-small
[...]
Submitted job 3747 with external jobid '11561'.
/bin/sh: 11557: command not found
WorkflowError:
Failed to obtain job status. See above for error message.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Cluster sidecar process has terminated (retcode=0).
```
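For reference, this is roughly the edit to `slurm-submit.py` that got jobs running. The exact contents of `RESOURCE_MAPPING` may differ between profile versions, so treat this as a sketch rather than a verbatim diff:

```python
# slurm-submit.py (excerpt): RESOURCE_MAPPING translates Snakemake resource
# names into sbatch options. Commenting out the memory entries keeps the
# profile from ever passing --mem/--mem-per-cpu, which sends ParallelCluster
# nodes into the DRAINED state (see above). The mapping below approximates
# the upstream profile and may differ in your version.
RESOURCE_MAPPING = {
    "time": ("time", "runtime", "walltime"),
    # "mem": ("mem", "mem_mb", "ram", "memory"),
    # "mem-per-cpu": ("mem-per-cpu", "mem_per_cpu", "mem_per_thread"),
    "nodes": ("nodes", "nnodes"),
    "partition": ("partition", "queue"),
}
```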
config.yaml:
```yaml
restart-times: 3
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
cluster-status: ""  # NB: duplicate key; most YAML parsers keep this last (empty) value
cluster-sidecar: "slurm-sidecar.py"
cluster-cancel: "scancel"
max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60
# Example resource configuration
default-resources:
  - runtime=1000
  # - mem_mb=4500
  - disk_mb=100000
  - tmpdir="/scratch"
  - partition="compute-small"
# set-threads: map rule names to threads
# set-threads:
#   - single_core_rule=1
#   - multi_core_rule=10
# set-resources: map rule names to resources in general
# set-resources:
#   - high_memory_rule:mem_mb=12000
#   - long_running_rule:runtime=1200
```
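Since `--mem` cannot be used, per-rule memory requirements have to be expressed some other way, for instance by steering memory-hungry rules to a queue backed by larger instances. A minimal sketch using the profile's `set-resources` syntax (the rule name and the `compute-large` partition are made up):

```yaml
# Hypothetical: route a heavy rule to a bigger queue instead of requesting
# memory explicitly; rule and partition names are placeholders.
set-resources:
  - high_memory_rule:partition="compute-large"
```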
settings.json:
```json
{
    "SBATCH_DEFAULTS": "",
    "CLUSTER_NAME": "",
    "CLUSTER_CONFIG": ""
}
```
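As an aside, I believe the profile's `SBATCH_DEFAULTS` accepts space-separated `sbatch` arguments, so cluster-wide defaults such as the partition could alternatively be set here instead of in `default-resources`. Treat this as a sketch (JSON does not allow comments; `compute-small` is just the partition name from the config above):

```json
{
    "SBATCH_DEFAULTS": "partition=compute-small",
    "CLUSTER_NAME": "",
    "CLUSTER_CONFIG": ""
}
```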