Description
I recently tried to start up a c5.12xlarge instance on AWS and hit a case where /mnt/shared/etc/slurm.conf claims the instance should have RealMem=94992, but when the node comes up, slurmctld.log shows it has less memory than slurm.conf specifies, so Slurm rejects the node and puts it in the DRAIN state:
```
[2021-03-05T18:40:19.392] _slurm_rpc_submit_batch_job: JobId=9 InitPrio=4294901757 usec=551
[2021-03-05T18:40:19.879] sched: Allocate JobId=9 NodeList=nice-wolf-c5-12xlarge-0001 #CPUs=48 Partition=compute
[2021-03-05T18:41:58.852] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:58.852] Node nice-wolf-c5-12xlarge-0001 now responding
[2021-03-05T18:41:58.852] error: Setting node nice-wolf-c5-12xlarge-0001 state to DRAIN
[2021-03-05T18:41:58.852] drain_nodes: node nice-wolf-c5-12xlarge-0001 state set to DRAIN
[2021-03-05T18:41:58.852] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
[2021-03-05T18:41:59.855] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:59.855] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
```
This led me to the following calculation for expected RealMem for AWS (line 104 in c32b80a):

```python
"memory": d["MemoryInfo"]["SizeInMiB"] - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),
```
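As a sanity check (a sketch, not code from the repo): plugging in the c5.12xlarge's advertised memory of 98304 MiB (96 GiB, the standard value EC2's DescribeInstanceTypes reports for this instance type, not a figure taken from the logs above) reproduces the RealMem=94992 in slurm.conf, yet the node actually registered with only 94256 MiB:

```python
import math

# Advertised memory for a c5.12xlarge (96 GiB), per EC2 DescribeInstanceTypes.
size_in_mib = 98304

# The AWS-side estimate: advertised size minus a power-law overhead term.
estimated = size_in_mib - int(math.pow(size_in_mib, 0.7) * 0.9 + 500)
print(estimated)             # 94992 -- matches RealMem in slurm.conf

# What the node actually reported to slurmctld:
reported = 94256
print(estimated - reported)  # 736 -- the estimate is 736 MiB too optimistic
```

So the formula's built-in margin (about 3312 MiB here) falls 736 MiB short on this instance.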
Contrast this to the GCP memory calculation (line 106 in c32b80a):

```python
"memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),
```
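To illustrate how far off this is, consider a hypothetical machine type with the same 96 GiB (98304 MiB) of advertised memory (the input value here is for comparison only, not taken from the GCP code):

```python
import math

memory_mb = 98304  # hypothetical machine type advertising 96 GiB

# The GCP-side estimate never subtracts from the advertised total,
# so the result is just the overhead term by itself.
estimated = int(math.pow(memory_mb, 0.7) * 0.9 + 500)
print(estimated)  # 3312 -- a 96 GiB machine is configured with ~3.2 GiB
```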
It appears that the AWS code is attempting to estimate how much memory will actually be available (versus what is advertised) by subtracting an overhead term, while the GCP code returns only the overhead term itself, drastically underestimating the available memory.
Heavily underestimating the amount of available memory makes Slurm more tolerant of nodes that don't quite deliver their advertised capacity, but it causes problems when jobs request a specific amount of memory. These two cloud calculations should probably be consistent, and I think the estimates need to be more conservative (lower) than what AWS currently calculates, as the c5.12xlarge example above shows.
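As a rough illustration of what "more conservative" would have to mean on this instance (a sketch only; the 1.25 factor below is a made-up example, not a vetted constant or a concrete proposal): the overhead term would need to grow from 3312 MiB to at least 4048 MiB for the configured RealMem not to exceed what the node actually reports.

```python
import math

size_in_mib = 98304   # c5.12xlarge advertised memory (MiB)
reported = 94256      # memory the node actually registered with

# Minimum overhead needed so slurm.conf's RealMem <= reported memory:
min_overhead = size_in_mib - reported
print(min_overhead)   # 4048 MiB on this instance

# Hypothetical, more conservative variant: scale the power-law term up.
conservative = size_in_mib - int(math.pow(size_in_mib, 0.7) * 1.25 + 500)
print(conservative <= reported)  # True -- this node would now pass
```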