Description
I recently tried to start up a c5.12xlarge instance on AWS and hit a case where /mnt/shared/etc/slurm.conf claims the instance should have RealMem=94992, but when the node comes up, slurmctld.log shows it has less memory than slurm.conf specifies, so Slurm rejects the node and puts it in the DRAIN state:
```
[2021-03-05T18:40:19.392] _slurm_rpc_submit_batch_job: JobId=9 InitPrio=4294901757 usec=551
[2021-03-05T18:40:19.879] sched: Allocate JobId=9 NodeList=nice-wolf-c5-12xlarge-0001 #CPUs=48 Partition=compute
[2021-03-05T18:41:58.852] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:58.852] Node nice-wolf-c5-12xlarge-0001 now responding
[2021-03-05T18:41:58.852] error: Setting node nice-wolf-c5-12xlarge-0001 state to DRAIN
[2021-03-05T18:41:58.852] drain_nodes: node nice-wolf-c5-12xlarge-0001 state set to DRAIN
[2021-03-05T18:41:58.852] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
[2021-03-05T18:41:59.855] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:59.855] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
```
This led me to the following calculation for expected RealMem for AWS (line 104 in c32b80a):

```python
"memory": d["MemoryInfo"]["SizeInMiB"] - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),
```
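As a sanity check (a sketch, not code from the repo): plugging in the c5.12xlarge's advertised memory of 98304 MiB (96 GiB, the standard value EC2's DescribeInstanceTypes reports for this instance type, not a figure taken from the logs above) reproduces the RealMem=94992 in slurm.conf, yet the node actually registered with only 94256 MiB:

```python
import math

# Advertised memory for a c5.12xlarge (96 GiB), per EC2 DescribeInstanceTypes.
size_in_mib = 98304

# The AWS-side estimate: advertised size minus a power-law overhead term.
estimated = size_in_mib - int(math.pow(size_in_mib, 0.7) * 0.9 + 500)
print(estimated)             # 94992 -- matches RealMem in slurm.conf

# What the node actually reported to slurmctld:
reported = 94256
print(estimated - reported)  # 736 -- the estimate is 736 MiB too optimistic
```

So the formula's built-in margin (about 3312 MiB here) falls 736 MiB short on this instance.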
Contrast this to the GCP memory calculation (line 106 in c32b80a):

```python
"memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),
```
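To illustrate how far off this is, consider a hypothetical machine type with the same 96 GiB (98304 MiB) of advertised memory (the input value here is for comparison only, not taken from the GCP code):

```python
import math

memory_mb = 98304  # hypothetical machine type advertising 96 GiB

# The GCP-side estimate never subtracts from the advertised total,
# so the result is just the overhead term by itself.
estimated = int(math.pow(memory_mb, 0.7) * 0.9 + 500)
print(estimated)  # 3312 -- a 96 GiB machine is configured with ~3.2 GiB
```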
It appears that the AWS code is attempting to estimate how much memory will actually be available (versus what is advertised) by subtracting an overhead term, while the GCP code returns only the overhead term itself, drastically underestimating the available memory.
Heavily underestimating the amount of available memory makes Slurm more tolerant of nodes that don't quite deliver their advertised capacity, but it causes problems when jobs request a specific amount of memory. These two cloud calculations should probably be consistent, and I think the estimates need to be more conservative (lower) than what AWS currently calculates, as the c5.12xlarge example above shows.
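As a rough illustration of what "more conservative" would have to mean on this instance (a sketch only; the 1.25 factor below is a made-up example, not a vetted constant or a concrete proposal): the overhead term would need to grow from 3312 MiB to at least 4048 MiB for the configured RealMem not to exceed what the node actually reports.

```python
import math

size_in_mib = 98304   # c5.12xlarge advertised memory (MiB)
reported = 94256      # memory the node actually registered with

# Minimum overhead needed so slurm.conf's RealMem <= reported memory:
min_overhead = size_in_mib - reported
print(min_overhead)   # 4048 MiB on this instance

# Hypothetical, more conservative variant: scale the power-law term up.
conservative = size_in_mib - int(math.pow(size_in_mib, 0.7) * 1.25 + 500)
print(conservative <= reported)  # True -- this node would now pass
```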