Instance Memory calculation #17

@ghost

Description

I recently tried to start a c5.12xlarge instance on AWS and hit the case where the /mnt/shared/etc/slurm.conf file claims the instance should have RealMem=94992, but when the node comes up, slurmctld.log shows that the node has less memory than slurm.conf indicates, so Slurm rejects the node and puts it in the DRAIN state:

[2021-03-05T18:40:19.392] _slurm_rpc_submit_batch_job: JobId=9 InitPrio=4294901757 usec=551
[2021-03-05T18:40:19.879] sched: Allocate JobId=9 NodeList=nice-wolf-c5-12xlarge-0001 #CPUs=48 Partition=compute
[2021-03-05T18:41:58.852] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:58.852] Node nice-wolf-c5-12xlarge-0001 now responding
[2021-03-05T18:41:58.852] error: Setting node nice-wolf-c5-12xlarge-0001 state to DRAIN
[2021-03-05T18:41:58.852] drain_nodes: node nice-wolf-c5-12xlarge-0001 state set to DRAIN
[2021-03-05T18:41:58.852] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
[2021-03-05T18:41:59.855] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:59.855] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument

This led me to the following calculation for expected RealMem for AWS:

"memory": d["MemoryInfo"]["SizeInMiB"]

        "memory": d["MemoryInfo"]["SizeInMiB"] - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),

Contrast this to GCP memory calculation:

"memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),

        "memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),

It appears that the AWS config attempts to estimate how much memory will actually be available (versus what is advertised), but the GCP code reports only the overhead estimate itself, rather than subtracting it from the advertised memory, so it drastically underestimates.
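
To make the gap concrete (again purely illustrative, plugging in the same 98,304 figure as if it were the machine type's memoryMb):

    import math

    advertised = 98304  # illustrative memoryMb value, matching the c5.12xlarge above

    # GCP-style value: only the overhead term, not advertised memory minus overhead.
    gcp_memory = int(math.pow(advertised, 0.7) * 0.9 + 500)
    print(gcp_memory)  # 3312 -- a tiny fraction of the ~96 GiB actually in the machine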

Heavily under-estimating the available memory makes Slurm more tolerant of nodes that don't quite meet their advertised claims, but it can cause issues when jobs request a specific amount of memory. These two cloud calculations should probably be consistent, and I think the estimates need to be more conservative (lower) than what AWS currently calculates, as shown by the c5.12xlarge example above.
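
One possible direction (a hypothetical sketch only, not a concrete proposal for either codebase) would be to keep the overhead estimate but leave extra headroom, so a node that comes up slightly short of the advertised figure is still accepted:

    import math

    def conservative_realmem(advertised_mib, slack_mib=1024):
        # Hypothetical helper: subtract the overhead estimate plus an extra
        # safety margin so small shortfalls (like the 736 MiB gap above)
        # don't put the node into DRAIN.
        overhead = int(math.pow(advertised_mib, 0.7) * 0.9 + 500)
        return advertised_mib - overhead - slack_mib

    print(conservative_realmem(98304))  # 93968, safely below the 94256 MiB the node reported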
