This repository is intended for engineers looking to horizontally scale GPU-based Machine Learning (ML) workloads on Amazon ECS. This example is for demonstrative purposes only and is not intended for production use.
- By default, GPU utilization metrics are not part of the predefined metrics available with Application Auto Scaling.
- As such, you implement auto scaling based on custom metrics. See Autoscaling Amazon ECS services based on custom metrics with Application Auto Scaling.
- For NVIDIA-based GPUs, you use DCGM-Exporter in your container to expose GPU metrics. You can then use metrics such as `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_GPU_TEMP` to determine your auto scaling behavior (see the example after this list). Learn more about NVIDIA DCGM.
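As a rough sketch of that last point (the service name, metric namespace, and target value below are illustrative assumptions, not values from this repository), a target-tracking scaling policy on the ECS service could reference such a custom CloudWatch metric once something, for example a sidecar scraping DCGM-Exporter, publishes it:

```sh
# Illustrative only: assumes DCGM_FI_DEV_GPU_UTIL is already being published to
# CloudWatch under a custom namespace, and that the scalable target
# (service/ecs-gpu-demo/<service-name>) is registered with Application Auto Scaling.
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/ecs-gpu-demo/<service-name> \
  --policy-name gpu-util-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 60.0,
    "CustomizedMetricSpecification": {
      "Namespace": "Custom/GPU",
      "MetricName": "DCGM_FI_DEV_GPU_UTIL",
      "Dimensions": [{"Name": "ClusterName", "Value": "ecs-gpu-demo"}],
      "Statistic": "Average"
    }
  }'
```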
- Fill the proper values in the `.env` file.
- Install AWS CDK.
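If the CDK CLI is not already available, it is typically installed through npm (this assumes Node.js is installed):

```sh
npm install -g aws-cdk
cdk --version
```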
- Use AWS CDK to deploy the AWS infrastructure.
 
cdk deploy --require-approval never
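Note: if the target account and Region have never been used with the CDK before, the deployment may fail until the environment is bootstrapped once:

```sh
cdk bootstrap
```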
- Build and push the image to Amazon ECR.
 
./build_image.sh
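The exact contents of `build_image.sh` are specific to this repository, but a manual build-and-push to ECR generally follows a sequence like this (account ID and repository name are placeholders):

```sh
AWS_ACCOUNT_ID=<your-account-id>
ECR_REPO=<your-ecr-repository>
REGISTRY=${AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com

# Authenticate Docker against the private registry, then build, tag, and push.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin ${REGISTRY}
docker build -t ${ECR_REPO}:latest .
docker tag ${ECR_REPO}:latest ${REGISTRY}/${ECR_REPO}:latest
docker push ${REGISTRY}/${ECR_REPO}:latest
```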
- Open two terminal sessions and exec into the ECS task.
 
TASK_ARN=
aws ecs execute-command \
  --region us-east-1 \
  --cluster ecs-gpu-demo \
  --task ${TASK_ARN} \
  --container gpu \
  --command "/bin/bash" \
  --interactive
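The value for `TASK_ARN` can be looked up with, for example (this assumes a single running task in the cluster):

```sh
TASK_ARN=$(aws ecs list-tasks \
  --region us-east-1 \
  --cluster ecs-gpu-demo \
  --query 'taskArns[0]' \
  --output text)
```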
- In one terminal, watch the GPU utilization.
 
watch -n0.1 nvidia-smi
- In the other terminal, stress test the GPU.
 
python3 test.py
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
