This repository is intended for engineers looking to horizontally scale GPU-based Machine Learning (ML) workloads on Amazon ECS. This example is for demonstrative purposes only and is not intended for production use.
- By default, GPU utilization metrics are not part of the predefined metrics available with Application Auto Scaling.
- As such, you implement auto scaling based on custom metrics. See Autoscaling Amazon ECS services based on custom metrics with Application Auto Scaling.
- For NVIDIA-based GPUs, you use DCGM-Exporter in your container to expose GPU metrics. You can then use metrics such as `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_GPU_TEMP` to determine your auto scaling behavior (see the example after this list). Learn more about NVIDIA DCGM.
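As a rough sketch of that last point (the service name, metric namespace, and target value below are illustrative assumptions, not values from this repository), a target-tracking scaling policy on the ECS service could reference such a custom CloudWatch metric once something, for example a sidecar scraping DCGM-Exporter, publishes it:

```sh
# Illustrative only: assumes DCGM_FI_DEV_GPU_UTIL is already being published to
# CloudWatch under a custom namespace, and that the scalable target
# (service/ecs-gpu-demo/<service-name>) is registered with Application Auto Scaling.
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/ecs-gpu-demo/<service-name> \
  --policy-name gpu-util-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 60.0,
    "CustomizedMetricSpecification": {
      "Namespace": "Custom/GPU",
      "MetricName": "DCGM_FI_DEV_GPU_UTIL",
      "Dimensions": [{"Name": "ClusterName", "Value": "ecs-gpu-demo"}],
      "Statistic": "Average"
    }
  }'
```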
- Fill the proper values in the `.env` file.
- Install AWS CDK.
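If the CDK CLI is not already available, it is typically installed through npm (this assumes Node.js is installed):

```sh
npm install -g aws-cdk
cdk --version
```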
- Use AWS CDK to deploy the AWS infrastructure.
 
cdk deploy --require-approval never
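Note: if the target account and Region have never been used with the CDK before, the deployment may fail until the environment is bootstrapped once:

```sh
cdk bootstrap
```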
- Build and push the image to Amazon ECR.
 
./build_image.sh
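The exact contents of `build_image.sh` are specific to this repository, but a manual build-and-push to ECR generally follows a sequence like this (account ID and repository name are placeholders):

```sh
AWS_ACCOUNT_ID=<your-account-id>
ECR_REPO=<your-ecr-repository>
REGISTRY=${AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com

# Authenticate Docker against the private registry, then build, tag, and push.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin ${REGISTRY}
docker build -t ${ECR_REPO}:latest .
docker tag ${ECR_REPO}:latest ${REGISTRY}/${ECR_REPO}:latest
docker push ${REGISTRY}/${ECR_REPO}:latest
```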
- Open two terminal sessions and exec into the ECS task.
 
TASK_ARN=
aws ecs execute-command \
  --region us-east-1 \
  --cluster ecs-gpu-demo \
  --task ${TASK_ARN} \
  --container gpu \
  --command "/bin/bash" \
  --interactive
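The value for `TASK_ARN` can be looked up with, for example (this assumes a single running task in the cluster):

```sh
TASK_ARN=$(aws ecs list-tasks \
  --region us-east-1 \
  --cluster ecs-gpu-demo \
  --query 'taskArns[0]' \
  --output text)
```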
- In one terminal, watch the GPU utilization.
 
watch -n0.1 nvidia-smi
- In the other terminal, stress test the GPU.
 
python3 test.py
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
