Updating CF stack to allow for local zone deployments for GB200 #838
base: main
1.2 TB storage which can be overridden by parameter. A role is also created which
helps to execute HyperPod cluster operations.
#TODO: DO THIS FOR EKS TOO.
Remove this. The template is only for Slurm, so the EKS context will not make sense to users.
Yes, in draft mode still.
Type: AWS::EC2::EIP
Properties:
Domain: vpc
NetworkBorderGroup: !Sub "${AWS::Region}-dfw-2" # TODO: CURRENTLY HARDCODED TO DFW.
Can we fetch Local Zone ID from somewhere in the stack, or as an input param, to make this dynamic?
Yes. Like I said, this is a draft atm, will test with hardcoded values and make it dynamic later.
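One way the hardcoded value could be made dynamic, as the reviewer suggests, is to take the Local Zone suffix as an input parameter. This is only a sketch: the parameter name `LocalZoneSuffix` and the logical ID `NATInstanceEIP` are hypothetical and do not appear in the PR, and the default simply mirrors the currently hardcoded `dfw-2`.

```yaml
AWSTemplateFormatVersion: "2010-09-09"

Parameters:
  # Hypothetical parameter; not part of the PR as posted.
  LocalZoneSuffix:
    Type: String
    Default: dfw-2  # mirrors the value currently hardcoded in the diff
    Description: Suffix appended to the Region name to form the network border group.

Resources:
  # Hypothetical logical ID; the diff snippet does not show the real one.
  NATInstanceEIP:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
      # Same pattern as the hardcoded line, with the suffix supplied at deploy time.
      NetworkBorderGroup: !Sub "${AWS::Region}-${LocalZoneSuffix}"
```

Keeping the `${AWS::Region}` prefix means only the zone-specific part has to be supplied when the stack is launched.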
Domain: vpc
NetworkBorderGroup: !Sub "${AWS::Region}-dfw-2" # TODO: CURRENTLY HARDCODED TO DFW.

NATInstance:
Is a NAT instance required for the Local Zone?
A NAT gateway is recommended for HA and resiliency. Doesn't a NAT instance introduce a single point of failure?
Yes. NAT gateways are not supported in the LZ.
A NAT gateway can be used in IAD, with traffic routed from the LZ to the NAT gateway in IAD. @bluecrayon52 implemented this for the EKS VPC stack; please consult with him. An AWS-managed NAT removes the single point of failure, and it also avoids finger-pointing at the CFN stack's NAT instance if a customer experiences network outages. It comes at the cost of potential latency overhead and traffic. Hrushi Gangur would be a good SA to consult on this as well.
…charges) + NAT Instance (not HA, but no cross AZ) Need to test both.
Remove constraint for AZ
These have both been tested and validated. I will streamline next week, and we'll be good to merge.
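For reference, the NAT gateway option discussed in this thread might look roughly like the fragment below: the gateway sits in a public subnet in a parent-Region AZ, and the Local Zone subnet's default route points at it. This is a sketch under stated assumptions, not the PR's implementation. All parameter names and logical IDs are hypothetical, and the VPC and subnets are assumed to exist already.

```yaml
AWSTemplateFormatVersion: "2010-09-09"

Parameters:
  # All three parameters are hypothetical inputs for this sketch.
  VpcId:
    Type: AWS::EC2::VPC::Id
  RegionalPublicSubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Public subnet in a parent-Region AZ that hosts the NAT gateway.
  LocalZonePrivateSubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Private subnet in the Local Zone.

Resources:
  RegionalNatEIP:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc

  # NAT gateway created in the parent Region, not in the Local Zone.
  RegionalNatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt RegionalNatEIP.AllocationId
      SubnetId: !Ref RegionalPublicSubnetId

  LocalZonePrivateRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VpcId

  # Default route: Local Zone traffic leaves through the regional NAT gateway.
  LocalZoneDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref LocalZonePrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref RegionalNatGateway

  LocalZoneSubnetAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref LocalZonePrivateSubnetId
      RouteTableId: !Ref LocalZonePrivateRouteTable
```

The NAT instance option would instead set `InstanceId` on the route and keep the traffic inside the Local Zone, which is the trade-off the thread weighs: managed availability versus the extra hop to the parent Region.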
Issue #, if available:
Description of changes:
This is currently a draft, pending tests. Please wait until testing is complete (ETA: EoD 09/02) before merging.
GB200 is only supported today in the DFW56 local zone. This CF stack adds a T/F toggle to allow for deployment in the local zone. It:
If `IsLocalZone` is set to `false`, then it will automatically revert to the original setup (deploying in the chosen AZ). Otherwise, it expects a valid local zone in the AZ field.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
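The `IsLocalZone` toggle described above could be wired up with a CloudFormation condition along these lines. This is a minimal sketch: only the `IsLocalZone` name comes from the description, while `LocalZoneBorderGroup`, `UseLocalZone`, and `NATInstanceEIP` are hypothetical names chosen for illustration.

```yaml
AWSTemplateFormatVersion: "2010-09-09"

Parameters:
  IsLocalZone:
    Type: String
    AllowedValues: ["true", "false"]
    Default: "false"
    Description: Set to true to deploy into a Local Zone.
  # Hypothetical parameter; only consulted when IsLocalZone is true.
  LocalZoneBorderGroup:
    Type: String
    Default: ""
    Description: Network border group of the Local Zone.

Conditions:
  # Hypothetical condition name.
  UseLocalZone: !Equals [!Ref IsLocalZone, "true"]

Resources:
  # Hypothetical logical ID.
  NATInstanceEIP:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
      # When the toggle is false, the property is dropped entirely, so the
      # address is allocated in the Region's default border group.
      NetworkBorderGroup: !If
        - UseLocalZone
        - !Ref LocalZoneBorderGroup
        - !Ref AWS::NoValue
```

Using `AWS::NoValue` in the false branch lets the same resource definition serve both paths, so the non-Local-Zone deployment behaves as if the property had never been added.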