amanshanbhag
Contributor

Issue #, if available:

Description of changes:
This is currently a draft, pending tests. Please wait until testing is complete (ETA: EoD 09/02) before merging.

GB200 is currently supported only in the DFW56 local zone. This CloudFormation (CF) stack adds a true/false toggle, IsLocalZone, to allow deployment in the local zone. It:

  • Creates and configures a NAT instance (not a NAT gateway) for internet access
  • Sets up all resources in the local zone (including the file system)
  • Keeps the VPC spanning the region, with a subnet in the local zone
  • If IsLocalZone is set to false, it automatically reverts to the original setup (deploying in the chosen AZ). Otherwise, it expects a valid local zone in the AZ field.
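
The toggle described above could be wired up roughly as follows. This is only a minimal sketch, not the template in this PR: `IsLocalZone` comes from the description, while `AvailabilityZoneName`, `VpcId`, `SubnetCidr`, `DeployInLocalZone`, and the resource names are hypothetical, and the border group is hardcoded to DFW as in the draft.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  IsLocalZone:
    Type: String
    AllowedValues: ["true", "false"]
    Default: "false"
  AvailabilityZoneName:        # hypothetical name for the "AZ field"
    Type: String
    Description: An AZ name, or a local zone name when IsLocalZone is true
  VpcId:                       # hypothetical; the real stack creates its own VPC
    Type: AWS::EC2::VPC::Id
  SubnetCidr:                  # hypothetical
    Type: String
    Default: 10.1.0.0/16

Conditions:
  DeployInLocalZone: !Equals [!Ref IsLocalZone, "true"]

Resources:
  # Created in both modes; the zone name decides where the subnet lands.
  PrimarySubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VpcId
      CidrBlock: !Ref SubnetCidr
      AvailabilityZone: !Ref AvailabilityZoneName

  # Created only when the toggle is "true"; skipped in the original setup.
  NatInstanceEip:
    Type: AWS::EC2::EIP
    Condition: DeployInLocalZone
    Properties:
      Domain: vpc
      NetworkBorderGroup: !Sub "${AWS::Region}-dfw-2"  # hardcoded, as in the draft
```

With `IsLocalZone` set to `false`, the conditional resources are not created and only the regular subnet is deployed in the chosen AZ.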

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

1.2 TB storage which can be overridden by parameter. A role is also created which
helps to execute HyperPod cluster operations.
#TODO: DO THIS FOR EKS TOO.
Contributor

Remove this. The template is just for Slurm, so this context will not make sense to users.

Contributor Author

Yes, this is still in draft mode.

    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
      NetworkBorderGroup: !Sub "${AWS::Region}-dfw-2" # TODO: CURRENTLY HARDCODED TO DFW.
Contributor

Can we fetch the Local Zone ID from somewhere in the stack, or take it as an input parameter, to make this dynamic?
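
For illustration, the input-parameter option might look like the sketch below. The parameter name `LocalZoneNetworkBorderGroup` and the resource name `NatInstanceEip` are hypothetical, and the default simply mirrors the value currently hardcoded for DFW.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  LocalZoneNetworkBorderGroup:   # hypothetical parameter name
    Type: String
    Default: us-east-1-dfw-2     # assumed default, matching the hardcoded DFW value in us-east-1
    Description: Network border group of the target local zone

Resources:
  NatInstanceEip:                # hypothetical resource name
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
      # Taken from the parameter instead of a hardcoded "-dfw-2" suffix
      NetworkBorderGroup: !Ref LocalZoneNetworkBorderGroup
```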

Contributor Author

Yes. As I said, this is a draft at the moment. I will test with hardcoded values and make it dynamic later.

      Domain: vpc
      NetworkBorderGroup: !Sub "${AWS::Region}-dfw-2" # TODO: CURRENTLY HARDCODED TO DFW.

  NATInstance:
Contributor

Is a NAT instance required for the local zone?

A NAT gateway is recommended for HA and resiliency. Doesn't a NAT instance introduce a single point of failure?

Contributor Author

Yes. NAT gateways are not supported in local zones.

Contributor

A NAT gateway can be used in IAD, with traffic routed from the LZ to the NAT gateway in IAD. @bluecrayon52 implemented this for the EKS VPC stack. Please consult with him. An AWS-managed NAT removes the single point of failure, and also avoids finger-pointing at the CFN stack's NAT instance if a customer experiences network outages. It comes at the cost of potential latency overhead and traffic. Hrushi Gangur would be a good SA to consult on this as well.
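
As a rough sketch of that routing (not the EKS VPC stack implementation, which is not shown in this thread), the local zone subnet's default route could point at a NAT gateway placed in a public subnet in the parent region. All parameter and resource names below are hypothetical.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  VpcId:                     # hypothetical; existing VPC spanning the region
    Type: AWS::EC2::VPC::Id
  RegionalPublicSubnetId:    # hypothetical; public subnet in a regular AZ
    Type: AWS::EC2::Subnet::Id
  LocalZoneSubnetId:         # hypothetical; private subnet in the local zone
    Type: AWS::EC2::Subnet::Id

Resources:
  NatEip:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc

  # The NAT gateway lives in the parent region, not in the local zone.
  RegionalNatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref RegionalPublicSubnetId

  LocalZoneRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VpcId

  # Default route for the local zone subnet goes through the regional NAT gateway.
  LocalZoneDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref LocalZoneRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref RegionalNatGateway

  LocalZoneSubnetAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref LocalZoneRouteTable
      SubnetId: !Ref LocalZoneSubnetId
```

In this layout, outbound traffic from the local zone crosses to the parent region before reaching the internet, which is where the latency and traffic cost would come from.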

@amanshanbhag
Contributor Author

These have both been tested and validated. I will streamline next week, and we'll be good to merge.
