Add MIG scripts for g7e (RTX PRO 6000 Blackwell) on EKS #73
Open
yoosful wants to merge 3 commits into aws-samples:main from
Conversation
Companion to the existing p5.48xlarge (H100) MIG guide. The smallest G7e size, g7e.2xlarge, ships a single RTX PRO 6000 Blackwell Server Edition and is the cheapest way to exercise MIG end-to-end on EKS. The scripts create a managed nodegroup, install the NVIDIA GPU operator with mig-manager (mixed strategy), partition the GPU (default `all-1g.24gb` → 4 slices), run a smoke-test pod on a MIG slice, and tear everything down. The README documents the gotchas we hit during validation:

- G7e insufficient-capacity errors (ICE) are common and `describe-instance-type-offerings` lies about availability, so `SUBNETS` can be used to pin a specific AZ after discovery.
- The AL2023 NVIDIA AMI ships containerd v3, but the gpu-operator v26 toolkit still emits a v2-style drop-in without `SystemdCgroup = true`, crashing every pod with `expected cgroupsPath ... slice:prefix:name`. The operator script now writes a correct v3 drop-in and restarts containerd and the kubelet so step 3 can actually advertise `nvidia.com/mig-*`.
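For reference, the v3-style drop-in the fix writes looks roughly like the following. This is a sketch assuming containerd config version 3 (containerd 2.x, where the CRI plugin key is `io.containerd.cri.v1.runtime`); the file path and runtime name are illustrative, not copied from the script:

```toml
# /etc/containerd/config.d/99-nvidia.toml (illustrative path)
version = 3

[plugins.'io.containerd.cri.v1.runtime'.containerd]
default_runtime_name = 'nvidia'

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]
runtime_type = 'io.containerd.runc.v2'

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia.options]
BinaryName = '/usr/bin/nvidia-container-runtime'
# The key fix: v2-style drop-ins omit this, which breaks cgroupsPath on systemd hosts.
SystemdCgroup = true
```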
Follow CONTRIBUTING.md guidance to list new projects in both README indexes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Call out that scripts assume an existing EKS cluster (same convention used by every other 2.projects/* guide) and link to 1.infrastructure/ for the cluster setup. Drop the fork-specific default cluster name. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds a `g7e-blackwell/` subdirectory under `2.projects/mig-gpu-partitioning/` as a companion to the existing p5.48xlarge (H100) MIG guide. `g7e.2xlarge` is the smallest G7e size (1 × NVIDIA RTX PRO 6000 Blackwell Server Edition, 96 GiB) and is the cheapest single-node way to exercise MIG end-to-end on EKS. The scripts create a managed nodegroup, install the GPU operator with `mig.strategy=mixed` + mig-manager, partition the GPU (default `all-1g.24gb` → 4 equal 24 GiB slices), schedule a smoke-test pod on a MIG slice, and tear everything down. Validated end-to-end against an EKS 1.33 cluster in us-west-2 on an AL2023 NVIDIA AMI.
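The partition step boils down to labeling the node and polling until mig-manager reports success. A minimal illustrative sketch, not the script's actual code: `wait_for` is a hypothetical helper, `NODE` and `MIG_PROFILE` are assumed variables, and the real kubectl invocation is shown as a comment.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the label-and-wait pattern used to apply a MIG profile.
set -euo pipefail

# Poll a command until its stdout equals the expected value, or give up.
wait_for() {
  local expected=$1 tries=$2; shift 2
  local i
  for ((i = 0; i < tries; i++)); do
    [[ "$("$@")" == "$expected" ]] && return 0
    sleep 5
  done
  return 1
}

# In a real script the polled command would be something like:
#   kubectl label node "$NODE" "nvidia.com/mig.config=$MIG_PROFILE" --overwrite
#   wait_for success 60 kubectl get node "$NODE" \
#     -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
# Demo with a stub command that reports success immediately:
wait_for success 3 echo success && echo "mig.config.state=success"
```

The state label is written back by mig-manager after repartitioning, so polling the label is what lets the script block until the node actually advertises the new resources.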
What's in the directory
- `env.sh` — shared configuration
- `01-create-nodegroup.sh` — create the managed nodegroup (`SUBNETS=` override for ICE)
- `02-install-gpu-operator.sh` — install the GPU operator (plus the containerd drop-in fix)
- `03-apply-mig-config.sh` — label the node `nvidia.com/mig.config=<profile>` and wait for `mig.config.state=success`
- `04-test-mig.sh` — schedule a smoke-test pod on a `nvidia.com/mig-*` slice
- `99-cleanup.sh` — tear down (`EXTRA_SUBNETS` cleanup)
- `README.md`

Notable gotchas documented in the README
- `describe-instance-type-offerings` lists g7e.2xlarge in every us-west-2 AZ but real capacity wandered; tested via `SUBNETS=` pinning after ICE told us which AZ had room.
- `nvidia-ctk runtime configure` emits a v2-style drop-in without `SystemdCgroup=true`, which crashes every pod on the node with `expected cgroupsPath to be of format "slice:prefix:name"`. `02-install-gpu-operator.sh` now writes a correct v3 drop-in and restarts containerd and the kubelet so the device plugin can actually advertise `nvidia.com/mig-*`.
- Documented profiles: `all-1g.24gb`, `all-2g.48gb`, and `all-4g.96gb`.

Test plan
1. `./01-create-nodegroup.sh` → nodegroup ACTIVE on g7e.2xlarge
2. `./02-install-gpu-operator.sh` → operator pods Running, containerd drop-in applied
3. `./03-apply-mig-config.sh` → `mig.config.state=success`, node advertises `nvidia.com/mig-1g.24gb: 4`
4. `./04-test-mig.sh` → pod scheduled on a MIG slice; `nvidia-smi` from the pod shows `MIG 1g.24gb Device 0` with shared memory 24 GiB
5. `./99-cleanup.sh` → nodegroup deleted, no g7e instances remain

Validation evidence
Captured from the live run on `g7e.2xlarge` in us-west-2d (driver 580.126.09, CUDA 13.0).

1. `nvidia-smi mig -lgip` from the mig-manager pod after `03-apply-mig-config.sh` — GPU partitioned into 4 × 1g.24gb:
2. `nvidia-smi` from inside the smoke-test pod after `04-test-mig.sh` — pod received a real MIG slice (GI 6 / CI 0, 24 GiB shared memory):
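The smoke test described above amounts to requesting one MIG slice as an extended resource. A minimal sketch of such a pod, where the pod name and image tag are illustrative and not taken from the scripts:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test                           # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:13.0.0-base-ubuntu24.04 # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          # One 1g.24gb slice, as advertised by the device plugin in mixed mode
          nvidia.com/mig-1g.24gb: 1
```

With `mig.strategy=mixed`, each MIG profile is exposed as its own resource name (`nvidia.com/mig-1g.24gb`, `nvidia.com/mig-2g.48gb`, ...), which is why the node advertises `nvidia.com/mig-1g.24gb: 4` after partitioning.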