
Add MIG scripts for g7e (RTX PRO 6000 Blackwell) on EKS #73

Open

yoosful wants to merge 3 commits into aws-samples:main from yoosful:add-g7e-mig-scripts

Conversation

yoosful commented on Apr 24, 2026

Summary

Adds a g7e-blackwell/ subdirectory under 2.projects/mig-gpu-partitioning/ as a companion to the existing p5.48xlarge (H100) MIG guide.

  • g7e.2xlarge is the smallest G7e size (1 × NVIDIA RTX PRO 6000 Blackwell Server Edition, 96 GiB) and is the cheapest single-node way to exercise MIG end-to-end on EKS.
  • End-to-end: create a managed nodegroup (reusing an existing GPU nodegroup's IAM role), install the NVIDIA GPU Operator with mig.strategy=mixed + mig-manager (sketched below), partition the GPU (default all-1g.24gb → 4 equal 24 GiB slices), schedule a smoke-test pod on a MIG slice, and tear everything down.
  • Simple bash — no Terraform, no fork of the parent directory's structure.

Validated end-to-end against an EKS 1.33 cluster in us-west-2 on an AL2023 NVIDIA AMI.
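
For reference, the operator install (step 2) reduces to roughly the following (a minimal sketch; chart version pinning and the cgroup patch live in 02-install-gpu-operator.sh):

# Sketch only: install the GPU Operator with the mixed MIG strategy so the
# device plugin advertises per-profile nvidia.com/mig-* resources.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set mig.strategy=mixed \
  --set migManager.enabled=true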

What's in the directory

Script                      Purpose
env.sh                      Shared env (region, cluster, instance type, labels, taints)
01-create-nodegroup.sh      Create the managed nodegroup (supports SUBNETS= override for ICE)
02-install-gpu-operator.sh  helm install gpu-operator; patches a containerd drop-in for AL2023 + containerd v3 (see below)
03-apply-mig-config.sh      Apply nvidia.com/mig.config=<profile> and wait for mig.config.state=success
04-test-mig.sh              Schedule a short-lived CUDA pod on one nvidia.com/mig-* slice
99-cleanup.sh               Delete the nodegroup (stops billing) + optional EXTRA_SUBNETS cleanup
README.md                   Usage, prereqs, MIG profile discovery, capacity/ICE playbook, cgroup-fix explanation
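
env.sh is the only file the other scripts source; a representative sketch (variable names and defaults here are illustrative, not a verbatim copy):

# Illustrative sketch of env.sh; names and values are assumptions.
export AWS_REGION="us-west-2"
export CLUSTER_NAME="my-eks-cluster"      # scripts assume an existing EKS cluster
export NODEGROUP_NAME="g7e-mig"
export INSTANCE_TYPE="g7e.2xlarge"
export MIG_PROFILE="all-1g.24gb"          # default: 4 equal 24 GiB slices
export NODE_TAINTS="nvidia.com/gpu=true:NoSchedule"
# Optional: pin subnets once an ICE error reveals which AZ has capacity
# export SUBNETS="subnet-0123456789abcdef0"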

Notable gotchas documented in the README

  • G7e capacity is flaky: describe-instance-type-offerings lists g7e.2xlarge in every us-west-2 AZ, but real capacity wandered; tested via SUBNETS= pinning after an ICE error told us which AZ had room.
  • AL2023 NVIDIA + containerd v3 + gpu-operator v26 cgroup bug: the operator's nvidia-ctk runtime configure emits a v2-style drop-in without SystemdCgroup=true, which crashes every pod on the node with the error expected cgroupsPath to be of format "slice:prefix:name". 02-install-gpu-operator.sh now writes a correct v3 drop-in and restarts containerd + kubelet so the device plugin can actually advertise nvidia.com/mig-* (see the sketch after this list).
  • Blackwell MIG granularity is 4 (vs. 7 on A100/H100) with slice sizes driven by the SKU. For the g7e.2xlarge 96 GiB SKU we verified all-1g.24gb, all-2g.48gb, and all-4g.96gb.
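
The cgroup fix amounts to replacing the operator-generated drop-in with a config-v3 one. A minimal sketch of what 02-install-gpu-operator.sh writes on the node; the drop-in path, the toolkit binary location, and how the script reaches the node (privileged pod, SSM, etc.) are assumptions, not a verbatim copy:

# Sketch of the cgroup fix: a containerd config-v3 drop-in that keeps
# SystemdCgroup=true for the nvidia runtime, then a containerd/kubelet bounce.
cat <<'EOF' | sudo tee /etc/containerd/config.d/99-nvidia-cgroup.toml
version = 3

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    SystemdCgroup = true
EOF
sudo systemctl restart containerd kubelet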

Test plan

  • ./01-create-nodegroup.sh → nodegroup ACTIVE on g7e.2xlarge
  • ./02-install-gpu-operator.sh → operator pods Running, containerd drop-in applied
  • ./03-apply-mig-config.sh → mig.config.state=success, node advertises nvidia.com/mig-1g.24gb: 4
  • ./04-test-mig.sh → pod scheduled on a MIG slice; nvidia-smi from the pod shows MIG 1g.24gb Device 0 with 24 GiB of shared memory (see the pod sketch after this list)
  • ./99-cleanup.sh → nodegroup deleted, no g7e instances remain
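
The smoke test in step 4 boils down to a pod that requests exactly one MIG slice; a sketch under assumed names (pod name, image tag, and toleration are illustrative):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-1g.24gb: 1
EOF
# Wait for the one-shot pod to finish, then print its nvidia-smi output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/mig-smoke-test --timeout=300s
kubectl logs mig-smoke-test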

Validation evidence

Captured from the live run on g7e.2xlarge in us-west-2d (driver 580.126.09, CUDA 13.0).

1. nvidia-smi mig -lgip from the mig-manager pod after 03-apply-mig-config.sh — GPU partitioned into 4 × 1g.24gb:

+-------------------------------------------------------------------------------+
| GPU instance profiles:                                                        |
| GPU   Name               ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                                Free/Total   GiB              CE    JPEG  OFA  |
|===============================================================================|
|   0  MIG 1g.24gb         14     4/4        23.62      No     46     1     1   |
|                                                               1     1     0   |
+-------------------------------------------------------------------------------+
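
The listing above came from exec'ing into the mig-manager pod; it should be reproducible with something like the following (namespace and pod label are the gpu-operator defaults, adjust if your install differs):

POD=$(kubectl -n gpu-operator get pod -l app=nvidia-mig-manager \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n gpu-operator exec "$POD" -- nvidia-smi mig -lgip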

2. nvidia-smi from inside the smoke-test pod after 04-test-mig.sh — pod received a real MIG slice (GI 6 / CI 0, 24 GiB shared memory):

+-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:2B:00.0 Off |                   On |
| N/A   22C    P8             28W /  600W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |              Shared Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                Shared BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|==================+==================================+===========+=======================|
|  0    6   0   0  |              64MiB / 24192MiB    | 46      0 |  1   1    1    0    1 |
|                  |               0MiB /  8327MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

yoosful force-pushed the add-g7e-mig-scripts branch from 7141eba to b2cdfeb on April 25, 2026 at 02:55
yoosful and others added 2 commits on April 25, 2026 at 12:23
Follow CONTRIBUTING.md guidance to list new projects in both README
indexes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Call out that scripts assume an existing EKS cluster (same convention
used by every other 2.projects/* guide) and link to 1.infrastructure/
for the cluster setup. Drop the fork-specific default cluster name.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>