
Add MIG scripts for g7e (RTX PRO 6000 Blackwell) on EKS #73

Open

yoosful wants to merge 3 commits into aws-samples:main from yoosful:add-g7e-mig-scripts

Conversation

yoosful commented on Apr 24, 2026

Summary

Adds a g7e-blackwell/ subdirectory under 2.projects/mig-gpu-partitioning/ as a companion to the existing p5.48xlarge (H100) MIG guide.

  • g7e.2xlarge is the smallest G7e size (1 × NVIDIA RTX PRO 6000 Blackwell Server Edition, 96 GiB) and is the cheapest single-node way to exercise MIG end-to-end on EKS.
  • End-to-end: create a managed nodegroup (reusing an existing GPU nodegroup's IAM role), install the NVIDIA GPU Operator with mig.strategy=mixed + mig-manager (sketched below), partition the GPU (default all-1g.24gb → 4 equal 24 GiB slices), schedule a smoke-test pod on a MIG slice, and tear everything down.
  • Simple bash — no Terraform, no fork of the parent directory's structure.

Validated end-to-end against an EKS 1.33 cluster in us-west-2 on an AL2023 NVIDIA AMI.
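
For reference, the operator install (step 2) reduces to roughly the following (a minimal sketch; chart version pinning and the cgroup patch live in 02-install-gpu-operator.sh):

# Sketch only: install the GPU Operator with the mixed MIG strategy so the
# device plugin advertises per-profile nvidia.com/mig-* resources.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set mig.strategy=mixed \
  --set migManager.enabled=true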

What's in the directory

Script                      Purpose
env.sh                      Shared env (region, cluster, instance type, labels, taints)
01-create-nodegroup.sh      Create the managed nodegroup (supports SUBNETS= override for ICE)
02-install-gpu-operator.sh  helm install gpu-operator; patches a containerd drop-in for AL2023 + containerd v3 (see below)
03-apply-mig-config.sh      Apply nvidia.com/mig.config=<profile> and wait for mig.config.state=success
04-test-mig.sh              Schedule a short-lived CUDA pod on one nvidia.com/mig-* slice
99-cleanup.sh               Delete the nodegroup (stops billing) + optional EXTRA_SUBNETS cleanup
README.md                   Usage, prereqs, MIG profile discovery, capacity/ICE playbook, cgroup-fix explanation
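
env.sh is the only file the other scripts source; a representative sketch (variable names and defaults here are illustrative, not a verbatim copy):

# Illustrative sketch of env.sh; names and values are assumptions.
export AWS_REGION="us-west-2"
export CLUSTER_NAME="my-eks-cluster"      # scripts assume an existing EKS cluster
export NODEGROUP_NAME="g7e-mig"
export INSTANCE_TYPE="g7e.2xlarge"
export MIG_PROFILE="all-1g.24gb"          # default: 4 equal 24 GiB slices
export NODE_TAINTS="nvidia.com/gpu=true:NoSchedule"
# Optional: pin subnets once an ICE error reveals which AZ has capacity
# export SUBNETS="subnet-0123456789abcdef0"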

Notable gotchas documented in the README

  • G7e capacity is flaky: describe-instance-type-offerings lists g7e.2xlarge in every us-west-2 AZ, but real capacity wandered; tested via SUBNETS= pinning after an ICE error told us which AZ had room.
  • AL2023 NVIDIA + containerd v3 + gpu-operator v26 cgroup bug: the operator's nvidia-ctk runtime configure emits a v2-style drop-in without SystemdCgroup=true, which crashes every pod on the node with the error expected cgroupsPath to be of format "slice:prefix:name". 02-install-gpu-operator.sh now writes a correct v3 drop-in and restarts containerd + kubelet so the device plugin can actually advertise nvidia.com/mig-* (see the sketch after this list).
  • Blackwell MIG granularity is 4 (vs. 7 on A100/H100) with slice sizes driven by the SKU. For the g7e.2xlarge 96 GiB SKU we verified all-1g.24gb, all-2g.48gb, and all-4g.96gb.
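
The cgroup fix amounts to replacing the operator-generated drop-in with a config-v3 one. A minimal sketch of what 02-install-gpu-operator.sh writes on the node; the drop-in path, the toolkit binary location, and how the script reaches the node (privileged pod, SSM, etc.) are assumptions, not a verbatim copy:

# Sketch of the cgroup fix: a containerd config-v3 drop-in that keeps
# SystemdCgroup=true for the nvidia runtime, then a containerd/kubelet bounce.
cat <<'EOF' | sudo tee /etc/containerd/config.d/99-nvidia-cgroup.toml
version = 3

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
    SystemdCgroup = true
EOF
sudo systemctl restart containerd kubelet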

Test plan

  • ./01-create-nodegroup.sh → nodegroup ACTIVE on g7e.2xlarge
  • ./02-install-gpu-operator.sh → operator pods Running, containerd drop-in applied
  • ./03-apply-mig-config.sh → mig.config.state=success, node advertises nvidia.com/mig-1g.24gb: 4
  • ./04-test-mig.sh → pod scheduled on a MIG slice; nvidia-smi from the pod shows MIG 1g.24gb Device 0 with 24 GiB of shared memory (see the pod sketch after this list)
  • ./99-cleanup.sh → nodegroup deleted, no g7e instances remain
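
The smoke test in step 4 boils down to a pod that requests exactly one MIG slice; a sketch under assumed names (pod name, image tag, and toleration are illustrative):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-1g.24gb: 1
EOF
# Wait for the one-shot pod to finish, then print its nvidia-smi output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/mig-smoke-test --timeout=300s
kubectl logs mig-smoke-test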

Validation evidence

Captured from the live run on g7e.2xlarge in us-west-2d (driver 580.126.09, CUDA 13.0).

1. nvidia-smi mig -lgip from the mig-manager pod after 03-apply-mig-config.sh — GPU partitioned into 4 × 1g.24gb:

+-------------------------------------------------------------------------------+
| GPU instance profiles:                                                        |
| GPU   Name               ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                                Free/Total   GiB              CE    JPEG  OFA  |
|===============================================================================|
|   0  MIG 1g.24gb         14     4/4        23.62      No     46     1     1   |
|                                                               1     1     0   |
+-------------------------------------------------------------------------------+
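
The listing above came from exec'ing into the mig-manager pod; it should be reproducible with something like the following (namespace and pod label are the gpu-operator defaults, adjust if your install differs):

POD=$(kubectl -n gpu-operator get pod -l app=nvidia-mig-manager \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n gpu-operator exec "$POD" -- nvidia-smi mig -lgip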

2. nvidia-smi from inside the smoke-test pod after 04-test-mig.sh — pod received a real MIG slice (GI 6 / CI 0, 24 GiB shared memory):

+-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:2B:00.0 Off |                   On |
| N/A   22C    P8             28W /  600W |                  N/A   |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |              Shared Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                Shared BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|==================+==================================+===========+=======================|
|  0    6   0   0  |              64MiB / 24192MiB    | 46      0 |  1   1    1    0    1 |
|                  |               0MiB /  8327MiB    |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

yoosful force-pushed the add-g7e-mig-scripts branch from 7141eba to b2cdfeb on April 25, 2026 at 02:55
yoosful and others added 2 commits on April 25, 2026 at 12:23
Follow CONTRIBUTING.md guidance to list new projects in both README
indexes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Call out that scripts assume an existing EKS cluster (same convention
used by every other 2.projects/* guide) and link to 1.infrastructure/
for the cluster setup. Drop the fork-specific default cluster name.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>