3 changes: 2 additions & 1 deletion modules/ch2-mig/nav.adoc
@@ -3,5 +3,6 @@
*** xref:s1-mig-overview-2.adoc[]
*** xref:s1-mig-overview-3.adoc[]
*** xref:s1-mig-overview-4.adoc[]
*** xref:s1-mig-overview.adoc[]
*** xref:s1-mig-overview-5.adoc[]
*** xref:s2-mig-slicing-lab.adoc[]
*** xref:s2-mig-slicing-lab-2.adoc[]
239 changes: 1 addition & 238 deletions modules/ch2-mig/pages/s1-mig-overview-3.adoc
@@ -355,241 +355,4 @@ nodeSelector:
```

This architecture provides workload placement flexibility while optimizing each pool for its use case.
====

== MIG Benefits for MaaS

For Models-as-a-Service architectures, MIG provides measurable advantages that directly impact platform economics and operational reliability.

=== Cost Efficiency and ROI

* **Deploy 7 concurrent small models** on single A100-40GB (vs. 1 with full GPU allocation)
* **Reduce per-model GPU cost from $15,000 to $2,143** (7x cost reduction for `1g.5gb` profiles)
* **Increase cluster-wide GPU utilization from 33% to 78%** (2.4x improvement over exclusive allocation)
* **Achieve ROI break-even 2.4x faster** than full GPU deployments (10 months vs. 24 months baseline)
* **Right-size GPU resources** to actual model requirements, eliminating overprovisioning waste

**Concrete example**: A platform serving 21 small models previously required 21x A100 GPUs ($315,000 in capital). With MIG `1g.5gb` profiles, the same workload runs on 3x A100 GPUs ($45,000 in capital), saving $270,000.

=== Performance Isolation and Predictability

* **Guarantee P99 latency variance <1%** (vs. 15-40% with time-slicing)
* **Hardware-enforced memory isolation** prevents out-of-memory (OOM) crosstalk between workloads
* **Dedicated streaming multiprocessors** eliminate compute contention and throttling
* **Enable SLA-backed inference** with measurable, enforceable performance guarantees
* **Prevent noisy neighbor problems** through physical resource partitioning

Workloads that share a GPU through time-slicing experience latency spikes when concurrent requests arrive. MIG instances maintain consistent latency regardless of neighboring workload activity.

=== Operational Flexibility and Multi-Tenancy

* **Scale model instances independently** without affecting neighboring services (e.g., scale LLaMA-7B from 1 to 3 replicas without reprovisioning hardware)
* **Mix small and large models** on same physical GPU (e.g., `1g.5gb` microservices alongside `3g.20gb` large language models)
* **Support true multi-tenant deployments** with hardware isolation between teams
* **Assign dedicated MIG instances** to specific tenants or namespaces for guaranteed capacity (a quota sketch follows this list)
* **Enable chargeback and cost allocation** with accurate per-MIG-instance metrics from DCGM
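
As a sketch of the per-tenant capacity and chargeback points above (not a lab step), a namespace-scoped `ResourceQuota` can cap how many MIG instances a team may request. The namespace name and quota value are illustrative assumptions:

[source,yaml]
----
# Sketch only: caps the team-a-models namespace at two 2g.10gb MIG instances.
# Namespace name and quota value are illustrative assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mig-quota
  namespace: team-a-models
spec:
  hard:
    requests.nvidia.com/mig-2g.10gb: "2" # <1>
----
<1> For extended resources such as MIG profiles, quota is expressed on the `requests.` form of the resource name

Paired with per-MIG-instance DCGM metrics, this gives each tenant a hard capacity ceiling and a clean basis for chargeback.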

=== Resource Predictability and Capacity Planning

* **Guaranteed GPU resources per deployment**: Each InferenceService gets dedicated SMs and VRAM
* **Reduce capacity planning variance from ±40% to ±5%**: MIG instances have predictable, fixed resource allocations
* **Simplify quota management**: Assign 2x `2g.10gb` instances per team, enforceable via Kueue (Chapter 4; a preview sketch follows this list)
* **Enable accurate per-model billing**: DCGM provides per-MIG-instance utilization metrics for cost tracking
* **Predictable failure domains**: OOM or crash in one MIG instance doesn't affect others on same GPU
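
Chapter 4 configures Kueue properly; purely as a preview of the quota-management bullet above, here is a hedged sketch of a ClusterQueue that enforces "2x `2g.10gb` per team". The queue name and the `default-flavor` ResourceFlavor are assumptions:

[source,yaml]
----
# Preview sketch only -- Kueue setup is covered in Chapter 4.
# Queue name and ResourceFlavor are illustrative assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-queue
spec:
  namespaceSelector: {} # <1>
  resourceGroups:
  - coveredResources: ["nvidia.com/mig-2g.10gb"]
    flavors:
    - name: default-flavor # <2>
      resources:
      - name: "nvidia.com/mig-2g.10gb"
        nominalQuota: 2 # <3>
----
<1> Admits workloads from any namespace that points a LocalQueue at this ClusterQueue
<2> Assumes a ResourceFlavor named `default-flavor` already exists
<3> At most two `2g.10gb` instances admitted for this team at any time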

== Production Considerations and Gotchas

While MIG provides significant benefits, production deployments require careful planning around reconfiguration downtime, workload placement, and ongoing monitoring.

=== Reconfiguration Downtime Planning

Unlike time-slicing (which requires only a device plugin pod restart), MIG reconfiguration requires a GPU hardware reset and workload migration.

**Downtime Components for Profile Changes**:

1. **Node cordon**: Immediate (prevents new scheduling)
2. **Node drain**: 5-10 minutes (waiting for workload migration to other nodes)
3. **MIG mode enablement**: 10-15 seconds (GPU hardware reset)
4. **Profile application**: 30-60 seconds (instance creation)
5. **GFD label update**: 30-60 seconds (capability rescan)
6. **Device plugin restart**: 30-60 seconds (resource rediscovery)
7. **Node uncordon + scheduling**: 30-60 seconds

**Total**: 10-20 minutes per node for profile changes (per OpenShift documentation)
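
The workflow above is triggered declaratively: the GPU Operator's MIG Manager watches the `nvidia.com/mig.config` node label and reconciles the hardware to match, reporting progress through `nvidia.com/mig.config.state`. A minimal sketch of the labels involved (the node name is an illustrative assumption):

[source,yaml]
----
# Sketch: node labels that drive and report the MIG Manager workflow.
apiVersion: v1
kind: Node
metadata:
  name: worker-gpu-0 # <1>
  labels:
    nvidia.com/mig.config: all-2g.10gb # <2>
    nvidia.com/mig.config.state: success # <3>
----
<1> Illustrative node name
<2> Desired profile; changing this label starts the reconfiguration sequence
<3> Written by MIG Manager (`pending`, `rebooting`, `success`, or `failed`); wait for `success` before uncordoning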

[WARNING]
====
**Plan MIG Profile Changes During Maintenance Windows**

Changing MIG profiles on production nodes serving live traffic causes:

* Immediate termination of all GPU workloads on that node
* Pod eviction and rescheduling to other nodes
* Temporary capacity reduction during reconfiguration
* Potential cascading failures if spare capacity is insufficient

**For rolling MIG updates across N nodes**:

* Plan for **N × 20 minutes** total reconfiguration time
* Ensure **spare GPU capacity** exists for pod rescheduling (recommend N+2 redundancy)
* Use node cordoning to control blast radius: `oc adm cordon worker-gpu-0`
* Test profile changes in dev environment first
* Schedule during low-traffic windows (e.g., weekend maintenance)
* Document rollback procedure (revert to `all-disabled`, then previous profile)
====

=== Workload Placement Strategies

MIG instances require **explicit resource requests**. Workloads must request specific MIG profiles when using the **mixed** advertisement strategy.

**Correct Resource Request (Mixed Strategy)**:

[source,yaml]
----
resources:
  limits:
    nvidia.com/mig-2g.10gb: 1 # <1>
----
<1> Request specific MIG profile, matches advertised resource name

**Common Mistake**:

[source,yaml]
----
resources:
  limits:
    nvidia.com/gpu: 1 # <1>
----
<1> ❌ Will NOT schedule on MIG-partitioned nodes using mixed strategy

**Consequences of incorrect requests**:

* Pod stuck in `Pending` state with `FailedScheduling` event
* Error: "Insufficient nvidia.com/gpu" (even though MIG instances available)
* Requires pod specification update and redeployment

[TIP]
====
**Use Admission Controllers for Default MIG Profiles**

For platforms deploying many inference services, create a `MutatingWebhookConfiguration` that automatically injects appropriate MIG resource requests based on pod annotations or namespace labels. This prevents scheduling failures from incorrect resource requests.

**Example implementation**:

* Pods in namespace `small-models` automatically get `nvidia.com/mig-1g.5gb: 1` injected
* Pods with annotation `mig-profile: medium` get `nvidia.com/mig-2g.10gb: 1`
* Pods without annotations remain unchanged (for flexibility)

This pattern reduces operational toil and prevents common scheduling errors; it is especially useful for teams running 50+ inference services.
====
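
A hedged sketch of how such a webhook might be registered is shown below; the service name, namespace, and label key are assumptions, and the webhook server that performs the actual patch is not shown:

[source,yaml]
----
# Sketch only: mutates pod creation in namespaces labeled mig-default-profile=small.
# Service name, namespace, and label key are illustrative assumptions; the webhook
# server that injects the nvidia.com/mig-1g.5gb request must be deployed separately.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: mig-profile-defaulter
webhooks:
- name: default.mig.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore # <1>
  clientConfig:
    service:
      name: mig-profile-defaulter
      namespace: gpu-platform
      path: /mutate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  namespaceSelector:
    matchLabels:
      mig-default-profile: small # <2>
----
<1> Never blocks pod creation if the webhook service is unavailable
<2> Only namespaces that opt in via this label are mutated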

=== Monitoring MIG Utilization

DCGM (Data Center GPU Manager, deployed in Chapter 1) provides per-MIG-instance metrics. Monitor these to validate your profile choices and identify optimization opportunities.

**Key Metrics to Track**:

* `DCGM_FI_PROF_GR_ENGINE_ACTIVE`: Compute utilization percentage per MIG instance
* `DCGM_FI_DEV_FB_USED`: Framebuffer memory used (in MiB) per MIG instance
* `DCGM_FI_DEV_GPU_TEMP`: GPU temperature, reported per physical GPU (thermal throttling indicator)
* `DCGM_FI_DEV_POWER_USAGE`: Power consumption, reported per physical GPU

**Red Flags Indicating Misconfigurations**:

* **MIG instance averaging >85% memory usage**: Profile too small, workload may OOM, resize to larger profile
* **MIG instance averaging <20% compute utilization**: Profile too large, wasted resources, resize to smaller profile
* **Frequent pod evictions (OOMKilled)**: Memory oversubscription, increase profile memory allocation
* **High scheduling failure rate**: Insufficient MIG capacity for demand, add nodes or adjust profiles

**Example Prometheus query for MIG memory usage**:

[source,promql]
----
DCGM_FI_DEV_FB_USED{GPU_I_ID=~".+"} /
DCGM_FI_DEV_FB_TOTAL{GPU_I_ID=~".+"} * 100
----

You'll build comprehensive MIG monitoring dashboards in Chapter 3.
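
Until those dashboards exist, the ">85% memory usage" red flag above can be encoded as an alert. A minimal sketch, assuming your Prometheus instance watches `PrometheusRule` objects in this namespace; the rule name, namespace, and 15-minute window are illustrative:

[source,yaml]
----
# Sketch only: fires when a MIG instance stays above 85% framebuffer usage.
# Rule name, namespace, threshold window, and labels are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mig-memory-pressure
  namespace: nvidia-gpu-operator
spec:
  groups:
  - name: mig.rules
    rules:
    - alert: MIGInstanceMemoryPressure
      expr: |
        DCGM_FI_DEV_FB_USED{GPU_I_ID=~".+"} /
        DCGM_FI_DEV_FB_TOTAL{GPU_I_ID=~".+"} * 100 > 85
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "MIG instance {{ $labels.GPU_I_ID }} is above 85% framebuffer usage; consider a larger profile"
----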

[IMPORTANT]
====
**MIG Instance Naming in Metrics**

DCGM metrics use `GPU_I_ID` label to distinguish MIG instances:

* `GPU=0, GPU_I_ID=0`: First MIG instance on GPU 0
* `GPU=0, GPU_I_ID=1`: Second MIG instance on GPU 0

This differs from Kubernetes resource names (`nvidia.com/mig-2g.10gb`). Correlation requires mapping GFD labels (showing profile types) to DCGM metrics (showing instance IDs).

**Dashboard design tip**: Group metrics by profile type using GFD label joins to show "all 2g.10gb instances" rather than raw instance IDs.
====

== Hardware Requirements and GPU Model Support

MIG is available only on NVIDIA data center GPUs built on the Ampere and Hopper architectures:

**Supported GPU Models**:

* **NVIDIA A30 (24GB)**: Ampere architecture, up to 4 MIG instances (profiles: 1g.6gb, 2g.12gb, 4g.24gb)
* **NVIDIA A100-40GB**: Ampere architecture, up to 7 MIG instances (profiles: 1g.5gb through 7g.40gb)
* **NVIDIA A100-80GB**: Ampere architecture, up to 7 MIG instances (profiles: 1g.10gb through 7g.80gb)
* **NVIDIA H100 (80GB/94GB)**: Hopper architecture, enhanced MIG with up to 7 instances plus confidential computing per-instance support

**Unsupported GPU Models** (must use time-slicing instead):

* NVIDIA V100 (Volta architecture)
* NVIDIA T4 (Turing architecture)
* NVIDIA A10, A16, A40 (Ampere architecture, but graphics- and virtualization-focused GPUs that lack the MIG feature)
* NVIDIA L4, L40 (Ada Lovelace architecture)
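
In clusters that mix both groups, GPU Feature Discovery's capability labels keep MIG-dependent workloads off the unsupported GPUs. A small pod spec fragment as a sketch (the label is published by GFD; the profile shown is illustrative):

[source,yaml]
----
# Sketch: schedule a MIG-dependent workload only onto MIG-capable nodes.
nodeSelector:
  nvidia.com/mig.capable: "true" # <1>
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
----
<1> Label published by GPU Feature Discovery on nodes whose GPUs support MIG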

[NOTE]
====
**H100 Enhancements Over A100**

H100 GPUs offer additional MIG capabilities beyond A100:

* **Confidential computing support per MIG instance**: Each instance can run in secure encrypted mode
* **Improved memory bandwidth allocation**: Better isolation between instances
* **Faster reconfiguration**: H100 MIG mode changes complete ~15% faster than A100

For most Models-as-a-Service inference workloads, **A100 provides optimal cost/performance balance**. H100 advantages primarily benefit specialized secure computation or extremely memory-bandwidth-intensive workloads. Evaluate H100 premium pricing (~2.5x A100 cost) against actual workload requirements before procurement.
====

== What's Next

In the next lab section, you will apply the concepts from this overview hands-on, transforming your GPU infrastructure from exclusive allocation into a multi-tenant, MIG-partitioned platform.

**Lab Activities**:

* **Verify MIG capability** on your A100 GPU nodes using `nvidia-smi`
* **Set MIG advertisement strategy** (mixed) via ClusterPolicy patch
* **Label GPU nodes** with built-in MIG profiles (`all-2g.10gb`, `all-balanced`)
* **Monitor MIG Manager logs** during reconfiguration to observe the 10-20 minute workflow
* **Verify MIG instance allocation** via `oc describe node` showing `nvidia.com/mig-*` resources
* **Create custom mig-parted ConfigMap** for heterogeneous profile combinations (e.g., 1x `3g.20gb` + 2x `2g.10gb`)
* **Deploy test CUDA workloads** requesting specific MIG profiles (a sample pod sketch follows this list)
* **Validate hardware isolation** by running `nvidia-smi` inside pods to confirm dedicated resource allocation
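
As a rough idea of what these test workloads can look like (the lab provides the exact manifests; the image tag and names here are assumptions), a pod that requests one `2g.10gb` instance and lists the device it receives:

[source,yaml]
----
# Sketch only: the lab supplies the real manifests; image tag and names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi8 # <1>
    command: ["nvidia-smi", "-L"] # <2>
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1 # <3>
----
<1> Assumed CUDA base image tag
<2> Lists only the single MIG device granted to the pod, confirming isolation
<3> One hardware-isolated `2g.10gb` instance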

**Expected Outcomes**:

* A single A100-40GB will transform from `nvidia.com/gpu: 1` to `nvidia.com/mig-2g.10gb: 3` allocatable resources
* You'll deploy 3 concurrent CUDA test pods on one physical GPU, each in isolated MIG instances
* You'll observe <1% latency variance between pods running simultaneously (demonstrating isolation)
* You'll create the mixed-inference profile from the multi-tenant scenario (1x `3g.20gb` + 2x `2g.10gb`)
* You'll experience the full MIG reconfiguration workflow including node drain, profile application, and resource verification

**Skills You'll Develop**:

* Label-driven MIG configuration workflow
* Troubleshooting MIG Manager using logs and node labels
* Validating MIG instance creation with `nvidia-smi mig -lgi`
* Correlating Kubernetes resource advertisements with physical MIG partitions
* Planning MIG profile changes with minimal production disruption

This lab transforms your GPU infrastructure from **1 workload per GPU** (full allocation) to **3-7 workloads per GPU** (MIG partitioning) while maintaining production-grade isolation and predictable performance.

////
**Maximizing GPU ROI:**
Understand how the Multi-Instance GPU (MIG) feature splits hardware resources into multiple GPU instances, operating completely isolated from each other. Evaluate MIG advertisement strategies: Single (homogeneous) vs. Mixed (heterogeneous) slicing.
////
====