From 9fa6f44d8bdedc18bbb66eabe09f831704c12374 Mon Sep 17 00:00:00 2001 From: Karlos K <168231563+kknoxrht@users.noreply.github.com> Date: Mon, 13 Apr 2026 18:43:16 -0500 Subject: [PATCH] mig page updates --- modules/ch2-mig/nav.adoc | 3 +- modules/ch2-mig/pages/s1-mig-overview-3.adoc | 239 +----------------- .../ch2-mig/pages/s2-mig-slicing-lab-2.adoc | 225 +++++++++++++++++ modules/ch2-mig/pages/s2-mig-slicing-lab.adoc | 229 +---------------- .../pages/s2-expose-metrics-lab.adoc | 5 + .../pages/s3-grafana-setup-lab.adoc | 5 + 6 files changed, 244 insertions(+), 462 deletions(-) create mode 100644 modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc diff --git a/modules/ch2-mig/nav.adoc b/modules/ch2-mig/nav.adoc index a3d4219..65314c9 100644 --- a/modules/ch2-mig/nav.adoc +++ b/modules/ch2-mig/nav.adoc @@ -3,5 +3,6 @@ *** xref:s1-mig-overview-2.adoc[] *** xref:s1-mig-overview-3.adoc[] *** xref:s1-mig-overview-4.adoc[] -*** xref:s1-mig-overview.adoc[] +*** xref:s1-mig-overview-5.adoc[] *** xref:s2-mig-slicing-lab.adoc[] +*** xref:s2-mig-slicing-lab-2.adoc[] diff --git a/modules/ch2-mig/pages/s1-mig-overview-3.adoc b/modules/ch2-mig/pages/s1-mig-overview-3.adoc index 10c0ba1..4d57a16 100644 --- a/modules/ch2-mig/pages/s1-mig-overview-3.adoc +++ b/modules/ch2-mig/pages/s1-mig-overview-3.adoc @@ -355,241 +355,4 @@ nodeSelector: ``` This architecture provides workload placement flexibility while optimizing each pool for its use case. -==== - -== MIG Benefits for MaaS - -For Models-as-a-Service architectures, MIG provides measurable advantages that directly impact platform economics and operational reliability. - -=== Cost Efficiency and ROI - -* **Deploy 7 concurrent small models** on single A100-40GB (vs. 1 with full GPU allocation) -* **Reduce per-model GPU cost from $15,000 to $2,143** (7x cost reduction for `1g.5gb` profiles) -* **Increase cluster-wide GPU utilization from 33% to 78%** (2.4x improvement over exclusive allocation) -* **Achieve ROI break-even 2.4x faster** than full GPU deployments (10 months vs. 24 months baseline) -* **Right-size GPU resources** to actual model requirements, eliminating overprovisioning waste - -**Concrete example**: Platform serving 21 small models previously required 21x A100 GPUs ($315,000 capital). With MIG `1g.5gb` profiles, same workload runs on 3x A100 GPUs ($45,000 capital), saving $270,000. - -=== Performance Isolation and Predictability - -* **Guarantee P99 latency variance <1%** (vs. 15-40% with time-slicing) -* **Hardware-enforced memory isolation** prevents out-of-memory (OOM) crosstalk between workloads -* **Dedicated streaming multiprocessors** eliminate compute contention and throttling -* **Enable SLA-backed inference** with measurable, enforceable performance guarantees -* **Prevent noisy neighbor problems** through physical resource partitioning - -Time-slicing workloads experience latency spikes when concurrent requests arrive. MIG instances maintain consistent latency regardless of neighboring workload activity. 
- -=== Operational Flexibility and Multi-Tenancy - -* **Scale model instances independently** without affecting neighboring services (e.g., scale LLaMA-7B from 1 to 3 replicas without reprovisioning hardware) -* **Mix small and large models** on same physical GPU (e.g., `1g.5gb` microservices alongside `3g.20gb` large language models) -* **Support true multi-tenant deployments** with hardware isolation between teams -* **Assign dedicated MIG instances** to specific tenants or namespaces for guaranteed capacity -* **Enable chargeback and cost allocation** with accurate per-MIG-instance metrics from DCGM - -=== Resource Predictability and Capacity Planning - -* **Guaranteed GPU resources per deployment**: Each InferenceService gets dedicated SMs and VRAM -* **Reduce capacity planning variance from ±40% to ±5%**: MIG instances have predictable, fixed resource allocations -* **Simplify quota management**: Assign 2x `2g.10gb` instances per team, enforceable via Kueue (Chapter 4) -* **Enable accurate per-model billing**: DCGM provides per-MIG-instance utilization metrics for cost tracking -* **Predictable failure domains**: OOM or crash in one MIG instance doesn't affect others on same GPU - -== Production Considerations and Gotchas - -While MIG provides significant benefits, production deployments require careful planning around reconfiguration downtime, workload placement, and ongoing monitoring. - -=== Reconfiguration Downtime Planning - -Unlike time-slicing (which only requires device plugin pod restart), MIG reconfiguration requires GPU hardware reset and workload migration. - -**Downtime Components for Profile Changes**: - -1. **Node cordon**: Immediate (prevents new scheduling) -2. **Node drain**: 5-10 minutes (waiting for workload migration to other nodes) -3. **MIG mode enablement**: 10-15 seconds (GPU hardware reset) -4. **Profile application**: 30-60 seconds (instance creation) -5. **GFD label update**: 30-60 seconds (capability rescan) -6. **Device plugin restart**: 30-60 seconds (resource rediscovery) -7. **Node uncordon + scheduling**: 30-60 seconds - -**Total**: 10-20 minutes per node for profile changes (per OpenShift documentation) - -[WARNING] -==== -**Plan MIG Profile Changes During Maintenance Windows** - -Changing MIG profiles on production nodes serving live traffic causes: - -* Immediate termination of all GPU workloads on that node -* Pod eviction and rescheduling to other nodes -* Temporary capacity reduction during reconfiguration -* Potential cascading failures if spare capacity insufficient - -**For rolling MIG updates across N nodes**: - -* Plan for **N × 20 minutes** total reconfiguration time -* Ensure **spare GPU capacity** exists for pod rescheduling (recommend N+2 redundancy) -* Use node cordoning to control blast radius: `oc adm cordon worker-gpu-0` -* Test profile changes in dev environment first -* Schedule during low-traffic windows (e.g., weekend maintenance) -* Document rollback procedure (revert to `all-disabled`, then previous profile) -==== - -=== Workload Placement Strategies - -MIG instances require **explicit resource requests**. Workloads must request specific MIG profiles when using **mixed** advertisement strategy. 
- -**Correct Resource Request (Mixed Strategy)**: - -[source,yaml] ----- -resources: - limits: - nvidia.com/mig-2g.10gb: 1 # <1> ----- -<1> Request specific MIG profile, matches advertised resource name - -**Common Mistake**: - -[source,yaml] ----- -resources: - limits: - nvidia.com/gpu: 1 # <1> ----- -<1> ❌ Will NOT schedule on MIG-partitioned nodes using mixed strategy - -**Consequences of incorrect requests**: - -* Pod stuck in `Pending` state with `FailedScheduling` event -* Error: "Insufficient nvidia.com/gpu" (even though MIG instances available) -* Requires pod specification update and redeployment - -[TIP] -==== -**Use Admission Controllers for Default MIG Profiles** - -For platforms deploying many inference services, create a `MutatingWebhookConfiguration` that automatically injects appropriate MIG resource requests based on pod annotations or namespace labels. This prevents scheduling failures from incorrect resource requests. - -**Example implementation**: - -* Pods in namespace `small-models` automatically get `nvidia.com/mig-1g.5gb: 1` injected -* Pods with annotation `mig-profile: medium` get `nvidia.com/mig-2g.10gb: 1` -* Pods without annotations remain unchanged (for flexibility) - -This pattern reduces operational toil and prevents common scheduling errors, especially useful for teams with 50+ inference services. -==== - -=== Monitoring MIG Utilization - -DCGM (Data Center GPU Manager, deployed in Chapter 1) provides per-MIG-instance metrics. Monitor these to validate your profile choices and identify optimization opportunities. - -**Key Metrics to Track**: - -* `DCGM_FI_PROF_GR_ENGINE_ACTIVE`: Compute utilization percentage per MIG instance -* `DCGM_FI_DEV_FB_USED`: Memory usage in bytes per MIG instance -* `DCGM_FI_DEV_GPU_TEMP`: Temperature per MIG instance (thermal throttling indicator) -* `DCGM_FI_DEV_POWER_USAGE`: Power consumption per instance - -**Red Flags Indicating Misconfigurations**: - -* **MIG instance averaging >85% memory usage**: Profile too small, workload may OOM, resize to larger profile -* **MIG instance averaging <20% compute utilization**: Profile too large, wasted resources, resize to smaller profile -* **Frequent pod evictions (OOMKilled)**: Memory oversubscription, increase profile memory allocation -* **High scheduling failure rate**: Insufficient MIG capacity for demand, add nodes or adjust profiles - -**Example Prometheus query for MIG memory usage**: - -[source,promql] ----- -DCGM_FI_DEV_FB_USED{GPU_I_ID=~".*"} / - DCGM_FI_DEV_FB_TOTAL{GPU_I_ID=~".*"} * 100 ----- - -You'll build comprehensive MIG monitoring dashboards in Chapter 3. - -[IMPORTANT] -==== -**MIG Instance Naming in Metrics** - -DCGM metrics use `GPU_I_ID` label to distinguish MIG instances: - -* `GPU=0, GPU_I_ID=0`: First MIG instance on GPU 0 -* `GPU=0, GPU_I_ID=1`: Second MIG instance on GPU 0 - -This differs from Kubernetes resource names (`nvidia.com/mig-2g.10gb`). Correlation requires mapping GFD labels (showing profile types) to DCGM metrics (showing instance IDs). - -**Dashboard design tip**: Group metrics by profile type using GFD label joins to show "all 2g.10gb instances" rather than raw instance IDs. 
-==== - -== Hardware Requirements and GPU Model Support - -MIG is available exclusively on NVIDIA Ampere (A-series) and Hopper (H-series) architecture GPUs: - -**Supported GPU Models**: - -* **NVIDIA A30 (24GB)**: Ampere architecture, up to 4 MIG instances (profiles: 1g.6gb, 2g.12gb, 4g.24gb) -* **NVIDIA A100-40GB**: Ampere architecture, up to 7 MIG instances (profiles: 1g.5gb through 7g.40gb) -* **NVIDIA A100-80GB**: Ampere architecture, up to 7 MIG instances (profiles: 1g.10gb through 7g.80gb) -* **NVIDIA H100 (80GB/94GB)**: Hopper architecture, enhanced MIG with up to 7 instances plus confidential computing per-instance support - -**Unsupported GPU Models** (must use time-slicing instead): - -* NVIDIA V100 (Volta architecture) -* NVIDIA T4 (Turing architecture) -* NVIDIA A10, A16, A40 (Ampere, but consumer/workstation GPUs without MIG feature) -* NVIDIA L4, L40 (Ada Lovelace architecture) - -[NOTE] -==== -**H100 Enhancements Over A100** - -H100 GPUs offer additional MIG capabilities beyond A100: - -* **Confidential computing support per MIG instance**: Each instance can run in secure encrypted mode -* **Improved memory bandwidth allocation**: Better isolation between instances -* **Faster reconfiguration**: H100 MIG mode changes complete ~15% faster than A100 - -For most Models-as-a-Service inference workloads, **A100 provides optimal cost/performance balance**. H100 advantages primarily benefit specialized secure computation or extremely memory-bandwidth-intensive workloads. Evaluate H100 premium pricing (~2.5x A100 cost) against actual workload requirements before procurement. -==== - -== What's Next - -In the next lab section, you will apply the concepts from this overview hands-on, transforming your GPU infrastructure from exclusive allocation to multi-tenant MIG-partitioned platform. 
- -**Lab Activities**: - -* **Verify MIG capability** on your A100 GPU nodes using `nvidia-smi` -* **Set MIG advertisement strategy** (mixed) via ClusterPolicy patch -* **Label GPU nodes** with built-in MIG profiles (`all-2g.10gb`, `all-balanced`) -* **Monitor MIG Manager logs** during reconfiguration to observe the 10-20 minute workflow -* **Verify MIG instance allocation** via `oc describe node` showing `nvidia.com/mig-*` resources -* **Create custom mig-parted ConfigMap** for heterogeneous profile combinations (e.g., 1x `3g.20gb` + 2x `2g.10gb`) -* **Deploy test CUDA workloads** requesting specific MIG profiles -* **Validate hardware isolation** by running `nvidia-smi` inside pods to confirm dedicated resource allocation - -**Expected Outcomes**: - -* Single A100-40GB will transform from `nvidia.com/gpu: 1` to `nvidia.com/mig-2g.10gb: 3` allocatable resources -* You'll deploy 3 concurrent CUDA test pods on one physical GPU, each in isolated MIG instances -* You'll observe <1% latency variance between pods running simultaneously (demonstrating isolation) -* You'll create the mixed-inference profile from the multi-tenant scenario (1x `3g.20gb` + 2x `2g.10gb`) -* You'll experience the full MIG reconfiguration workflow including node drain, profile application, and resource verification - -**Skills You'll Develop**: - -* Label-driven MIG configuration workflow -* Troubleshooting MIG Manager using logs and node labels -* Validating MIG instance creation with `nvidia-smi mig -lgi` -* Correlating Kubernetes resource advertisements with physical MIG partitions -* Planning MIG profile changes with minimal production disruption - -This lab transforms your GPU infrastructure from **1 workload per GPU** (full allocation) to **3-7 workloads per GPU** (MIG partitioning) while maintaining production-grade isolation and predictable performance. - -//// -**Maximizing GPU ROI:** -Understand how the Multi-Instance GPU (MIG) feature splits hardware resources into multiple GPU instances, operating completely isolated from each other. Evaluate MIG advertisement strategies: Single (homogeneous) vs. Mixed (heterogeneous) slicing. -//// +==== \ No newline at end of file diff --git a/modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc b/modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc new file mode 100644 index 0000000..980b845 --- /dev/null +++ b/modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc @@ -0,0 +1,225 @@ +:time_estimate: 45 + += Troubleshooting Common Issues + +_Estimated reading time: *{time_estimate} minutes*._ + +=== MIG Configuration Not Applied After 20 Minutes + +**Symptom:** After labeling node, `oc describe node` still shows `nvidia.com/gpu: X` instead of `nvidia.com/mig-*` resources + +**Diagnosis Steps:** + +. Verify node label was applied correctly ++ +[source,bash] +---- +$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}' +all-1g.5gb +---- ++ +If output is empty, the label wasn't applied. Re-apply with `--overwrite` flag. + +. Check MIG Manager logs for errors ++ +[source,bash] +---- +$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50 +---- ++ +Look for: +* `level=error` messages indicating validation failures +* `MIG configuration complete` success message +* `Applying MIG configuration` followed by profile name + +. 
Check MIG configuration state label
++
+[source,bash]
+----
+$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
+success
+----
++
+Possible states:
+* `success`: MIG Manager completed configuration ✅
+* `pending`: Reconfiguration in progress (wait 10-20 min)
+* `failed`: Error occurred, check MIG Manager logs ❌
+
+. Verify GPU supports MIG mode
++
+[source,bash]
+----
+$ oc debug node/worker-gpu-0.example.com
+sh-4.4# chroot /host
+sh-4.4# nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv
+mig.mode.current
+Enabled
+----
++
+If the output still shows `Disabled` after 20 minutes, the GPU may not support MIG or the MIG Manager failed to enable it.
+
+=== Custom ConfigMap Not Recognized
+
+**Symptom:** After creating the custom ConfigMap and patching the ClusterPolicy, the custom profiles are not available
+
+**Solution:**
+
+. Verify ConfigMap exists in correct namespace
++
+[source,bash]
+----
+$ oc get configmap -n nvidia-gpu-operator custom-mig-parted-config
+NAME                       DATA   AGE
+custom-mig-parted-config   1      5m
+----
+
+. Verify ConfigMap is referenced in ClusterPolicy
++
+[source,bash]
+----
+$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.migManager.config.name}'
+custom-mig-parted-config
+----
++
+If the output is empty, the patch didn't apply. Re-run the `oc patch clusterpolicy` command.
+
+. Restart MIG Manager pods to pick up new ConfigMap
++
+[source,bash]
+----
+$ oc delete pods -n nvidia-gpu-operator -l app=nvidia-mig-manager
+pod "nvidia-mig-manager-xxxxx" deleted
+----
++
+The pods recreate automatically and load the new ConfigMap.
+
+=== Pod Scheduling Fails: "Insufficient nvidia.com/mig-X"
+
+**Symptom:** Pod stuck in `Pending` state with event:
+
+----
+0/3 nodes are available: 3 Insufficient nvidia.com/mig-2g.10gb.
+----
+
+**Root Causes and Solutions:**
+
+**Cause 1: Nodes have different MIG profile**
+
+. Check what profiles are available on nodes
++
+[source,bash]
+----
+$ oc get nodes -o custom-columns=NAME:.metadata.name,MIG-1G:.status.allocatable.nvidia\\.com/mig-1g\\.5gb,MIG-2G:.status.allocatable.nvidia\\.com/mig-2g\\.10gb
+NAME                       MIG-1G   MIG-2G
+worker-gpu-0.example.com   7
+worker-gpu-1.example.com            3
+----
++
+**Solution**: Either:
+* Update pod resource request to match available profile (`nvidia.com/mig-1g.5gb: 1`)
+* Relabel node to desired profile (triggers 10-20 min reconfiguration)
+* Add nodeSelector to target specific node with correct profile
+
+**Cause 2: All MIG instances already allocated**
+
+. Check if MIG instances are exhausted
++
+[source,bash]
+----
+$ oc describe node worker-gpu-0.example.com | grep -A 2 "Allocated resources:"
+Allocated resources:
+  nvidia.com/mig-1g.5gb: 7 (100% of capacity)
+----
++
+**Solution**: Either:
+* Wait for running pods to complete and release instances
+* Add more GPU nodes to cluster
+* Scale down lower-priority workloads
+
+**Cause 3: Using single strategy with profile-specific request**
+
+If ClusterPolicy has `mig.strategy: single`, pods must request `nvidia.com/gpu: 1`, NOT `nvidia.com/mig-*`.
+
+. Check MIG strategy
++
+[source,bash]
+----
+$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'
+single
+----
++
+**Solution**: Change pod resource request from `nvidia.com/mig-2g.10gb: 1` to `nvidia.com/gpu: 1`
+
+=== MIG Reconfiguration Takes Longer Than 20 Minutes
+
+**Symptom:** The node was labeled 30+ minutes ago but `mig.config.state` still shows `pending`
+
+**Diagnosis:**
+
+. Check if node has running GPU workloads
++
+[source,bash]
+----
+$ oc get pods --all-namespaces -o wide | grep worker-gpu-0.example.com | grep -E "nvidia.com/(gpu|mig)"
+----
++
+If GPU workloads are running, the MIG Manager cannot reconfigure until they terminate.
+
+. Drain node to evict GPU workloads
++
+[source,bash]
+----
+$ oc adm drain worker-gpu-0.example.com --ignore-daemonsets --delete-emptydir-data
+----
++
+Wait for pod eviction (5-10 minutes), then check `mig.config.state` again.
+
+. Check MIG Manager pod status
++
+[source,bash]
+----
+$ oc get pods -n nvidia-gpu-operator -l app=nvidia-mig-manager
+NAME                       READY   STATUS    RESTARTS   AGE
+nvidia-mig-manager-xxxxx   1/1     Running   3          45m
+----
++
+If `RESTARTS > 0` and increasing, check logs for crash loop:
++
+[source,bash]
+----
+$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --previous
+----
+
+=== Verification: nvidia-smi Shows Different Instance Count Than Expected
+
+**Symptom:** You applied `all-1g.5gb` expecting 7 instances, but `nvidia-smi -L` shows only 3 instances
+
+**Possible Causes:**
+
+1. **Node has A100-80GB instead of A100-40GB**: Profile `all-1g.5gb` doesn't exist for 80GB model
+ - **Solution**: Use `all-1g.10gb` for A100-80GB GPUs
+
+2. **Custom ConfigMap overriding built-in profile**: Your ConfigMap may define `all-1g.5gb` differently
+ - **Solution**: Check ConfigMap content, remove custom definition to use auto-generated profile
+
+3. **MIG reconfiguration incomplete**: Device Plugin hasn't updated yet
+ - **Solution**: Wait full 20 minutes, check `nvidia.com/mig.config.state=success`
+
+== What's next
+
+In the next chapter, you will build a comprehensive observability stack by enabling OpenShift user-workload monitoring, exposing GPU metrics via DCGM Exporter, deploying the Grafana Operator, and creating custom dashboards to visualize **per-MIG-instance** telemetry alongside platform metrics.
+
+**Chapter 3 Preview**: You'll learn to:
+
+* Monitor individual MIG instance utilization (compute %, memory usage)
+* Correlate MIG instance metrics with pod resource requests
+* Create alerts for MIG instance oversubscription or underutilization
+* Visualize the ROI gains from MIG partitioning (utilization improvement from 33% to 78%)
+* Track MIG reconfiguration events and their impact on workload SLAs
+
+The observability stack validates the performance isolation and resource efficiency benefits you configured in this lab.
+
+////
+**MIG Slicing:**
+Update node labels to apply a MIG partitioning profile (e.g., `nvidia.com/mig.config=all-1g.10gb`). Create a custom `mig-parted` config resource file for specific hardware deployments. Verify that the sliced MIG instances are exposed as allocatable resources on the node.
+////
diff --git a/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc b/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc
index 5815fcc..ba91426 100644
--- a/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc
+++ b/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc
@@ -4,6 +4,11 @@
 
 _Estimated reading time: *{time_estimate} minutes*._
 
+[WARNING]
+====
+This lab is a work in progress. The content is still being developed and may contain placeholders or incomplete information. Please refer to the final version for the complete lab experience.
+====
+
 Objective::
 Update node labels to apply MIG partitioning profiles, create custom mig-parted configuration resources for specific hardware deployments, and verify that sliced MIG instances are exposed as allocatable resources on GPU nodes.
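+
+[NOTE]
+====
+As a quick orientation before the detailed steps, the sketch below shows the overall label-and-verify loop this lab walks through. The node name and profile are illustrative placeholders; use the values for your cluster, and treat the detailed steps later in this lab as the authoritative procedure.
+
+[source,bash]
+----
+# Request a MIG layout by labeling the GPU node (the MIG Manager reconciles it)
+$ oc label node worker-gpu-0.example.com nvidia.com/mig.config=all-1g.5gb --overwrite
+
+# Watch the reconfiguration state until it reports success (expect 10-20 minutes)
+$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
+
+# Confirm the sliced MIG instances are advertised as allocatable resources
+$ oc describe node worker-gpu-0.example.com | grep nvidia.com/mig
+----
+====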
@@ -526,226 +531,4 @@ With mixed strategy: All three teams share the same physical A100 GPU with hardware isolation, each getting appropriately-sized resources. This is the platform design from s1's multi-tenant scenario. Reference: xref:s1-mig-overview.adoc#_before_and_after_mig_resource_utilization[Multi-Tenant ROI Table in s1] -==== - -== Troubleshooting Common Issues - -=== MIG Configuration Not Applied After 20 Minutes - -**Symptom:** After labeling node, `oc describe node` still shows `nvidia.com/gpu: X` instead of `nvidia.com/mig-*` resources - -**Diagnosis Steps:** - -. Verify node label was applied correctly -+ -[source,bash] ----- -$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}' -all-1g.5gb ----- -+ -If output is empty, the label wasn't applied. Re-apply with `--overwrite` flag. - -. Check MIG Manager logs for errors -+ -[source,bash] ----- -$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50 ----- -+ -Look for: -* `level=error` messages indicating validation failures -* `MIG configuration complete` success message -* `Applying MIG configuration` followed by profile name - -. Check MIG configuration state label -+ -[source,bash] ----- -$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}' -success ----- -+ -Possible states: -* `success`: MIG Manager completed configuration ✅ -* `pending`: Reconfiguration in progress (wait 10-20 min) -* `failed`: Error occurred, check MIG Manager logs ❌ - -. Verify GPU supports MIG mode -+ -[source,bash] ----- -$ oc debug node/worker-gpu-0.example.com -sh-4.4# chroot /host -sh-4.4# nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv -mig.mode.current -Enabled ----- -+ -If shows `Disabled` after 20 minutes, GPU may not support MIG or MIG Manager failed to enable it. - -=== Custom ConfigMap Not Recognized - -**Symptom:** After creating custom ConfigMap and patching ClusterPolicy, custom profiles not available - -**Solution:** - -. Verify ConfigMap exists in correct namespace -+ -[source,bash] ----- -$ oc get configmap -n nvidia-gpu-operator custom-mig-parted-config -NAME DATA AGE -custom-mig-parted-config 1 5m ----- - -. Verify ConfigMap is referenced in ClusterPolicy -+ -[source,bash] ----- -$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.migManager.config.name}' -custom-mig-parted-config ----- -+ -If output is empty, the patch didn't apply. Re-run `oc patch clusterpolicy` command. - -. Restart MIG Manager pods to pick up new ConfigMap -+ -[source,bash] ----- -$ oc delete pods -n nvidia-gpu-operator -l app=nvidia-mig-manager -pod "nvidia-mig-manager-xxxxx" deleted ----- -+ -Pods will automatically recreate and load new ConfigMap. - -=== Pod Scheduling Fails: "Insufficient nvidia.com/mig-X" - -**Symptom:** Pod stuck in `Pending` state with event: - ----- -0/3 nodes are available: 3 Insufficient nvidia.com/mig-2g.10gb. ----- - -**Root Causes and Solutions:** - -**Cause 1: Nodes have different MIG profile** - -. 
Check what profiles are available on nodes -+ -[source,bash] ----- -$ oc get nodes -o custom-columns=NAME:.metadata.name,MIG-1G:.status.allocatable.nvidia\\.com/mig-1g\\.5gb,MIG-2G:.status.allocatable.nvidia\\.com/mig-2g\\.10gb -NAME MIG-1G MIG-2G -worker-gpu-0.example.com 7 -worker-gpu-1.example.com 3 ----- -+ -**Solution**: Either: -* Update pod resource request to match available profile (`nvidia.com/mig-1g.5gb: 1`) -* Relabel node to desired profile (triggers 10-20 min reconfiguration) -* Add nodeSelector to target specific node with correct profile - -**Cause 2: All MIG instances already allocated** - -. Check if MIG instances are exhausted -+ -[source,bash] ----- -$ oc describe node worker-gpu-0.example.com | grep -A 2 "Allocated resources:" -Allocated resources: - nvidia.com/mig-1g.5gb: 7 (100% of capacity) ----- -+ -**Solution**: Either: -* Wait for running pods to complete and release instances -* Add more GPU nodes to cluster -* Scale down lower-priority workloads - -**Cause 3: Using single strategy with profile-specific request** - -If ClusterPolicy has `mig.strategy: single`, pods must request `nvidia.com/gpu: 1`, NOT `nvidia.com/mig-*`. - -. Check MIG strategy -+ -[source,bash] ----- -$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}' -single ----- -+ -**Solution**: Change pod resource request from `nvidia.com/mig-2g.10gb: 1` to `nvidia.com/gpu: 1` - -=== MIG Reconfiguration Takes Longer Than 20 Minutes - -**Symptom:** Node labeled 30+ minutes ago but `mig.config.state` still shows `pending` - -**Diagnosis:** - -. Check if node has running GPU workloads -+ -[source,bash] ----- -$ oc get pods --all-namespaces -o wide | grep worker-gpu-0.example.com | grep -E "nvidia.com/(gpu|mig)" ----- -+ -If GPU workloads are running, MIG Manager cannot reconfigure until they terminate. - -. Drain node to evict GPU workloads -+ -[source,bash] ----- -$ oc adm drain worker-gpu-0.example.com --ignore-daemonsets --delete-emptydir-data ----- -+ -Wait for pod eviction (5-10 minutes), then check `mig.config.state` again. - -. Check MIG Manager pod status -+ -[source,bash] ----- -$ oc get pods -n nvidia-gpu-operator -l app=nvidia-mig-manager -NAME READY STATUS RESTARTS AGE -nvidia-mig-manager-xxxxx 1/1 Running 3 45m ----- -+ -If `RESTARTS > 0` and increasing, check logs for crash loop: -+ -[source,bash] ----- -$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --previous ----- - -=== Verification: nvidia-smi Shows Different Instance Count Than Expected - -**Symptom:** Applied `all-1g.5gb` expecting 7 instances but `nvidia-smi -L` shows 3 instances - -**Possible Causes:** - -1. **Node has A100-80GB instead of A100-40GB**: Profile `all-1g.5gb` doesn't exist for 80GB model - - **Solution**: Use `all-1g.10gb` for A100-80GB GPUs - -2. **Custom ConfigMap overriding built-in profile**: Your ConfigMap may define `all-1g.5gb` differently - - **Solution**: Check ConfigMap content, remove custom definition to use auto-generated profile - -3. **MIG reconfiguration incomplete**: Device Plugin hasn't updated yet - - **Solution**: Wait full 20 minutes, check `nvidia.com/mig.config.state=success` - -== What's next - -In the next chapter, you will build a comprehensive observability stack by enabling OpenShift user-workload monitoring, exposing GPU metrics via DCGM Exporter, deploying the Grafana Operator, and creating custom dashboards to visualize **per-MIG-instance** telemetry alongside platform metrics. 
-
-**Chapter 3 Preview**: You'll learn to:
-
-* Monitor individual MIG instance utilization (compute %, memory usage)
-* Correlate MIG instance metrics with pod resource requests
-* Create alerts for MIG instance oversubscription or underutilization
-* Visualize the ROI gains from MIG partitioning (utilization improvement from 33% to 78%)
-* Track MIG reconfiguration events and their impact on workload SLAs
-
-The observability stack validates the performance isolation and resource efficiency benefits you configured in this lab.
-
-////
-**MIG Slicing:**
-Update node labels to apply a MIG partitioning profile (e.g., `nvidia.com/mig.config=all-1g.10gb`). Create a custom `mig-parted` config resource file for specific hardware deployments. Verify that the sliced MIG instances are exposed as allocatable resources on the node.
-////
+====
\ No newline at end of file
diff --git a/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc b/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc
index 35a4133..3c2cb37 100644
--- a/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc
+++ b/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc
@@ -4,6 +4,11 @@
 
 _Estimated reading time: *{time_estimate} minutes*._
 
+[WARNING]
+====
+This lab is a work in progress. The content is still being developed and may contain placeholders or incomplete information. Please refer to the final version for the complete lab experience.
+====
+
 Objective::
 Enable OpenShift user-workload monitoring by modifying the cluster-monitoring-config ConfigMap, and label the GPU Operator namespace to expose GPU telemetry for Prometheus scraping via the NVIDIA DCGM Exporter.
diff --git a/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc b/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc
index f05dd0c..61bc837 100644
--- a/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc
+++ b/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc
@@ -4,6 +4,11 @@
 
 _Estimated reading time: *{time_estimate} minutes*._
 
+[WARNING]
+====
+This lab is a work in progress. The content is still being developed and may contain placeholders or incomplete information. Please refer to the final version for the complete lab experience.
+====
+
 Objective::
 Install the Grafana Operator from OperatorHub, create a Grafana instance Custom Resource, and set up secure access to cluster metrics by creating a ServiceAccount with cluster-monitoring-view permissions and configuring a GrafanaDatasource pointing to the Thanos Querier endpoint.
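+
+[NOTE]
+====
+A minimal sketch of the metrics-access portion of this objective, assuming an illustrative `grafana` namespace and a ServiceAccount named `grafana-sa` (both hypothetical names). The detailed lab steps on this page remain the authoritative procedure, and `oc create token` requires a recent OpenShift release.
+
+[source,bash]
+----
+# ServiceAccount that Grafana uses to read cluster metrics
+$ oc create serviceaccount grafana-sa -n grafana
+
+# Grant read-only access to platform and user-workload metrics
+$ oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-sa -n grafana
+
+# Short-lived bearer token to reference from the GrafanaDatasource
+$ oc create token grafana-sa -n grafana
+
+# Thanos Querier endpoint the GrafanaDatasource points to:
+# https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
+----
+====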