From 9fa6f44d8bdedc18bbb66eabe09f831704c12374 Mon Sep 17 00:00:00 2001 From: Karlos K <168231563+kknoxrht@users.noreply.github.com> Date: Mon, 13 Apr 2026 18:43:16 -0500 Subject: [PATCH] mig page updates --- modules/ch2-mig/nav.adoc | 3 +- modules/ch2-mig/pages/s1-mig-overview-3.adoc | 239 +----------------- .../ch2-mig/pages/s2-mig-slicing-lab-2.adoc | 225 +++++++++++++++++ modules/ch2-mig/pages/s2-mig-slicing-lab.adoc | 229 +---------------- .../pages/s2-expose-metrics-lab.adoc | 5 + .../pages/s3-grafana-setup-lab.adoc | 5 + 6 files changed, 244 insertions(+), 462 deletions(-) create mode 100644 modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc diff --git a/modules/ch2-mig/nav.adoc b/modules/ch2-mig/nav.adoc index a3d4219..65314c9 100644 --- a/modules/ch2-mig/nav.adoc +++ b/modules/ch2-mig/nav.adoc @@ -3,5 +3,6 @@ *** xref:s1-mig-overview-2.adoc[] *** xref:s1-mig-overview-3.adoc[] *** xref:s1-mig-overview-4.adoc[] -*** xref:s1-mig-overview.adoc[] +*** xref:s1-mig-overview-5.adoc[] *** xref:s2-mig-slicing-lab.adoc[] +*** xref:s2-mig-slicing-lab-2.adoc[] diff --git a/modules/ch2-mig/pages/s1-mig-overview-3.adoc b/modules/ch2-mig/pages/s1-mig-overview-3.adoc index 10c0ba1..4d57a16 100644 --- a/modules/ch2-mig/pages/s1-mig-overview-3.adoc +++ b/modules/ch2-mig/pages/s1-mig-overview-3.adoc @@ -355,241 +355,4 @@ nodeSelector: ``` This architecture provides workload placement flexibility while optimizing each pool for its use case. -==== - -== MIG Benefits for MaaS - -For Models-as-a-Service architectures, MIG provides measurable advantages that directly impact platform economics and operational reliability. - -=== Cost Efficiency and ROI - -* **Deploy 7 concurrent small models** on single A100-40GB (vs. 1 with full GPU allocation) -* **Reduce per-model GPU cost from $15,000 to $2,143** (7x cost reduction for `1g.5gb` profiles) -* **Increase cluster-wide GPU utilization from 33% to 78%** (2.4x improvement over exclusive allocation) -* **Achieve ROI break-even 2.4x faster** than full GPU deployments (10 months vs. 24 months baseline) -* **Right-size GPU resources** to actual model requirements, eliminating overprovisioning waste - -**Concrete example**: Platform serving 21 small models previously required 21x A100 GPUs ($315,000 capital). With MIG `1g.5gb` profiles, same workload runs on 3x A100 GPUs ($45,000 capital), saving $270,000. - -=== Performance Isolation and Predictability - -* **Guarantee P99 latency variance <1%** (vs. 15-40% with time-slicing) -* **Hardware-enforced memory isolation** prevents out-of-memory (OOM) crosstalk between workloads -* **Dedicated streaming multiprocessors** eliminate compute contention and throttling -* **Enable SLA-backed inference** with measurable, enforceable performance guarantees -* **Prevent noisy neighbor problems** through physical resource partitioning - -Time-slicing workloads experience latency spikes when concurrent requests arrive. MIG instances maintain consistent latency regardless of neighboring workload activity. 
- -=== Operational Flexibility and Multi-Tenancy - -* **Scale model instances independently** without affecting neighboring services (e.g., scale LLaMA-7B from 1 to 3 replicas without reprovisioning hardware) -* **Mix small and large models** on same physical GPU (e.g., `1g.5gb` microservices alongside `3g.20gb` large language models) -* **Support true multi-tenant deployments** with hardware isolation between teams -* **Assign dedicated MIG instances** to specific tenants or namespaces for guaranteed capacity -* **Enable chargeback and cost allocation** with accurate per-MIG-instance metrics from DCGM - -=== Resource Predictability and Capacity Planning - -* **Guaranteed GPU resources per deployment**: Each InferenceService gets dedicated SMs and VRAM -* **Reduce capacity planning variance from ±40% to ±5%**: MIG instances have predictable, fixed resource allocations -* **Simplify quota management**: Assign 2x `2g.10gb` instances per team, enforceable via Kueue (Chapter 4) -* **Enable accurate per-model billing**: DCGM provides per-MIG-instance utilization metrics for cost tracking -* **Predictable failure domains**: OOM or crash in one MIG instance doesn't affect others on same GPU - -== Production Considerations and Gotchas - -While MIG provides significant benefits, production deployments require careful planning around reconfiguration downtime, workload placement, and ongoing monitoring. - -=== Reconfiguration Downtime Planning - -Unlike time-slicing (which only requires device plugin pod restart), MIG reconfiguration requires GPU hardware reset and workload migration. - -**Downtime Components for Profile Changes**: - -1. **Node cordon**: Immediate (prevents new scheduling) -2. **Node drain**: 5-10 minutes (waiting for workload migration to other nodes) -3. **MIG mode enablement**: 10-15 seconds (GPU hardware reset) -4. **Profile application**: 30-60 seconds (instance creation) -5. **GFD label update**: 30-60 seconds (capability rescan) -6. **Device plugin restart**: 30-60 seconds (resource rediscovery) -7. **Node uncordon + scheduling**: 30-60 seconds - -**Total**: 10-20 minutes per node for profile changes (per OpenShift documentation) - -[WARNING] -==== -**Plan MIG Profile Changes During Maintenance Windows** - -Changing MIG profiles on production nodes serving live traffic causes: - -* Immediate termination of all GPU workloads on that node -* Pod eviction and rescheduling to other nodes -* Temporary capacity reduction during reconfiguration -* Potential cascading failures if spare capacity insufficient - -**For rolling MIG updates across N nodes**: - -* Plan for **N × 20 minutes** total reconfiguration time -* Ensure **spare GPU capacity** exists for pod rescheduling (recommend N+2 redundancy) -* Use node cordoning to control blast radius: `oc adm cordon worker-gpu-0` -* Test profile changes in dev environment first -* Schedule during low-traffic windows (e.g., weekend maintenance) -* Document rollback procedure (revert to `all-disabled`, then previous profile) -==== - -=== Workload Placement Strategies - -MIG instances require **explicit resource requests**. Workloads must request specific MIG profiles when using **mixed** advertisement strategy. 
- -**Correct Resource Request (Mixed Strategy)**: - -[source,yaml] ----- -resources: - limits: - nvidia.com/mig-2g.10gb: 1 # <1> ----- -<1> Request specific MIG profile, matches advertised resource name - -**Common Mistake**: - -[source,yaml] ----- -resources: - limits: - nvidia.com/gpu: 1 # <1> ----- -<1> ❌ Will NOT schedule on MIG-partitioned nodes using mixed strategy - -**Consequences of incorrect requests**: - -* Pod stuck in `Pending` state with `FailedScheduling` event -* Error: "Insufficient nvidia.com/gpu" (even though MIG instances available) -* Requires pod specification update and redeployment - -[TIP] -==== -**Use Admission Controllers for Default MIG Profiles** - -For platforms deploying many inference services, create a `MutatingWebhookConfiguration` that automatically injects appropriate MIG resource requests based on pod annotations or namespace labels. This prevents scheduling failures from incorrect resource requests. - -**Example implementation**: - -* Pods in namespace `small-models` automatically get `nvidia.com/mig-1g.5gb: 1` injected -* Pods with annotation `mig-profile: medium` get `nvidia.com/mig-2g.10gb: 1` -* Pods without annotations remain unchanged (for flexibility) - -This pattern reduces operational toil and prevents common scheduling errors, especially useful for teams with 50+ inference services. -==== - -=== Monitoring MIG Utilization - -DCGM (Data Center GPU Manager, deployed in Chapter 1) provides per-MIG-instance metrics. Monitor these to validate your profile choices and identify optimization opportunities. - -**Key Metrics to Track**: - -* `DCGM_FI_PROF_GR_ENGINE_ACTIVE`: Compute utilization percentage per MIG instance -* `DCGM_FI_DEV_FB_USED`: Memory usage in bytes per MIG instance -* `DCGM_FI_DEV_GPU_TEMP`: Temperature per MIG instance (thermal throttling indicator) -* `DCGM_FI_DEV_POWER_USAGE`: Power consumption per instance - -**Red Flags Indicating Misconfigurations**: - -* **MIG instance averaging >85% memory usage**: Profile too small, workload may OOM, resize to larger profile -* **MIG instance averaging <20% compute utilization**: Profile too large, wasted resources, resize to smaller profile -* **Frequent pod evictions (OOMKilled)**: Memory oversubscription, increase profile memory allocation -* **High scheduling failure rate**: Insufficient MIG capacity for demand, add nodes or adjust profiles - -**Example Prometheus query for MIG memory usage**: - -[source,promql] ----- -DCGM_FI_DEV_FB_USED{GPU_I_ID=~".*"} / - DCGM_FI_DEV_FB_TOTAL{GPU_I_ID=~".*"} * 100 ----- - -You'll build comprehensive MIG monitoring dashboards in Chapter 3. - -[IMPORTANT] -==== -**MIG Instance Naming in Metrics** - -DCGM metrics use `GPU_I_ID` label to distinguish MIG instances: - -* `GPU=0, GPU_I_ID=0`: First MIG instance on GPU 0 -* `GPU=0, GPU_I_ID=1`: Second MIG instance on GPU 0 - -This differs from Kubernetes resource names (`nvidia.com/mig-2g.10gb`). Correlation requires mapping GFD labels (showing profile types) to DCGM metrics (showing instance IDs). - -**Dashboard design tip**: Group metrics by profile type using GFD label joins to show "all 2g.10gb instances" rather than raw instance IDs. 
-==== - -== Hardware Requirements and GPU Model Support - -MIG is available exclusively on NVIDIA Ampere (A-series) and Hopper (H-series) architecture GPUs: - -**Supported GPU Models**: - -* **NVIDIA A30 (24GB)**: Ampere architecture, up to 4 MIG instances (profiles: 1g.6gb, 2g.12gb, 4g.24gb) -* **NVIDIA A100-40GB**: Ampere architecture, up to 7 MIG instances (profiles: 1g.5gb through 7g.40gb) -* **NVIDIA A100-80GB**: Ampere architecture, up to 7 MIG instances (profiles: 1g.10gb through 7g.80gb) -* **NVIDIA H100 (80GB/94GB)**: Hopper architecture, enhanced MIG with up to 7 instances plus confidential computing per-instance support - -**Unsupported GPU Models** (must use time-slicing instead): - -* NVIDIA V100 (Volta architecture) -* NVIDIA T4 (Turing architecture) -* NVIDIA A10, A16, A40 (Ampere, but consumer/workstation GPUs without MIG feature) -* NVIDIA L4, L40 (Ada Lovelace architecture) - -[NOTE] -==== -**H100 Enhancements Over A100** - -H100 GPUs offer additional MIG capabilities beyond A100: - -* **Confidential computing support per MIG instance**: Each instance can run in secure encrypted mode -* **Improved memory bandwidth allocation**: Better isolation between instances -* **Faster reconfiguration**: H100 MIG mode changes complete ~15% faster than A100 - -For most Models-as-a-Service inference workloads, **A100 provides optimal cost/performance balance**. H100 advantages primarily benefit specialized secure computation or extremely memory-bandwidth-intensive workloads. Evaluate H100 premium pricing (~2.5x A100 cost) against actual workload requirements before procurement. -==== - -== What's Next - -In the next lab section, you will apply the concepts from this overview hands-on, transforming your GPU infrastructure from exclusive allocation to multi-tenant MIG-partitioned platform. 
- -**Lab Activities**: - -* **Verify MIG capability** on your A100 GPU nodes using `nvidia-smi` -* **Set MIG advertisement strategy** (mixed) via ClusterPolicy patch -* **Label GPU nodes** with built-in MIG profiles (`all-2g.10gb`, `all-balanced`) -* **Monitor MIG Manager logs** during reconfiguration to observe the 10-20 minute workflow -* **Verify MIG instance allocation** via `oc describe node` showing `nvidia.com/mig-*` resources -* **Create custom mig-parted ConfigMap** for heterogeneous profile combinations (e.g., 1x `3g.20gb` + 2x `2g.10gb`) -* **Deploy test CUDA workloads** requesting specific MIG profiles -* **Validate hardware isolation** by running `nvidia-smi` inside pods to confirm dedicated resource allocation - -**Expected Outcomes**: - -* Single A100-40GB will transform from `nvidia.com/gpu: 1` to `nvidia.com/mig-2g.10gb: 3` allocatable resources -* You'll deploy 3 concurrent CUDA test pods on one physical GPU, each in isolated MIG instances -* You'll observe <1% latency variance between pods running simultaneously (demonstrating isolation) -* You'll create the mixed-inference profile from the multi-tenant scenario (1x `3g.20gb` + 2x `2g.10gb`) -* You'll experience the full MIG reconfiguration workflow including node drain, profile application, and resource verification - -**Skills You'll Develop**: - -* Label-driven MIG configuration workflow -* Troubleshooting MIG Manager using logs and node labels -* Validating MIG instance creation with `nvidia-smi mig -lgi` -* Correlating Kubernetes resource advertisements with physical MIG partitions -* Planning MIG profile changes with minimal production disruption - -This lab transforms your GPU infrastructure from **1 workload per GPU** (full allocation) to **3-7 workloads per GPU** (MIG partitioning) while maintaining production-grade isolation and predictable performance. - -//// -**Maximizing GPU ROI:** -Understand how the Multi-Instance GPU (MIG) feature splits hardware resources into multiple GPU instances, operating completely isolated from each other. Evaluate MIG advertisement strategies: Single (homogeneous) vs. Mixed (heterogeneous) slicing. -//// +==== \ No newline at end of file diff --git a/modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc b/modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc new file mode 100644 index 0000000..980b845 --- /dev/null +++ b/modules/ch2-mig/pages/s2-mig-slicing-lab-2.adoc @@ -0,0 +1,225 @@ +:time_estimate: 45 + += Troubleshooting Common Issues + +_Estimated reading time: *{time_estimate} minutes*._ + +=== MIG Configuration Not Applied After 20 Minutes + +**Symptom:** After labeling node, `oc describe node` still shows `nvidia.com/gpu: X` instead of `nvidia.com/mig-*` resources + +**Diagnosis Steps:** + +. Verify node label was applied correctly ++ +[source,bash] +---- +$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}' +all-1g.5gb +---- ++ +If output is empty, the label wasn't applied. Re-apply with `--overwrite` flag. + +. Check MIG Manager logs for errors ++ +[source,bash] +---- +$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50 +---- ++ +Look for: +* `level=error` messages indicating validation failures +* `MIG configuration complete` success message +* `Applying MIG configuration` followed by profile name + +. 
Check MIG configuration state label
++
+[source,bash]
+----
+$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
+success
+----
++
+Possible states:
+* `success`: MIG Manager completed configuration ✅
+* `pending`: Reconfiguration in progress (wait 10-20 min)
+* `failed`: Error occurred, check MIG Manager logs ❌
+
+. Verify GPU supports MIG mode
++
+[source,bash]
+----
+$ oc debug node/worker-gpu-0.example.com
+sh-4.4# chroot /host
+sh-4.4# nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv
+mig.mode.current
+Enabled
+----
++
+If the output still shows `Disabled` after 20 minutes, the GPU may not support MIG or the MIG Manager failed to enable it.
+
+=== Custom ConfigMap Not Recognized
+
+**Symptom:** After creating the custom ConfigMap and patching the ClusterPolicy, the custom profiles are not available
+
+**Solution:**
+
+. Verify ConfigMap exists in correct namespace
++
+[source,bash]
+----
+$ oc get configmap -n nvidia-gpu-operator custom-mig-parted-config
+NAME                       DATA   AGE
+custom-mig-parted-config   1      5m
+----
+
+. Verify ConfigMap is referenced in ClusterPolicy
++
+[source,bash]
+----
+$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.migManager.config.name}'
+custom-mig-parted-config
+----
++
+If the output is empty, the patch didn't apply. Re-run the `oc patch clusterpolicy` command.
+
+. Restart MIG Manager pods to pick up new ConfigMap
++
+[source,bash]
+----
+$ oc delete pods -n nvidia-gpu-operator -l app=nvidia-mig-manager
+pod "nvidia-mig-manager-xxxxx" deleted
+----
++
+The pods recreate automatically and load the new ConfigMap.
+
+=== Pod Scheduling Fails: "Insufficient nvidia.com/mig-X"
+
+**Symptom:** Pod stuck in `Pending` state with event:
+
+----
+0/3 nodes are available: 3 Insufficient nvidia.com/mig-2g.10gb.
+----
+
+**Root Causes and Solutions:**
+
+**Cause 1: Nodes have different MIG profile**
+
+. Check what profiles are available on nodes
++
+[source,bash]
+----
+$ oc get nodes -o custom-columns=NAME:.metadata.name,MIG-1G:.status.allocatable.nvidia\\.com/mig-1g\\.5gb,MIG-2G:.status.allocatable.nvidia\\.com/mig-2g\\.10gb
+NAME                       MIG-1G   MIG-2G
+worker-gpu-0.example.com   7
+worker-gpu-1.example.com            3
+----
++
+**Solution**: Either:
+* Update pod resource request to match available profile (`nvidia.com/mig-1g.5gb: 1`)
+* Relabel node to desired profile (triggers 10-20 min reconfiguration)
+* Add nodeSelector to target specific node with correct profile
+
+**Cause 2: All MIG instances already allocated**
+
+. Check if MIG instances are exhausted
++
+[source,bash]
+----
+$ oc describe node worker-gpu-0.example.com | grep -A 2 "Allocated resources:"
+Allocated resources:
+  nvidia.com/mig-1g.5gb: 7 (100% of capacity)
+----
++
+**Solution**: Either:
+* Wait for running pods to complete and release instances
+* Add more GPU nodes to cluster
+* Scale down lower-priority workloads
+
+**Cause 3: Using single strategy with profile-specific request**
+
+If ClusterPolicy has `mig.strategy: single`, pods must request `nvidia.com/gpu: 1`, NOT `nvidia.com/mig-*`.
+
+. Check MIG strategy
++
+[source,bash]
+----
+$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}'
+single
+----
++
+**Solution**: Change pod resource request from `nvidia.com/mig-2g.10gb: 1` to `nvidia.com/gpu: 1`
+
+=== MIG Reconfiguration Takes Longer Than 20 Minutes
+
+**Symptom:** The node was labeled 30+ minutes ago but `mig.config.state` still shows `pending`
+
+**Diagnosis:**
+
+. Check if node has running GPU workloads
++
+[source,bash]
+----
+$ oc get pods --all-namespaces -o wide | grep worker-gpu-0.example.com | grep -E "nvidia.com/(gpu|mig)"
+----
++
+If GPU workloads are running, the MIG Manager cannot reconfigure until they terminate.
+
+. Drain node to evict GPU workloads
++
+[source,bash]
+----
+$ oc adm drain worker-gpu-0.example.com --ignore-daemonsets --delete-emptydir-data
+----
++
+Wait for pod eviction (5-10 minutes), then check `mig.config.state` again.
+
+. Check MIG Manager pod status
++
+[source,bash]
+----
+$ oc get pods -n nvidia-gpu-operator -l app=nvidia-mig-manager
+NAME                       READY   STATUS    RESTARTS   AGE
+nvidia-mig-manager-xxxxx   1/1     Running   3          45m
+----
++
+If `RESTARTS > 0` and increasing, check logs for crash loop:
++
+[source,bash]
+----
+$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --previous
+----
+
+=== Verification: nvidia-smi Shows Different Instance Count Than Expected
+
+**Symptom:** You applied `all-1g.5gb` expecting 7 instances, but `nvidia-smi -L` shows only 3 instances
+
+**Possible Causes:**
+
+1. **Node has A100-80GB instead of A100-40GB**: Profile `all-1g.5gb` doesn't exist for 80GB model
+ - **Solution**: Use `all-1g.10gb` for A100-80GB GPUs
+
+2. **Custom ConfigMap overriding built-in profile**: Your ConfigMap may define `all-1g.5gb` differently
+ - **Solution**: Check ConfigMap content, remove custom definition to use auto-generated profile
+
+3. **MIG reconfiguration incomplete**: Device Plugin hasn't updated yet
+ - **Solution**: Wait full 20 minutes, check `nvidia.com/mig.config.state=success`
+
+== What's next
+
+In the next chapter, you will build a comprehensive observability stack by enabling OpenShift user-workload monitoring, exposing GPU metrics via DCGM Exporter, deploying the Grafana Operator, and creating custom dashboards to visualize **per-MIG-instance** telemetry alongside platform metrics.
+
+**Chapter 3 Preview**: You'll learn to:
+
+* Monitor individual MIG instance utilization (compute %, memory usage)
+* Correlate MIG instance metrics with pod resource requests
+* Create alerts for MIG instance oversubscription or underutilization
+* Visualize the ROI gains from MIG partitioning (utilization improvement from 33% to 78%)
+* Track MIG reconfiguration events and their impact on workload SLAs
+
+The observability stack validates the performance isolation and resource efficiency benefits you configured in this lab.
+
+////
+**MIG Slicing:**
+Update node labels to apply a MIG partitioning profile (e.g., `nvidia.com/mig.config=all-1g.10gb`). Create a custom `mig-parted` config resource file for specific hardware deployments. Verify that the sliced MIG instances are exposed as allocatable resources on the node.
+////
diff --git a/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc b/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc
index 5815fcc..ba91426 100644
--- a/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc
+++ b/modules/ch2-mig/pages/s2-mig-slicing-lab.adoc
@@ -4,6 +4,11 @@
 
 _Estimated reading time: *{time_estimate} minutes*._
 
+[WARNING]
+====
+This lab is a work in progress. The content is still being developed and may contain placeholders or incomplete information. Please refer to the final version for the complete lab experience.
+====
+
 Objective::
 Update node labels to apply MIG partitioning profiles, create custom mig-parted configuration resources for specific hardware deployments, and verify that sliced MIG instances are exposed as allocatable resources on GPU nodes.
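+
+[NOTE]
+====
+As a quick orientation before the detailed steps, the sketch below shows the overall label-and-verify loop this lab walks through. The node name and profile are illustrative placeholders; use the values for your cluster, and treat the detailed steps later in this lab as the authoritative procedure.
+
+[source,bash]
+----
+# Request a MIG layout by labeling the GPU node (the MIG Manager reconciles it)
+$ oc label node worker-gpu-0.example.com nvidia.com/mig.config=all-1g.5gb --overwrite
+
+# Watch the reconfiguration state until it reports success (expect 10-20 minutes)
+$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
+
+# Confirm the sliced MIG instances are advertised as allocatable resources
+$ oc describe node worker-gpu-0.example.com | grep nvidia.com/mig
+----
+====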
@@ -526,226 +531,4 @@ With mixed strategy: All three teams share the same physical A100 GPU with hardware isolation, each getting appropriately-sized resources. This is the platform design from s1's multi-tenant scenario. Reference: xref:s1-mig-overview.adoc#_before_and_after_mig_resource_utilization[Multi-Tenant ROI Table in s1] -==== - -== Troubleshooting Common Issues - -=== MIG Configuration Not Applied After 20 Minutes - -**Symptom:** After labeling node, `oc describe node` still shows `nvidia.com/gpu: X` instead of `nvidia.com/mig-*` resources - -**Diagnosis Steps:** - -. Verify node label was applied correctly -+ -[source,bash] ----- -$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}' -all-1g.5gb ----- -+ -If output is empty, the label wasn't applied. Re-apply with `--overwrite` flag. - -. Check MIG Manager logs for errors -+ -[source,bash] ----- -$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50 ----- -+ -Look for: -* `level=error` messages indicating validation failures -* `MIG configuration complete` success message -* `Applying MIG configuration` followed by profile name - -. Check MIG configuration state label -+ -[source,bash] ----- -$ oc get node worker-gpu-0.example.com -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}' -success ----- -+ -Possible states: -* `success`: MIG Manager completed configuration ✅ -* `pending`: Reconfiguration in progress (wait 10-20 min) -* `failed`: Error occurred, check MIG Manager logs ❌ - -. Verify GPU supports MIG mode -+ -[source,bash] ----- -$ oc debug node/worker-gpu-0.example.com -sh-4.4# chroot /host -sh-4.4# nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv -mig.mode.current -Enabled ----- -+ -If shows `Disabled` after 20 minutes, GPU may not support MIG or MIG Manager failed to enable it. - -=== Custom ConfigMap Not Recognized - -**Symptom:** After creating custom ConfigMap and patching ClusterPolicy, custom profiles not available - -**Solution:** - -. Verify ConfigMap exists in correct namespace -+ -[source,bash] ----- -$ oc get configmap -n nvidia-gpu-operator custom-mig-parted-config -NAME DATA AGE -custom-mig-parted-config 1 5m ----- - -. Verify ConfigMap is referenced in ClusterPolicy -+ -[source,bash] ----- -$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.migManager.config.name}' -custom-mig-parted-config ----- -+ -If output is empty, the patch didn't apply. Re-run `oc patch clusterpolicy` command. - -. Restart MIG Manager pods to pick up new ConfigMap -+ -[source,bash] ----- -$ oc delete pods -n nvidia-gpu-operator -l app=nvidia-mig-manager -pod "nvidia-mig-manager-xxxxx" deleted ----- -+ -Pods will automatically recreate and load new ConfigMap. - -=== Pod Scheduling Fails: "Insufficient nvidia.com/mig-X" - -**Symptom:** Pod stuck in `Pending` state with event: - ----- -0/3 nodes are available: 3 Insufficient nvidia.com/mig-2g.10gb. ----- - -**Root Causes and Solutions:** - -**Cause 1: Nodes have different MIG profile** - -. 
Check what profiles are available on nodes -+ -[source,bash] ----- -$ oc get nodes -o custom-columns=NAME:.metadata.name,MIG-1G:.status.allocatable.nvidia\\.com/mig-1g\\.5gb,MIG-2G:.status.allocatable.nvidia\\.com/mig-2g\\.10gb -NAME MIG-1G MIG-2G -worker-gpu-0.example.com 7 -worker-gpu-1.example.com 3 ----- -+ -**Solution**: Either: -* Update pod resource request to match available profile (`nvidia.com/mig-1g.5gb: 1`) -* Relabel node to desired profile (triggers 10-20 min reconfiguration) -* Add nodeSelector to target specific node with correct profile - -**Cause 2: All MIG instances already allocated** - -. Check if MIG instances are exhausted -+ -[source,bash] ----- -$ oc describe node worker-gpu-0.example.com | grep -A 2 "Allocated resources:" -Allocated resources: - nvidia.com/mig-1g.5gb: 7 (100% of capacity) ----- -+ -**Solution**: Either: -* Wait for running pods to complete and release instances -* Add more GPU nodes to cluster -* Scale down lower-priority workloads - -**Cause 3: Using single strategy with profile-specific request** - -If ClusterPolicy has `mig.strategy: single`, pods must request `nvidia.com/gpu: 1`, NOT `nvidia.com/mig-*`. - -. Check MIG strategy -+ -[source,bash] ----- -$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.mig.strategy}' -single ----- -+ -**Solution**: Change pod resource request from `nvidia.com/mig-2g.10gb: 1` to `nvidia.com/gpu: 1` - -=== MIG Reconfiguration Takes Longer Than 20 Minutes - -**Symptom:** Node labeled 30+ minutes ago but `mig.config.state` still shows `pending` - -**Diagnosis:** - -. Check if node has running GPU workloads -+ -[source,bash] ----- -$ oc get pods --all-namespaces -o wide | grep worker-gpu-0.example.com | grep -E "nvidia.com/(gpu|mig)" ----- -+ -If GPU workloads are running, MIG Manager cannot reconfigure until they terminate. - -. Drain node to evict GPU workloads -+ -[source,bash] ----- -$ oc adm drain worker-gpu-0.example.com --ignore-daemonsets --delete-emptydir-data ----- -+ -Wait for pod eviction (5-10 minutes), then check `mig.config.state` again. - -. Check MIG Manager pod status -+ -[source,bash] ----- -$ oc get pods -n nvidia-gpu-operator -l app=nvidia-mig-manager -NAME READY STATUS RESTARTS AGE -nvidia-mig-manager-xxxxx 1/1 Running 3 45m ----- -+ -If `RESTARTS > 0` and increasing, check logs for crash loop: -+ -[source,bash] ----- -$ oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --previous ----- - -=== Verification: nvidia-smi Shows Different Instance Count Than Expected - -**Symptom:** Applied `all-1g.5gb` expecting 7 instances but `nvidia-smi -L` shows 3 instances - -**Possible Causes:** - -1. **Node has A100-80GB instead of A100-40GB**: Profile `all-1g.5gb` doesn't exist for 80GB model - - **Solution**: Use `all-1g.10gb` for A100-80GB GPUs - -2. **Custom ConfigMap overriding built-in profile**: Your ConfigMap may define `all-1g.5gb` differently - - **Solution**: Check ConfigMap content, remove custom definition to use auto-generated profile - -3. **MIG reconfiguration incomplete**: Device Plugin hasn't updated yet - - **Solution**: Wait full 20 minutes, check `nvidia.com/mig.config.state=success` - -== What's next - -In the next chapter, you will build a comprehensive observability stack by enabling OpenShift user-workload monitoring, exposing GPU metrics via DCGM Exporter, deploying the Grafana Operator, and creating custom dashboards to visualize **per-MIG-instance** telemetry alongside platform metrics. 
-
-**Chapter 3 Preview**: You'll learn to:
-
-* Monitor individual MIG instance utilization (compute %, memory usage)
-* Correlate MIG instance metrics with pod resource requests
-* Create alerts for MIG instance oversubscription or underutilization
-* Visualize the ROI gains from MIG partitioning (utilization improvement from 33% to 78%)
-* Track MIG reconfiguration events and their impact on workload SLAs
-
-The observability stack validates the performance isolation and resource efficiency benefits you configured in this lab.
-
-////
-**MIG Slicing:**
-Update node labels to apply a MIG partitioning profile (e.g., `nvidia.com/mig.config=all-1g.10gb`). Create a custom `mig-parted` config resource file for specific hardware deployments. Verify that the sliced MIG instances are exposed as allocatable resources on the node.
-////
+====
\ No newline at end of file
diff --git a/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc b/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc
index 35a4133..3c2cb37 100644
--- a/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc
+++ b/modules/ch3-observability/pages/s2-expose-metrics-lab.adoc
@@ -4,6 +4,11 @@
 
 _Estimated reading time: *{time_estimate} minutes*._
 
+[WARNING]
+====
+This lab is a work in progress. The content is still being developed and may contain placeholders or incomplete information. Please refer to the final version for the complete lab experience.
+====
+
 Objective::
 Enable OpenShift user-workload monitoring by modifying the cluster-monitoring-config ConfigMap, and label the GPU Operator namespace to expose GPU telemetry for Prometheus scraping via the NVIDIA DCGM Exporter.
diff --git a/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc b/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc
index f05dd0c..61bc837 100644
--- a/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc
+++ b/modules/ch3-observability/pages/s3-grafana-setup-lab.adoc
@@ -4,6 +4,11 @@
 
 _Estimated reading time: *{time_estimate} minutes*._
 
+[WARNING]
+====
+This lab is a work in progress. The content is still being developed and may contain placeholders or incomplete information. Please refer to the final version for the complete lab experience.
+====
+
 Objective::
 Install the Grafana Operator from OperatorHub, create a Grafana instance Custom Resource, and set up secure access to cluster metrics by creating a ServiceAccount with cluster-monitoring-view permissions and configuring a GrafanaDatasource pointing to the Thanos Querier endpoint.
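+
+[NOTE]
+====
+A minimal sketch of the metrics-access portion of this objective, assuming an illustrative `grafana` namespace and a ServiceAccount named `grafana-sa` (both hypothetical names). The detailed lab steps on this page remain the authoritative procedure, and `oc create token` requires a recent OpenShift release.
+
+[source,bash]
+----
+# ServiceAccount that Grafana uses to read cluster metrics
+$ oc create serviceaccount grafana-sa -n grafana
+
+# Grant read-only access to platform and user-workload metrics
+$ oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-sa -n grafana
+
+# Short-lived bearer token to reference from the GrafanaDatasource
+$ oc create token grafana-sa -n grafana
+
+# Thanos Querier endpoint the GrafanaDatasource points to:
+# https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
+----
+====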