Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
122 changes: 122 additions & 0 deletions versioned_docs/version-v2.8.0/get-started/verify-hami.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
---
title: Validate HAMi Setup and vGPU Behavior
sidebar_label: Validate HAMi
---

# Validate HAMi Setup and vGPU Behavior

This guide provides a rapid, end-to-end setup to validate that GPU workloads run correctly in a Kubernetes cluster with HAMi.

What "working" actually means: A successful HAMi setup goes beyond just running pods or a successful Helm installation. It means the GPU is accessible inside a container, Kubernetes correctly advertises the resources, and vGPU isolation (like memory limits) behaves predictably.

## Step 0: Configure Node Container Runtime (If not already done)
HAMi requires the `nvidia-container-toolkit` to be installed and set as the default low-level runtime on all your GPU nodes.

### 1. Install nvidia-container-toolkit (Debian/Ubuntu example)
```
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sudo tee /etc/apt/sources.list.d/libnvidia-container.list
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
```

### 2. Configure your runtime
* For containerd: Edit `/etc/containerd/config.toml` to set the default runtime name to `"nvidia"` and the binary name to `"/usr/bin/nvidia-container-runtime"`.
* Restart: `sudo systemctl daemon-reload && systemctl restart containerd`
* For Docker: Edit `/etc/docker/daemon.json` to set `"default-runtime": "nvidia"`.
* Restart: `sudo systemctl daemon-reload && systemctl restart docker`

## Step 1: Validate the Native GPU Stack (Crucial Pre-flight Check)
Before installing HAMi, you must prove that Kubernetes can natively access the GPU.

This step validates your GPU stack independently of HAMi.

### 1. Deploy a native test pod
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: cuda-test
spec:
restartPolicy: Never
containers:
- name: cuda
image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: 1
EOF
```
Expected: You see valid `nvidia-smi` output. If this fails, do NOT continue. Fix your GPU setup first.

### 2. Verify execution
```
kubectl wait --for=condition=Succeeded pod/cuda-test --timeout=60s
kubectl logs cuda-test
```
Note: You must see the standard `nvidia-smi` output. Do not proceed if this fails.

## Step 2: Install HAMi
Once the baseline is verified, label your node so the HAMi scheduler can manage it, and deploy via Helm.

### 1. Label the node
```
kubectl label nodes $(hostname) gpu=on --overwrite
```

### 2. Deploy using Helm
```
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm install hami hami-charts/hami -n kube-system
```

### 3. Verify components
```
kubectl get pods -n kube-system | grep hami
```
Expected: Both `hami-scheduler` and `vgpu-device-plugin` pods should be in the `Running` state.

## Step 3: Launch and Verify a vGPU Task
Let's prove HAMi is enforcing fractional resource limits (vGPU).

### 1. Submit a vGPU demo task
```
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: ubuntu-container
image: ubuntu:18.04
command: ["bash", "-c", "sleep 86400"]
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/gpumem: 10240
EOF
```

### 2. Verify resource control inside the container
```
kubectl wait --for=condition=Ready pod/gpu-pod --timeout=60s
kubectl exec -it gpu-pod -- nvidia-smi
```
Expected: You will see the `[HAMI-core Msg...]` initialization lines, and the `nvidia-smi` table will show exactly `10240MiB` of Total Memory, proving vGPU isolation is active.

## Troubleshooting Order
If you encounter issues, follow this sequence:
1. Hardware/Drivers: Run `nvidia-smi` directly on the host.
2. Container Runtime: Ensure `sudo ctr run` or `docker run` works outside K8s.
3. Stale Plugins: Remove conflicting plugins: `kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system --ignore-not-found`.
4. Node Resources: Verify K8s sees the GPU: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | grep -i nvidia`.
5. Scheduler Layer: Check HAMi logs: `kubectl logs -n kube-system -l app=hami-scheduler`.

## Cleanup
```
kubectl delete pod cuda-test gpu-pod --ignore-not-found
```
3 changes: 2 additions & 1 deletion versioned_sidebars/version-v2.8.0-sidebars.json
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,8 @@
"type": "category",
"label": "Get Started",
"items": [
"get-started/deploy-with-helm"
"get-started/deploy-with-helm",
"get-started/verify-hami"
]
},
{
Expand Down