diff --git a/versioned_docs/version-v2.8.0/get-started/verify-hami.md b/versioned_docs/version-v2.8.0/get-started/verify-hami.md
new file mode 100644
index 00000000..4c8f8c9f
--- /dev/null
+++ b/versioned_docs/version-v2.8.0/get-started/verify-hami.md
@@ -0,0 +1,127 @@
+---
+title: Validate HAMi Setup and vGPU Behavior
+sidebar_label: Validate HAMi
+---
+# Validate HAMi Setup and vGPU Behavior
+
+## Scope and Assumptions
+
+This guide assumes that HAMi is already installed (for example, via the "Deploy HAMi using Helm" guide in the Get Started section).
+
+This document does not repeat the installation steps. Its goal is to validate that HAMi works correctly in a real Kubernetes environment, including GPU access and vGPU behavior.
+
+If HAMi is not yet installed, follow the deployment guide first.
+
+## Step 0: Configure Node Container Runtime (If not already done)
+HAMi requires `nvidia-container-toolkit` to be installed and set as the default low-level runtime on every GPU node.
+
+### 1. Install nvidia-container-toolkit (Debian/Ubuntu example)
+```
+distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
+curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
+  | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
+curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
+sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
+```
+
+### 2. Configure your runtime
+* For containerd: Edit `/etc/containerd/config.toml` to set the default runtime name to `"nvidia"` and the binary name to `"/usr/bin/nvidia-container-runtime"`.
+  * Restart: `sudo systemctl daemon-reload && sudo systemctl restart containerd`
+* For Docker: Edit `/etc/docker/daemon.json` to set `"default-runtime": "nvidia"`.
+  * Restart: `sudo systemctl daemon-reload && sudo systemctl restart docker`
+
+## Step 1: Validate the Native GPU Stack (Crucial Pre-flight Check)
+Before validating HAMi itself, confirm that Kubernetes can natively access the GPU.
+
+This step validates your GPU stack independently of HAMi.
+
+### 1. Deploy a native test pod
+```
+cat <