From ce86353e60f7ab3ec367709e051b5d2d8ec69e6d Mon Sep 17 00:00:00 2001
From: windsonsea
Date: Mon, 16 Mar 2026 10:25:55 +0800
Subject: [PATCH] Update heading levels in get-started/deploy-with-helm.md

Signed-off-by: windsonsea
---
 docs/get-started/deploy-with-helm.md          | 84 ++++++++-----------
 .../current/get-started/deploy-with-helm.md   | 38 ++++-----
 2 files changed, 54 insertions(+), 68 deletions(-)

diff --git a/docs/get-started/deploy-with-helm.md b/docs/get-started/deploy-with-helm.md
index 80781bbd..561ffa90 100644
--- a/docs/get-started/deploy-with-helm.md
+++ b/docs/get-started/deploy-with-helm.md
@@ -2,37 +2,33 @@
 title: Deploy HAMi using Helm
 ---
 
-This guide will cover:
+This guide covers:
 
-- Configure nvidia container runtime in each GPU nodes
-- Install HAMi using helm
-- Launch a vGPU task
-- Check if the corresponding device resources are limited inside container
+- Configuring NVIDIA container runtime on each GPU node
+- Deploying HAMi using Helm
+- Launching a vGPU task
+- Verifying container resource limits
 
 ## Prerequisites {#prerequisites}
 
-- [Helm](https://helm.sh/zh/docs/) version v3+
-- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) version v1.16+
-- [CUDA](https://developer.nvidia.com/cuda-toolkit) version v10.2+
-- [NvidiaDriver](https://www.nvidia.cn/drivers/unix/) v440+
+- [Helm](https://helm.sh/zh/docs/) v3+
+- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) v1.16+
+- [CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2+
+- [NVIDIA Driver](https://www.nvidia.cn/drivers/unix/) v440+
 
 ## Installation {#installation}
 
-### Configure nvidia-container-toolkit {#configure-nvidia-container-toolkit}
+### 1. Configure nvidia-container-toolkit {#configure-nvidia-container-toolkit}
 
- Configure nvidia-container-toolkit
+Perform the following steps on all GPU nodes.
 
-Execute the following steps on all your GPU nodes.
+This guide assumes that NVIDIA drivers and the `nvidia-container-toolkit` are already installed, and that `nvidia-container-runtime` is set as the default low-level runtime.
 
-This README assumes pre-installation of NVIDIA drivers and the
-`nvidia-container-toolkit`. Additionally, it assumes configuration of the
-`nvidia-container-runtime` as the default low-level runtime.
+See [nvidia-container-toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
 
-Please see: [nvidia-container-toolkit install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
+The following example applies to Debian-based systems using Docker or containerd:
 
-#### Example for debian-based systems with `Docker` and `containerd` {#example-for-debian-based-systems-with-docker-and-containerd}
-
-##### Install the `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit}
+#### Install the `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit}
 
 ```bash
 distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
@@ -43,11 +39,9 @@ curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-
 sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
 ```
 
-##### Configure `Docker` {#configure-docker}
+#### Configure Docker {#configure-docker}
 
-When running `Kubernetes` with `Docker`, edit the configuration file,
-typically located at `/etc/docker/daemon.json`, to set up
-`nvidia-container-runtime` as the default low-level runtime:
+When running Kubernetes with Docker, edit the configuration file (usually `/etc/docker/daemon.json`) to set `nvidia-container-runtime` as the default runtime:
 
 ```json
 {
@@ -61,17 +55,15 @@ typically located at `/etc/docker/daemon.json`, to set up
 }
 ```
 
-And then restart `Docker`:
+Restart Docker:
 
 ```bash
 sudo systemctl daemon-reload && systemctl restart docker
 ```
 
-##### Configure `containerd` {#configure-containerd}
+#### Configure containerd {#configure-containerd}
 
-When running `Kubernetes` with `containerd`, modify the configuration file
-typically located at `/etc/containerd/config.toml`, to set up
-`nvidia-container-runtime` as the default low-level runtime:
+When using Kubernetes with containerd, modify the configuration file (usually `/etc/containerd/config.toml`) to set `nvidia-container-runtime` as the default runtime:
 
 ```toml
 version = 2
@@ -90,38 +82,35 @@ version = 2
         BinaryName = "/usr/bin/nvidia-container-runtime"
 ```
 
-And then restart `containerd`:
+Restart containerd:
 
 ```bash
 sudo systemctl daemon-reload && systemctl restart containerd
 ```
 
-#### 2. Label your nodes {#label-your-nodes}
+### 2. Label your nodes {#label-your-nodes}
 
-Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on".
-Without this label, the nodes cannot be managed by our scheduler.
+Label your GPU nodes for HAMi scheduling with `gpu=on`. Nodes without this label cannot be managed by the scheduler.
 
 ```bash
 kubectl label nodes {nodeid} gpu=on
 ```
 
-#### 3. Deploy HAMi using Helm {#deploy-hami-using-helm}
+### 3. Deploy HAMi using Helm {#deploy-hami-using-helm}
 
-First, you need to check your Kubernetes version by using the following command:
+Check your Kubernetes version:
 
 ```bash
 kubectl version
 ```
 
-Then, add our repo in helm
+Add the Helm repository:
 
 ```bash
 helm repo add hami-charts https://project-hami.github.io/HAMi/
 ```
 
-During installation, set the Kubernetes scheduler image version to match your
-Kubernetes server version. For instance, if your cluster server version is
-1.16.8, use the following command for deployment:
+During installation, set the Kubernetes scheduler image to match your cluster version. For example, if your cluster version is 1.16.8:
 
 ```bash
 helm install hami hami-charts/hami \
@@ -129,14 +118,13 @@ helm install hami hami-charts/hami \
   -n kube-system
 ```
 
-If everything goes well, you will see both vgpu-device-plugin and vgpu-scheduler pods are in the Running state
+If successful, both `vgpu-device-plugin` and `vgpu-scheduler` pods should be in the `Running` state.
 
-### Demo {#demo}
+## Demo {#demo}
 
-#### 1. Submit demo task {#submit-demo-task}
+### 1. Submit demo task {#submit-demo-task}
 
-Containers can now request NVIDIA vGPUs using the `nvidia.com/gpu` resource
-type.
+Containers can now request NVIDIA vGPUs using the `nvidia.com/gpu` resource type.
 
 ```yaml
 apiVersion: v1
@@ -150,19 +138,19 @@ spec:
       command: ["bash", "-c", "sleep 86400"]
       resources:
         limits:
-          nvidia.com/gpu: 1 # requesting 1 vGPUs
-          nvidia.com/gpumem: 10240 # Each vGPU contains 10240m device memory (Optional,Integer)
+          nvidia.com/gpu: 1 # Request 1 vGPU
+          nvidia.com/gpumem: 10240 # Each vGPU provides 10240 MiB device memory (optional)
 
-#### 2. Verify in container resource control {#verify-in-container-resource-control}
+### 2. Verify container resource limits {#verify-in-container-resource-control}
 
-Execute the following query command:
+Run the following command:
 
 ```bash
 kubectl exec -it gpu-pod nvidia-smi
 ```
 
-The result should be:
+Expected output:
 
 ```text
 [HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: Initializing.....
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/get-started/deploy-with-helm.md b/i18n/zh/docusaurus-plugin-content-docs/current/get-started/deploy-with-helm.md
index 7aa5ea89..a9f81091 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/get-started/deploy-with-helm.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/get-started/deploy-with-helm.md
@@ -5,7 +5,7 @@ title: 使用 Helm 部署 HAMi
 本指南将涵盖:
 
 - 为每个 GPU 节点配置 NVIDIA 容器运行时
-- 使用 Helm 安装 HAMi
+- 使用 Helm 部署 HAMi
 - 启动 vGPU 任务
 - 验证容器内设备资源是否受限
@@ -16,21 +16,19 @@
 - [CUDA](https://developer.nvidia.com/cuda-toolkit) v10.2+
 - [NVIDIA 驱动](https://www.nvidia.cn/drivers/unix/) v440+
 
-## 安装步骤 {#installation}
+## 安装步骤 {#installation}
 
-### 配置 nvidia-container-toolkit {#configure-nvidia-container-toolkit}
+### 1. 配置 nvidia-container-toolkit {#configure-nvidia-container-toolkit}
 
- 配置 nvidia-container-toolkit
+在所有 GPU 节点执行此操作。
 
-在所有 GPU 节点执行以下操作。
-
-本文档假设已预装 NVIDIA 驱动和 `nvidia-container-toolkit`,并已将 `nvidia-container-runtime` 配置为默认底层运行时。
+本文假设已预装 NVIDIA 驱动和 `nvidia-container-toolkit`,并已将 `nvidia-container-runtime` 配置为默认底层运行时。
 
 参考:[nvidia-container-toolkit 安装指南](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
 
-#### 基于 Debian 系统(使用 `Docker` 和 `containerd`)示例 {#example-for-debian-based-systems-with-docker-and-containerd}
+以下是基于 Debian 系统(使用 Docker 和 containerd)的示例:
 
-##### 安装 `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit}
+#### 安装 `nvidia-container-toolkit` {#install-the-nvidia-container-toolkit}
 
 ```bash
 distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
@@ -41,9 +39,9 @@ curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-
 sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
 ```
 
-##### 配置 `Docker` {#configure-docker}
+#### 配置 Docker {#configure-docker}
 
-当使用 `Docker` 运行 `Kubernetes` 时,编辑配置文件(通常位于 `/etc/docker/daemon.json`),将
+当使用 Docker 运行 Kubernetes 时,编辑配置文件(通常位于 `/etc/docker/daemon.json`),将
 `nvidia-container-runtime` 设为默认底层运行时:
 
 ```json
@@ -58,15 +56,15 @@ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
 }
 ```
 
-然后重启 `Docker`:
+然后重启 Docker:
 
 ```bash
 sudo systemctl daemon-reload && systemctl restart docker
 ```
 
-##### 配置 `containerd` {#configure-containerd}
+#### 配置 containerd {#configure-containerd}
 
-当使用 `containerd` 运行 `Kubernetes` 时,修改配置文件(通常位于 `/etc/containerd/config.toml`),将
+当使用 containerd 运行 Kubernetes 时,修改配置文件(通常位于 `/etc/containerd/config.toml`),将
 `nvidia-container-runtime` 设为默认底层运行时:
 
 ```toml
@@ -86,13 +84,13 @@ version = 2
         BinaryName = "/usr/bin/nvidia-container-runtime"
 ```
 
-然后重启 `containerd`:
+然后重启 containerd:
 
 ```bash
 sudo systemctl daemon-reload && systemctl restart containerd
 ```
 
-#### 2. 标记节点 {#label-your-nodes}
+### 2. 标记节点 {#label-your-nodes}
 
 通过添加 "gpu=on" 标签将 GPU 节点标记为可调度 HAMi 任务。未标记的节点将无法被调度器管理。
 
@@ -100,7 +98,7 @@
 kubectl label nodes {节点ID} gpu=on
 ```
 
-#### 3. 使用 Helm 部署 HAMi {#deploy-hami-using-helm}
+### 3. 使用 Helm 部署 HAMi {#deploy-hami-using-helm}
 
 首先通过以下命令确认 Kubernetes 版本:
 
@@ -124,9 +122,9 @@ helm install hami hami-charts/hami \
 
 若一切正常,可见 vgpu-device-plugin 和 vgpu-scheduler 的 Pod 均处于 Running 状态。
 
-### 演示 {#demo}
+## 演示 {#demo}
 
-#### 1. 提交演示任务 {#submit-demo-task}
+### 1. 提交演示任务 {#submit-demo-task}
 
 容器现在可通过 `nvidia.com/gpu` 资源类型申请 NVIDIA vGPU:
 
@@ -146,7 +144,7 @@ spec:
         nvidia.com/gpumem: 10240 # 每个 vGPU 包含 10240m 设备显存(可选,整型)
 ```
 
-#### 2. 验证容器内资源限制 {#verify-in-container-resource-control}
+### 2. 验证容器内资源限制 {#verify-in-container-resource-control}
 
 执行查询命令: