Merged
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -271,4 +271,4 @@ For example:

## Example

See the [example file](./examples/defalt_use.md) for a complete working example.
See the [example file](./examples/defalt-use.md) for a complete working example.
8 changes: 8 additions & 0 deletions i18n/zh/docusaurus-plugin-content-docs/current.json
@@ -83,6 +83,10 @@
"message": "管理 AWS Neuron 设备",
"description": "The label for category 'Managing AWS Neuron devices' in sidebar 'docs'"
},
"sidebar.docs.category.Managing Vastai devices": {
"message": "管理 Vastai 设备",
"description": "The label for category 'Managing Vastai devices' in sidebar 'docs'"
},
"sidebar.docs.category.Optimize Kunlunxin devices scheduling": {
"message": "优化昆仑芯设备调度",
"description": "The label for category 'Optimize Kunlunxin devices scheduling' in sidebar 'docs'"
@@ -119,6 +123,10 @@
"message": "示例",
"description": "The label for category 'Examples' in sidebar 'docs'"
},
"sidebar.docs.category.vastai-examples": {
"message": "示例",
"description": "The label for category 'Examples' in sidebar 'docs'"
},
"sidebar.docs.category.metax-sgpu-examples": {
"message": "示例",
"description": "The label for category 'Examples' in sidebar 'docs'"
@@ -3,8 +3,6 @@ title: 启用 AWS-Neuron 设备共享
linktitle: AWS-Neuron 共享
---

## 启用 AWS-Neuron 设备共享

AWS Neuron 设备是 AWS 专为机器学习工作负载设计的硬件加速器,特别针对深度学习推理和训练场景进行了优化。这些设备属于 AWS Inferentia 和 Trainium 产品家族,可在 AWS 云上为 AI 应用提供高性能、高性价比且可扩展的解决方案。

HAMi 现已集成[my-scheduler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#deploy-neuron-scheduler-extension),提供以下核心功能:
@@ -23,7 +23,7 @@ translated: true

## 开启 MLU 复用

* 通过 helm 部署本组件,参照[主文档中的开启 vgpu 支持章节](https://github.com/Project-HAMi/HAMi/blob/master/README_cn.md#kubernetes开启vgpu支持)
* 通过 helm 部署本组件,参照[主文档中的开启 vgpu 支持章节](https://github.com/Project-HAMi/HAMi/blob/master/readme_cn.md#kubernetes开启vgpu支持)

* 使用以下指令,为 MLU 节点打上 label

@@ -138,8 +138,8 @@ spec:

该 ClusterQueue 跟踪以下资源:

* `nvidia.com/total-gpucores`:所有 vGPU 的 GPU 核心总量(每个单位表示 1% 的 GPU 核心)
* `nvidia.com/total-gpumem`:所有 vGPU 的 GPU 显存总量(单位为 MiB)
- `nvidia.com/total-gpucores`:所有 vGPU 的 GPU 核心总量(每个单位表示 1% 的 GPU 核心)
- `nvidia.com/total-gpumem`:所有 vGPU 的 GPU 显存总量(单位为 MiB)
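
跟踪这两项资源的 ClusterQueue 片段大致形如下(队列名与配额数值均为假设示例,仅作示意):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: hami-cluster-queue        # 名称为假设
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
      flavors:
        - name: default-flavor
          resources:
            - name: "nvidia.com/total-gpucores"
              nominalQuota: 800    # 例:8 张整卡的核心总量(每单位 1%)
            - name: "nvidia.com/total-gpumem"
              nominalQuota: 81920  # 例:80 GiB 显存(单位 MiB)
```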

### 创建 LocalQueue

@@ -195,10 +195,10 @@ spec:

在该示例中:

* `kueue.x-k8s.io/queue-name` 标签将 Deployment 关联到 `hami-local-queue`
* `nvidia.com/gpu: 1` 请求 1 个 vGPU
* `nvidia.com/gpucores: 50` 为每个 vGPU 请求 50% 的 GPU 核心
* `nvidia.com/gpumem: 1024` 为每个 vGPU 请求 1024 MiB 的 GPU 显存
- `kueue.x-k8s.io/queue-name` 标签将 Deployment 关联到 `hami-local-queue`
- `nvidia.com/gpu: 1` 请求 1 个 vGPU
- `nvidia.com/gpucores: 50` 为每个 vGPU 请求 50% 的 GPU 核心
- `nvidia.com/gpumem: 1024` 为每个 vGPU 请求 1024 MiB 的 GPU 显存
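
上述标签与请求在 Deployment 的 Pod 模板中大致呈现为如下片段(容器名为假设,字段以实际清单为准):

```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: hami-local-queue
spec:
  containers:
    - name: cuda-app                 # 容器名为假设
      resources:
        limits:
          nvidia.com/gpu: 1          # 1 个 vGPU
          nvidia.com/gpucores: 50    # 每个 vGPU 50% 的 GPU 核心
          nvidia.com/gpumem: 1024    # 每个 vGPU 1024 MiB 显存
```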

### 请求多个 vGPU 的 Deployment

@@ -260,14 +260,14 @@ status:

Kueue 的 ResourceTransformation 会自动转换 HAMi vGPU 的资源请求:

* `nvidia.com/gpu` × `nvidia.com/gpucores` → `nvidia.com/total-gpucores`
* `nvidia.com/gpu` × `nvidia.com/gpumem` → `nvidia.com/total-gpumem`
- `nvidia.com/gpu` × `nvidia.com/gpucores` → `nvidia.com/total-gpucores`
- `nvidia.com/gpu` × `nvidia.com/gpumem` → `nvidia.com/total-gpumem`

例如:

* 一个 Deployment 有 2 个副本,每个副本请求 `nvidia.com/gpu: 1`、`nvidia.com/gpucores: 50` 和 `nvidia.com/gpumem: 1024`
* 将消耗:`nvidia.com/total-gpucores: 100`(2 个副本 × 1 GPU × 50 核心)以及 `nvidia.com/total-gpumem: 2048`(2 个副本 × 1 GPU × 1024 MiB)
- 一个 Deployment 有 2 个副本,每个副本请求 `nvidia.com/gpu: 1`、`nvidia.com/gpucores: 50` 和 `nvidia.com/gpumem: 1024`
- 将消耗:`nvidia.com/total-gpucores: 100`(2 个副本 × 1 GPU × 50 核心)以及 `nvidia.com/total-gpumem: 2048`(2 个副本 × 1 GPU × 1024 MiB)
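
上面的换算可以用一段简单的 shell 算术示意(变量名为假设,仅作演示):

```bash
# Kueue ResourceTransformation 的换算:副本数 × GPU 数 × 每卡配额
replicas=2; gpus=1; gpucores=50; gpumem=1024
total_gpucores=$(( replicas * gpus * gpucores ))
total_gpumem=$(( replicas * gpus * gpumem ))
echo "nvidia.com/total-gpucores: ${total_gpucores}"
echo "nvidia.com/total-gpumem: ${total_gpumem}"
```

与上文示例一致,输出分别为 100 与 2048。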

## 示例

完整可运行示例请参见 [示例文件](./examples/defalt_use.md)。
完整可运行示例请参见 [示例文件](./examples/defalt-use.md)。
@@ -16,14 +16,14 @@ translated: true

### 需求

* Metax Driver >= 2.32.0
* Metax GPU Operator >= 0.10.2
* Kubernetes >= 1.23
- Metax Driver >= 2.32.0
- Metax GPU Operator >= 0.10.2
- Kubernetes >= 1.23

### 开启复用沐曦设备

* 部署 Metax GPU Operator (请联系您的设备提供方获取)
* 根据 readme.md 部署 HAMi
- 部署 Metax GPU Operator (请联系您的设备提供方获取)
- 根据 README.md 部署 HAMi

### 运行沐曦任务

@@ -13,7 +13,7 @@ title: Prerequisites

Execute the following steps on all your GPU nodes.

This README assumes pre-installation of NVIDIA drivers and the `nvidia-container-toolkit`. Additionally, it assumes configuration of the `nvidia-container-runtime` as the default low-level runtime.
This readme assumes pre-installation of NVIDIA drivers and the `nvidia-container-toolkit`. Additionally, it assumes configuration of the `nvidia-container-runtime` as the default low-level runtime.

Please see: [https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

@@ -47,17 +47,16 @@ When running `Kubernetes` with `Docker`, edit the configuration file, typically

And then restart `Docker`:

```
```bash
sudo systemctl daemon-reload && sudo systemctl restart docker
```
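
The `daemon.json` contents are collapsed in this diff; a typical configuration, following NVIDIA's container-toolkit install guide, looks like the sketch below (the runtime path assumes a default package install):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```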


#### Configure `containerd`

When running `Kubernetes` with `containerd`, modify the configuration file typically located at `/etc/containerd/config.toml`, to set up
`nvidia-container-runtime` as the default low-level runtime:

```
```toml
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
# … (remaining runtime configuration collapsed in the diff view)
```

@@ -76,14 +75,14 @@

And then restart `containerd`:

```
```bash
sudo systemctl daemon-reload && systemctl restart containerd
```

### Label your nodes

Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". Without this label, the nodes cannot be managed by our scheduler.

```
```bash
kubectl label nodes {nodeid} gpu=on
```
@@ -6,7 +6,7 @@ title: WebUI Installation

This topic includes instructions for installing and running HAMi-WebUI on Kubernetes using Helm Charts.

The WebUI can only be accessed by your localhost, so you need to connect your localhost to the cluster by configuring `~/.kube/config`
The WebUI can only be accessed from your local machine, so you need to connect to the cluster by configuring `~/.kube/config`.

[Helm](https://helm.sh/) is an open-source command line tool used for managing Kubernetes applications. It is a graduate project in the [CNCF Landscape](https://www.cncf.io/projects/helm/).

@@ -58,7 +58,6 @@ To set up the HAMi-WebUI Helm repository so that you download the correct HAMi-W

1. Configure `~/.kube/config` on your local machine so that it can connect to your cluster.


2. Run the following command to port-forward the HAMi-WebUI service to port `3000` on your local machine.

```bash
# … (port-forward command collapsed in the diff view)
```

@@ -88,7 +87,6 @@

```bash
kubectl logs --namespace=hami deploy/my-hami-webui -c hami-webui-be-oss
```
For more information about accessing Kubernetes application logs, refer to [Pods](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#interacting-with-running-pods) and [Deployments](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#interacting-with-deployments-and-services).


## Uninstall the HAMi-WebUI deployment

To uninstall the HAMi-WebUI deployment, run the command:
@@ -105,4 +103,4 @@ If you want to delete the namespace `hami`, then run the command:

```bash
kubectl delete namespace hami
```
```
@@ -5,13 +5,13 @@ translated: true

## 介绍

**我们目前支持复用沐曦GPU设备,提供与vGPU类似的复用功能**,包括:
**我们目前支持复用沐曦 GPU 设备,提供与 vGPU 类似的复用功能**,包括:

***GPU 共享***: 每个任务可以只占用一部分显卡,多个任务可以共享一张显卡

***可限制分配的显存大小***: 你现在可以用显存值(例如4G)来分配GPU,本组件会确保任务使用的显存不会超过分配数值
***可限制分配的显存大小***: 你现在可以用显存值(例如 4G)来分配 GPU,本组件会确保任务使用的显存不会超过分配数值

***可限制计算单元数量***: 你现在可以指定任务使用的算力比例(例如60即代表使用60%算力)来分配GPU,本组件会确保任务使用的算力不会超过分配数值
***可限制计算单元数量***: 你现在可以指定任务使用的算力比例(例如 60 即代表使用 60% 算力)来分配 GPU,本组件会确保任务使用的算力不会超过分配数值
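
结合上述能力,一个沐曦任务的资源请求片段大致如下(资源名与数值均为假设示意,请以实际环境的 HAMi 文档为准):

```yaml
resources:
  limits:
    metax-tech.com/sgpu: 1     # 申请 1 个共享 GPU(资源名为假设)
    metax-tech.com/vcore: 60   # 60% 算力
    metax-tech.com/vmemory: 4  # 4 GiB 显存
```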

### 需求

@@ -21,8 +21,8 @@

### 开启复用沐曦设备

* 部署Metax GPU Operator (请联系您的设备提供方获取)
* 根据readme.md部署HAMi
* 部署 Metax GPU Operator (请联系您的设备提供方获取)
* 根据 README.md 部署 HAMi

### 运行沐曦任务
