
use device serial number to construct id #17

Merged
archlitchi merged 2 commits into Project-HAMi:master from peachest:fix/uuid
Aug 7, 2025

Conversation

@peachest

fix #16

The current dcu-vgpu-device-plugin builds each device id that it reports and registers on the node by appending the device index to the DCU- prefix.

If there is only one DCU node in the cluster, this has no effect on HAMi scheduling. However, if there are multiple DCU nodes, the same index-based ids repeat on every node and the scheduler cannot schedule properly, for example when use-uuid is used to limit the scheduling range.
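To illustrate the collision, here is a minimal Go sketch. It does not use the plugin's code: `indexID` and `countCollisions` are hypothetical helpers, and the node names are made up. It only assumes the old scheme described above, where the id is the DCU- prefix plus the device index.

```go
package main

import "fmt"

// indexID mimics the old scheme: the "DCU-" prefix plus the device index.
// Hypothetical helper, not the plugin's actual function.
func indexID(index int) string {
	return fmt.Sprintf("DCU-%d", index)
}

// countCollisions counts id reports that repeat an id already
// reported by a different node.
func countCollisions(nodes map[string][]string) int {
	seen := map[string]string{} // id -> first node that reported it
	collisions := 0
	for node, ids := range nodes {
		for _, id := range ids {
			if first, ok := seen[id]; ok && first != node {
				collisions++
			} else {
				seen[id] = node
			}
		}
	}
	return collisions
}

func main() {
	// Two made-up nodes with 8 devices each, both using index-based ids.
	nodes := map[string][]string{}
	for _, node := range []string{"node-a", "node-b"} {
		for i := 0; i < 8; i++ {
			nodes[node] = append(nodes[node], indexID(i))
		}
	}
	fmt.Println("colliding ids:", countCollisions(nodes))
}
```

With index-based ids every id on the second node repeats one from the first, so an id alone cannot identify a single device in the cluster.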

The dcu-vgpu device plugin already integrates dcgm to get device information, and dcgm provides a ShowSerialNumber function that returns the serial number of a device for a given device index. Usage:

deviceSerialInfos, err := dcgm.ShowUId([]int{0, 1, 2, 3, 4, 5, 6, 7})

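The device id is constructed from the queried serial number, in the format DCU-<serial number>. This affects two places:

  1. the Annotation that the device plugin registers on the node, and
  2. device allocation for containers, where the device index must be looked up from the serial number so that the correct device file is mounted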
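The two points above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `deviceID`, `buildIndex` and `indexFor` are hypothetical names, the serial numbers are made up, and the sketch assumes the serials are already available as a slice ordered by device index rather than calling dcgm.

```go
package main

import (
	"fmt"
	"strings"
)

const idPrefix = "DCU-"

// deviceID builds the id reported to the node from a device serial number.
func deviceID(serial string) string {
	return idPrefix + serial
}

// buildIndex maps each device id back to its local device index.
// Assumption: serials[i] is the serial number of the device at index i.
func buildIndex(serials []string) map[string]int {
	byID := make(map[string]int, len(serials))
	for i, s := range serials {
		byID[deviceID(s)] = i
	}
	return byID
}

// indexFor resolves an allocated device id to the device index,
// which is what is needed to pick the right device file to mount.
func indexFor(byID map[string]int, id string) (int, error) {
	if !strings.HasPrefix(id, idPrefix) {
		return 0, fmt.Errorf("unexpected device id %q", id)
	}
	idx, ok := byID[id]
	if !ok {
		return 0, fmt.Errorf("device id %q not found on this node", id)
	}
	return idx, nil
}

func main() {
	byID := buildIndex([]string{"SN-AAA", "SN-BBB"}) // made-up serials
	idx, err := indexFor(byID, "DCU-SN-BBB")
	fmt.Println(idx, err)
}
```

Keeping the id-to-index map on the plugin side means the scheduler only ever sees the serial-based ids, while the index stays a node-local detail.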

Test results:

According to the log timestamps, the call to dcgm.ShowSerialNumber takes no more than 1s, so the time overhead is small.
Pasted image 20250717194850

The node Annotation is as follows:
Pasted image 20250717195124

Requesting 4 cards, numbers 0~3, in an 8-card cluster; the Annotation of the successfully scheduled Pod is as follows:
image

Devices actually mounted inside the container:
Pasted image 20250717205239

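Conclusion: scheduling works normally without modifying the scheduler, and the feature of specifying the scheduling range by uuid also works normally.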

@hami-robot

hami-robot bot commented Jul 18, 2025

Welcome @peachest! It looks like this is your first PR to Project-HAMi/dcu-vgpu-device-plugin 🎉

houyuxi added 2 commits July 18, 2025 11:01
…anism

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@Nimbus318

Thanks for the contribution! Could you also add an example for specifying UUID in the following documentation paths?

You can refer to the NVIDIA examples below for formatting and content:

No need to update anything in versioned_docs; just adding it under the current docs will ensure it's visible in the latest (Next) version. Thanks!
CleanShot 2025-07-18 at 16 50 59@2x

@peachest
Author

Thanks for the contribution! Could you also add an example for specifying UUID in the following documentation paths?

You can refer to the NVIDIA examples below for formatting and content:

No need to update anything in versioned_docs; just adding it under the current docs will ensure it's visible in the latest (Next) version. Thanks!
CleanShot 2025-07-18 at 16 50 59@2x

The corresponding PR, Project-HAMi/website#87, has been submitted. Please take a look.

@peachest
Author

According to the metrics returned by the official dcu-exporter provided by Hygon, the device id there is also built from the serial number.
This further supports using serial numbers to construct DCU device ids in HAMi.

image

Member

@archlitchi archlitchi left a comment


/lgtm

@hami-robot

hami-robot bot commented Aug 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: archlitchi, peachest

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@archlitchi archlitchi merged commit 8160fc0 into Project-HAMi:master Aug 7, 2025
5 of 7 checks passed

Development

Successfully merging this pull request may close these issues.

Use device uid instead of constructing fake id with index

3 participants