feat(ascend): initial Ascend backend and add elementwise add op #564
Merged
chenghuaWang
merged 16 commits into
UbiquitousLearning:main
from
lywbarca:lyw/ascend-backend
Dec 23, 2025
0ffe938
feat(ascend): add simple Ascend add demo
70d98a9
feat(ascend memory): introduce memory pool to Ascend backend
1d6bd24
feat(ascend backend): create Ascend backend runtime, allocator and di…
bf060a9
feat(ascend): add Ascend elementwise ops
a618fd8
fix(ascend):add enum for Ascend
eb52ca8
feat(ascend): create for Ascend
d44f81c
fix(ascend): fix critical issues from CodeRabbit review
e25dd34
feat(ascned):add core design document of ascend backend
e9e6842
fix(ascend): add result validation and timing measurement
852406a
fix(ascend): use the common code path of setup
f82e1ac
fix(ascend): fix some problem of document
beeabed
feat(ascend): add a X2X op for transmitting tensor from cpu to npu or…
add14d1
fix(ascend): create the test part
6b88b8d
fix(ascend): move to the test folder
2d25eda
Update build_arm_ascend.yaml
chenghuaWang 2fa8eed
fix(ascend): address review comments
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,191 @@ | ||
| Ascend Backend | ||
| ======================== | ||
|
|
||
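| Overview | ||
| -------- | ||
| The Ascend Backend connects mLLM's operator execution to Huawei Ascend NPUs. It provides end-to-end scheduling, memory management, and operator lifecycle management so that models run efficiently on Ascend hardware. | ||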
|
|
||
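| Design Goals | ||
| ------------ | ||
| - Unified backend: a native mLLM backend that shares the framework's common interfaces and dispatch flow. | ||
| - ATB single-operator validation: exercise the complete path of an operator from the framework down to the NPU. | ||
| - Lifecycle management: a unified abstraction for operator creation, preparation, execution, and destruction. | ||
| - Memory management: a dedicated Ascend device memory pool that reduces repeated allocation and release. | ||
| - Extensibility: make it easy to add new operators, execution modes, and performance optimizations. | ||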
|
|
||
| Architecture Components | ||
| ----------------------- | ||
|
|
||
| The architecture diagram is shown below: | ||
|
|
||
| :: | ||
|
|
||
|     ┌──────────────────────────────────────────────────────────────┐ | ||
|     │                        mLLM Framework                        │ | ||
|     │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │ | ||
|     │   │    Module    │    │    Layer     │    │  Dispatcher  │   │ | ||
|     │   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘   │ | ||
|     └──────────┼───────────────────┼───────────────────┼───────────┘ | ||
|                │                   │                   │ | ||
|                └───────────────────┼───────────────────┘ | ||
|                                    │ | ||
|     ┌──────────────────────────────▼───────────────────────────────┐ | ||
|     │                Ascend Backend Infrastructure                 │ | ||
|     │                                                              │ | ||
|     │  ┌────────────────────────────────────────────────────────┐  │ | ||
|     │  │            AscendBackend (core management)             │  │ | ||
|     │  │  - Device/operator registration   - Allocator binding  │  │ | ||
|     │  │  - Device info logging                                 │  │ | ||
|     │  └─────────┬──────────────────────────────────────────────┘  │ | ||
|     │            │                                                 │ | ||
|     │        ┌───┴────────────┬─────────────┬──────────────┐       │ | ||
|     │        │                │             │              │       │ | ||
|     │        ▼                ▼             ▼              ▼       │ | ||
|     │ AscendDispatcher AscendAllocator Ascend Ops    AscendCommon  │ | ||
|     │ (executes op /   MemoryManager   (Add for now; (shared code) │ | ||
|     │  module tasks)   (memory pool)    graph later)               │ | ||
|     │                                                              │ | ||
|     └──────────────────────────────┬───────────────────────────────┘ | ||
|                                    │ | ||
|     ┌──────────────────────────────▼───────────────────────────────┐ | ||
|     │                        Ascend Runtime                        │ | ||
|     │   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │ | ||
|     │   │ ATB context  │    │  ACL stream  │    │ ATB/ACL APIs │   │ | ||
|     │   └──────────────┘    └──────────────┘    └──────────────┘   │ | ||
|     │                                                              │ | ||
|     │  ┌────────────────────────────────────────────────────────┐  │ | ||
|     │  │         Ascend NPU hardware (Orange Pi AIpro)          │  │ | ||
|     │  └────────────────────────────────────────────────────────┘  │ | ||
|     └──────────────────────────────────────────────────────────────┘ | ||
|
|
||
| Key Modules | ||
| ----------- | ||
|
|
||
| 1. mLLM Framework Layer | ||
|
|
||
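| The framework layer is responsible for operator abstraction, construction of compute tasks, and the unified dispatch interface. It does not depend on any device-specific implementation and talks to the underlying backend only through the Backend interface. In this layer each operator is wrapped as a schedulable Task and submitted through the DispatcherManager to the backend of the target device for execution. | ||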
|
|
||
| 2. Ascend Backend Infrastructure Layer | ||
|
|
||
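| This layer is the core of the Ascend Backend. It accepts operator tasks from the framework layer and maps them onto the Ascend runtime for execution. Its main components are: | ||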
|
|
||
| **AscendBackend** | ||
|
|
||
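| - The backend entry point and core management module. It handles backend registration, operator factory management, and binding of the allocator and dispatcher. | ||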
|
|
||
| **AscendDispatcher** | ||
|
|
||
| - The task dispatch and execution module. It drives each operator through the unified lifecycle (reshape / setup / forward). | ||
|
|
||
| **AscendAllocator / AscendMemoryManager** | ||
|
|
||
| - The Ascend device memory management module. It handles allocation and release of Tensor and workspace memory and manages the memory pool. | ||
|
|
||
| **Ascend Ops / AscendCommon** | ||
|
|
||
| - Ascend-specific operator implementations and shared ATB / ACL utility wrappers that hide the details of the underlying runtime. | ||
|
|
||
| 3. Ascend Runtime Layer | ||
|
|
||
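| The runtime layer is provided by Ascend CANN. It includes the ATB operator library, the ACL execution interfaces, and execution context and stream management. | ||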
|
|
||
| 4. Ascend Hardware Layer | ||
|
|
||
| The lowest layer is the Ascend NPU hardware, which performs the actual computation. | ||
|
|
||
| Execution Flow (Single-Operator Path) | ||
| ------------------------------------- | ||
|
|
||
| 1. Ascend Backend initialization | ||
|    - The Context registers the Backend, Allocator, and Dispatcher | ||
|
|
||
| 2. Input Tensor preparation | ||
|    - Allocate the Ascend Tensors | ||
|    - Copy Host → Device | ||
|
|
||
| 3. Build and submit the operator task | ||
|    - Create the Ascend Op and the Task | ||
|    - Submit the Task to the Dispatcher | ||
|
|
||
| 4. Execute the operator on Ascend | ||
|    - reshape, setup, forward | ||
|    - forward then calls ATB Operation Execute | ||
|
|
||
| 5. Return results and release resources | ||
|    - Copy Device → Host and verify | ||
|    - Release Tensor resources | ||
|
|
||
| Operator Support and Mapping | ||
| ---------------------------- | ||
|
|
||
| Supported Operators | ||
| ~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The current version of the Ascend Backend aims to validate the end-to-end execution path and implements ATB-based support for the **Add operator**. | ||
|
|
||
| Operator Mapping Strategy | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
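| In the Ascend Backend, framework operators do not depend directly on the underlying runtime implementation. Instead, they are mapped uniformly through the backend operator layer. | ||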
|
|
||
| Future Extensions | ||
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Once the current single-operator execution path is stable, the Ascend Backend will gradually broaden its operator coverage and execution modes. | ||
|
|
||
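| Adding a New Operator | ||
| ~~~~~~~~~~~~~~~~~~~~~ | ||
| Step 1: Confirm ATB support and operator constraints | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| - Confirm that ATB supports the target operator type and the corresponding parameter structure | ||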
|
|
||
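| Step 2: Implement the Ascend operator class | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| - Define the operator class in the Ascend backend | ||
| - Implement the unified operator lifecycle: reshape → setup → forward | ||
| - Call the ATB single operator in the forward stage to perform the execution | ||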
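The lifecycle in Step 2 can be modelled in a few lines. The following is only a sketch, written in Python for brevity even though the backend itself is C++. The names `Op`, `AddOp`, and `run_op` are hypothetical, tensors are plain lists, and `forward` performs the addition on the host where a real operator would call into ATB.

```python
from abc import ABC, abstractmethod


class Op(ABC):
    """Hypothetical base class for the reshape -> setup -> forward lifecycle."""

    @abstractmethod
    def reshape(self, inputs):
        """Validate input shapes and return the output lengths."""

    @abstractmethod
    def setup(self, inputs, out_lengths):
        """Allocate and return output buffers (device memory in a real backend)."""

    @abstractmethod
    def forward(self, inputs, outputs):
        """Run the computation (an ATB operation call in a real backend)."""


class AddOp(Op):
    def reshape(self, inputs):
        a, b = inputs
        if len(a) != len(b):
            raise ValueError("elementwise add needs equally sized inputs")
        return [len(a)]

    def setup(self, inputs, out_lengths):
        return [[0.0] * n for n in out_lengths]

    def forward(self, inputs, outputs):
        a, b = inputs
        for i, (x, y) in enumerate(zip(a, b)):
            outputs[0][i] = x + y


def run_op(op, inputs):
    """Drive one operator through the three stages in order."""
    out_lengths = op.reshape(inputs)
    outputs = op.setup(inputs, out_lengths)
    op.forward(inputs, outputs)
    return outputs
```

Keeping shape inference, allocation, and execution in separate stages means a dispatcher can call the same three methods on any operator without knowing what it computes.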
|
|
||
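| Step 3: Register the operator and connect it to the dispatch path | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| - Register the new operator with AscendBackend and complete the related adaptation | ||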
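Registration in Step 3 amounts to mapping an operator type to a factory that the backend can look up later. The sketch below illustrates that idea in Python. `BackendSketch`, `register_op`, `create_op`, and `AddOpSketch` are hypothetical names chosen for this example and are not the mLLM API.

```python
class BackendSketch:
    """Hypothetical stand-in for a backend that owns operator factories."""

    def __init__(self):
        self._factories = {}

    def register_op(self, op_type, factory):
        # Refuse duplicates so a second registration cannot silently win.
        if op_type in self._factories:
            raise ValueError(f"{op_type!r} is already registered")
        self._factories[op_type] = factory

    def create_op(self, op_type, **options):
        if op_type not in self._factories:
            raise NotImplementedError(f"no Ascend implementation for {op_type!r}")
        return self._factories[op_type](**options)


class AddOpSketch:
    """Placeholder operator; a real one would implement reshape/setup/forward."""

    def __init__(self, name="add"):
        self.name = name


backend = BackendSketch()
backend.register_op("Add", AddOpSketch)
op = backend.create_op("Add", name="add0")
```

Asking the backend for an unregistered type raises an error here, which makes a missing operator visible at task-construction time rather than during execution.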
|
|
||
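| Step 4: Test and validate | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| - Build a minimal example that runs the new operator on an Ascend device | ||
| - Compare against the CPU reference result to verify correctness | ||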
|
|
||
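| Testing Operator Results | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| - Baseline: run the same input on the CPU or a reference implementation to obtain the expected output. | ||
| - Input construction: cover typical dimensions and boundary sizes, and fix the random seed to avoid nondeterminism. | ||
| - Error metric: choose the criterion by dtype, for example relative/absolute tolerance (rtol/atol) for floating point and exact equality for integer types. | ||
| - Data movement: compare again after the Host→Device / Device→Host copies to catch transfer or alignment problems. | ||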
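The checks listed above can be expressed as a small comparison helper. This is a sketch in Python that assumes the common `|actual - expected| <= atol + rtol * |expected|` criterion for floating point. The function names, the dtype strings, and the default tolerances are illustrative assumptions, not values taken from the test suite.

```python
import random


def make_inputs(n, seed=0):
    """Build a reproducible list of n floats in [-1, 1]."""
    rng = random.Random(seed)  # fixed seed keeps failures repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def matches_reference(actual, expected, dtype, rtol=1e-3, atol=1e-5):
    """Compare a device result against the CPU reference, by dtype."""
    if len(actual) != len(expected):
        return False
    if dtype.startswith("int"):
        # Integer types must match exactly.
        return list(actual) == list(expected)
    # Floating point: combined absolute and relative tolerance.
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


a, b = make_inputs(8, seed=0), make_inputs(8, seed=1)
reference = [x + y for x, y in zip(a, b)]  # CPU reference result
device_out = list(reference)               # stand-in for data copied back from the device
print(matches_reference(device_out, reference, "float32"))
```

Lower-precision dtypes such as float16 usually need looser tolerances than float32, so in practice `rtol` and `atol` would be chosen per dtype.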
|
|
||
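| Memory and Data Management | ||
| -------------------------- | ||
| - AscendMemoryManager (singleton): creates an independent memory pool per device; it currently queries ``aclrtGetDeviceCount`` and allocates one pool for each device. | ||
| - AscendMemoryPool: pre-allocates one large region with ``aclrtMalloc(..., ACL_MEM_MALLOC_HUGE_FIRST)`` and tracks the base/cur pointers and the remaining space. | ||
| - Block allocation strategy: the request is first aligned to 32 B; a block from free_blocks with enough space is reused if one exists, otherwise a new block is carved linearly from the pool (64 B aligned). An incrementing block id is returned. | ||
| - Thread safety: allocation, release, and pointer lookup all hold a lock. | ||
| - Multi-device scheduling: the current device id selects the matching memory pool, which keeps memory isolated in multi-device setups. | ||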
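The block allocation strategy above can be illustrated with a toy pool that hands out offsets instead of device pointers. This Python sketch makes several assumptions that the text does not confirm: a single 64 B alignment, first-fit reuse of whole free blocks without splitting, and the names `PoolSketch`, `allocate`, `free`, and `offset_of`.

```python
import threading


def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


class PoolSketch:
    """Toy model of a pre-allocated pool; offsets stand in for device pointers."""

    def __init__(self, capacity, alignment=64):
        self.capacity = capacity
        self.alignment = alignment
        self.cur = 0            # next unused offset in the linear region
        self.next_id = 0        # block ids only ever increase
        self.used = {}          # block id -> (offset, size)
        self.free_blocks = []   # released (offset, size) pairs
        self.lock = threading.Lock()

    def allocate(self, size):
        size = align_up(size, self.alignment)
        with self.lock:
            # Prefer a released block that is large enough (first fit).
            for i, (offset, block_size) in enumerate(self.free_blocks):
                if block_size >= size:
                    del self.free_blocks[i]
                    return self._record(offset, block_size)
            # Otherwise carve a new block from the linear region.
            if self.cur + size > self.capacity:
                raise MemoryError("pool exhausted")
            offset = self.cur
            self.cur += size
            return self._record(offset, size)

    def _record(self, offset, size):
        block_id = self.next_id
        self.next_id += 1
        self.used[block_id] = (offset, size)
        return block_id

    def free(self, block_id):
        with self.lock:
            self.free_blocks.append(self.used.pop(block_id))

    def offset_of(self, block_id):
        with self.lock:
            return self.used[block_id][0]
```

A manager for several devices could keep one such pool per device id and pick the pool by the current device, which is the isolation property the last bullet describes.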
|
|
||
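| Performance and Tuning | ||
| ---------------------- | ||
| - Memory pool reuse: AscendMemoryPool pre-allocates a large chunk of device memory and carves or reuses blocks from it, which reduces the fragmentation and overhead of frequent aclrtMalloc/aclrtFree calls. | ||
| - Alignment and access: memory blocks are aligned to 32 B/64 B to satisfy ATB/ACL alignment requirements while keeping memory access efficient. | ||
| - Simplified execution path: the focus is currently on the single-operator path, validating end-to-end correctness and the stability of the memory and data paths as groundwork for later multi-operator and multi-stream parallelism. | ||
| - Log-based observation: AscendCommon and the unified logging system record key events such as memory allocation and operator execution, which supports basic performance and resource-usage analysis. | ||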
|
|
||
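| Testing and Validation | ||
| ---------------------- | ||
| - Correctness tests: compute the result on the CPU or a reference implementation, run the same input on the Ascend Backend, and compare the values. | ||
| - Timing tests: record the execution time of each step of an operator. | ||
| - End-to-end validation: run the full path in the example project while watching both the output and the elapsed time, to confirm stable behavior with the dispatcher, memory pool, and runtime combined. | ||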
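Per-step timing can be collected with a small helper around each stage. The sketch below is in Python and runs on the host only. The `timed` helper and the step names are hypothetical, and the list arithmetic stands in for the real setup and forward stages.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step, timings):
    """Record the wall-clock duration of the enclosed block under `step`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start


timings = {}
a, b = [1.0] * 1024, [2.0] * 1024

with timed("setup", timings):
    out = [0.0] * len(a)          # stand-in for output allocation

with timed("forward", timings):
    for i in range(len(a)):       # stand-in for the device computation
        out[i] = a[i] + b[i]

for step, seconds in timings.items():
    print(f"{step}: {seconds * 1e3:.3f} ms")
```

On a real device, work submitted to a stream generally runs asynchronously, so the stream would need to be synchronized (for example with `aclrtSynchronizeStream`) inside the timed region for the forward measurement to include the actual execution.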
|
|
||
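| Future Extensions | ||
| ----------------- | ||
| - Support more operators. | ||
| - More complete profiling and visualization. | ||
| - Context/graph-level caching to reduce repeated creation. |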
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| Ascend Backend | ||
| ==================== | ||
|
|
||
| .. toctree:: | ||
|    :maxdepth: 2 | ||
|
|
||
|    core_design | ||
|
|
||
|
|