[Fix] Add PADDLE_WITH_DNNL macro detection in PaddleFleet build #691
risemeup1 wants to merge 1 commit into PaddlePaddle:develop
Conversation
Pull request overview
This PR automatically adds the PADDLE_WITH_DNNL macro based on Paddle's build configuration, to avoid an ABI incompatibility between extension operators and Paddle when OneDNN is enabled. It also adds two new script files related to building and tool installation.
Changes:
- In setup.py, detect paddle.core.is_compiled_with_onednn() and conditionally append -DPADDLE_WITH_DNNL to the compile flags
- Add ljd.sh, a one-shot build script for the NVSHMEM wheel
- Add install.sh, an installer script for uv
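The conditional macro detection described above could look roughly like the following sketch. The `paddle.core.is_compiled_with_onednn()` call is the one named in this PR's description; the helper function names are illustrative, not code from the PR itself:

```python
# Sketch of the conditional macro logic this PR describes (helper names
# are illustrative assumptions, not the actual setup.py code).

def onednn_flags(is_compiled_with_onednn):
    """Return the extra compile flags needed to match Paddle's OneDNN ABI."""
    # A Paddle built with WITH_MKL=ON defines PADDLE_WITH_DNNL, which affects
    # the DenseTensor layout; extensions must define the same macro to match.
    return ["-DPADDLE_WITH_DNNL"] if is_compiled_with_onednn else []

def detect_onednn():
    """Query the installed Paddle for its build configuration, defaulting to off."""
    try:
        import paddle
        return paddle.core.is_compiled_with_onednn()
    except (ImportError, AttributeError):
        return False

# In setup.py, the result would be appended to both nvcc_args and cxx_args.
extra_flags = onednn_flags(detect_onednn())
```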
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| setup.py | Appends PADDLE_WITH_DNNL to the extension compile flags depending on whether Paddle was built with OneDNN |
| ljd.sh | New script for building the NVSHMEM wheel from source (including environment setup and packaging logic) |
| install.sh | New third-party uv installer script (MIT-licensed; downloads and installs the binary) |
ljd.sh
Outdated
```shell
if [ -z "$cuda_home" ] || [ -z "$cuda_major_ver" ]; then
    err "Unable to auto-detect the CUDA version. Install the CUDA Toolkit or set the CUDA_HOME environment variable manually"
fi

CUDA_CONFIGS=("${cuda_major_ver}:${cuda_home}")
log "Auto-detected CUDA ${cuda_full_ver} (major version: ${cuda_major_ver}), CUDA_HOME=${cuda_home}"
}
```
The comment says "the CUDA version is controlled via the CUDA_CONFIGS variable", but the script calls detect_cuda at global scope and unconditionally sets CUDA_CONFIGS=... inside the function, which overwrites any value the user has pre-set and can only produce a single CUDA configuration. Consider running detect_cuda (and the assignment) only when CUDA_CONFIGS is empty, or supporting multiple user-supplied CUDA_CONFIGS entries.
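A minimal sketch of that guard, with `detect_cuda` as a stand-in for the real detection logic in ljd.sh:

```shell
detect_cuda() {
    # Stand-in for the real detection logic in ljd.sh.
    CUDA_CONFIGS=("12:/usr/local/cuda-12")
}

# Only auto-detect when the user did not pre-set CUDA_CONFIGS.
if [ -z "${CUDA_CONFIGS:-}" ]; then
    detect_cuda
fi
echo "Using CUDA_CONFIGS: ${CUDA_CONFIGS[@]}"
```

A user who exports `CUDA_CONFIGS` before invoking the script then keeps their own configuration(s).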
ljd.sh
Outdated
```shell
echo "Install with:"
echo "  pip uninstall nvidia-nvshmem-cu12 -y"
echo "  pip install output/paddle_nvidia_nvshmem_cu12-*.whl"
```
The install instructions printed at the end of the script say pip uninstall nvidia-nvshmem-cu12 -y, but the package this script actually builds is named paddle-nvidia-nvshmem-cu${cuda_ver} (pkg_name). Following the printed commands would uninstall the wrong package or fail to install the wheel. Make the printed commands match the actual package/wheel names and generate them dynamically from cuda_ver.
Suggested change:
```diff
-echo "Install with:"
-echo "  pip uninstall nvidia-nvshmem-cu12 -y"
-echo "  pip install output/paddle_nvidia_nvshmem_cu12-*.whl"
+echo "Install with (per CUDA version):"
+for cuda_config in "${CUDA_CONFIGS[@]}"; do
+    local cuda_ver="${cuda_config%%:*}"
+    local pkg_name="paddle-nvidia-nvshmem-cu${cuda_ver}"
+    local wheel_prefix="${pkg_name//-/_}"
+    echo "  # for CUDA ${cuda_ver}"
+    echo "  pip uninstall ${pkg_name} -y"
+    echo "  pip install output/${wheel_prefix}-*.whl"
+    echo ""
+done
```
install.sh
Outdated
```shell
#!/bin/sh
# shellcheck shell=dash
# shellcheck disable=SC2039 # local is non-POSIX
#
# Licensed under the MIT license
# <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
# option. This file may not be copied, modified, or distributed
# except according to those terms.

# This runs on Unix shells like bash/dash/ksh/zsh. It uses the common `local`
# extension. Note: Most shells limit `local` to 1 var per line, contra bash.

# Some versions of ksh have no `local` keyword. Alias it to `typeset`, but
# beware this makes variables global with f()-style function syntax in ksh93.
# mksh has this alias by default.
has_local() {
    # shellcheck disable=SC2034 # deliberately unused
    local _has_local
}

has_local 2>/dev/null || alias local=typeset

set -u

APP_NAME="uv"
APP_VERSION="0.10.2"
```
This install.sh is actually the installer script of the third-party project uv (MIT-licensed; it downloads and installs uv). It has no direct relation to the "PADDLE_WITH_DNNL macro detection" change in this PR's title/description, and nothing in the repository currently references it. Suggestion: if it is really needed, move it under third_party/ and document its usage in the docs/CI; otherwise remove it from this PR to avoid pulling in a large unrelated script and its supply-chain risk.
ljd.sh
Outdated
```shell
# Prefer detecting via nvcc
if command -v nvcc &>/dev/null; then
    cuda_full_ver=$(nvcc --version | grep -oP 'release \K[0-9]+\.[0-9]+')
```
detect_cuda parses the nvcc version with grep -oP (which relies on GNU grep's PCRE support); on some distributions and minimal images this is unavailable and version detection fails. Consider a more portable parse (for example, sed/awk with basic regular expressions).
Suggested change:
```diff
-cuda_full_ver=$(nvcc --version | grep -oP 'release \K[0-9]+\.[0-9]+')
+# Parse the nvcc version with sed basic regex to avoid depending on GNU grep's PCRE extension
+cuda_full_ver=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1)
```
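The portable parse can be exercised against a representative line of `nvcc --version` output (the sample string below is illustrative):

```shell
# Feed a typical `nvcc --version` line through the portable sed-based parse.
sample="Cuda compilation tools, release 12.4, V12.4.131"
cuda_full_ver=$(printf '%s\n' "$sample" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1)
# Derive the major version the same way the script builds its cuXX package suffix.
cuda_major_ver="${cuda_full_ver%%.*}"
echo "full=${cuda_full_ver} major=${cuda_major_ver}"
```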
ljd.sh
Outdated
```shell
# build_all_wheels.sh
# One-shot from-scratch build of the NVSHMEM wheel packages (paddle-nvidia-nvshmem-cu{CUDA_VER})
#
# Usage:
#   bash build_all_wheels.sh
#
# Supports:
#   - CUDA versions: 12, 13 (controlled via the CUDA_CONFIGS variable)
```
The file is named ljd.sh, but the script header and usage text refer to build_all_wheels.sh (and tell the user to run bash build_all_wheels.sh). This will mislead users and future maintainers. Align the script name with the comments/usage, or write the actual file name in the comments.
Suggested change:
```diff
-# build_all_wheels.sh
-# One-shot from-scratch build of the NVSHMEM wheel packages (paddle-nvidia-nvshmem-cu{CUDA_VER})
-#
-# Usage:
-#   bash build_all_wheels.sh
-#
-# Supports:
-#   - CUDA versions: 12, 13 (controlled via the CUDA_CONFIGS variable)
+# ljd.sh - one-shot from-scratch build of the NVSHMEM wheel packages (paddle-nvidia-nvshmem-cu{CUDA_VER})
+#
+# Usage:
+#   bash ljd.sh
+#
+# Supports:
+#   - CUDA versions: 12, 13 (controlled via the CUDA_CONFIGS variable)
```
install.sh
Outdated
```shell
# Licensed under the MIT license
# <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
# option. This file may not be copied, modified, or distributed
# except according to those terms.
```
The script declares itself MIT-licensed (and references "<LICENSE-MIT ...>"), but the repository root currently only contains an Apache 2.0 LICENSE file and no corresponding MIT license text or attribution. To satisfy the MIT distribution requirements, either add the third-party license file (for example, place LICENSE-MIT under third_party/uv/ and keep the copyright notice), or do not vendor this script in the repository.
Suggested change:
```diff
-# Licensed under the MIT license
-# <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
-# option. This file may not be copied, modified, or distributed
-# except according to those terms.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```
install.sh
Outdated
```shell
#!/bin/sh
# shellcheck shell=dash
# shellcheck disable=SC2039 # local is non-POSIX
```
Adding a script named install.sh at the repository root makes it easy to mistake for PaddleFleet's own install entry point (when it is actually the uv installer). Use a clearer name (for example, uv_install.sh), or move it into a third_party/uv/ directory to avoid confusion with the project's own install/build scripts.
ljd.sh
Outdated
```shell
export PIP_INDEX_URL=${PIP_INDEX_URL:-"https://pypi.tuna.tsinghua.edu.cn/simple"}
export PIP_TRUSTED_HOST=${PIP_TRUSTED_HOST:-"pypi.tuna.tsinghua.edu.cn"}
```
The script defaults PIP_INDEX_URL/PIP_TRUSTED_HOST to the Tsinghua mirror. This is an environment- and region-specific default that can hurt reproducibility for external users or CI (for example, intranet or air-gapped environments). Do not set a default mirror in the script, or only use one when the user passes it in explicitly.
Suggested change:
```diff
-export PIP_INDEX_URL=${PIP_INDEX_URL:-"https://pypi.tuna.tsinghua.edu.cn/simple"}
-export PIP_TRUSTED_HOST=${PIP_TRUSTED_HOST:-"pypi.tuna.tsinghua.edu.cn"}
+# To use a custom pip index, set PIP_INDEX_URL/PIP_TRUSTED_HOST before running this script
```
ljd.sh
Outdated
```shell
export http_proxy=agent.baidu.com:8188
export https_proxy=agent.baidu.com:8188
```
The script hardcodes and exports http_proxy/https_proxy (agent.baidu.com:8188) in the repository. This makes builds fail outright outside the corporate network or CI, and may leak internal network configuration. Delete these two lines, or make the proxy opt-in via environment variables (for example, only pass them through when already set).
Suggested change:
```diff
-export http_proxy=agent.baidu.com:8188
-export https_proxy=agent.baidu.com:8188
```
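An opt-in variant could only export a proxy when the caller supplies one. This is a sketch; `BUILD_PROXY` is a hypothetical variable name used for illustration, not from the script:

```shell
# Only export proxy settings when the caller has opted in via BUILD_PROXY
# (BUILD_PROXY is a hypothetical variable name, not part of ljd.sh).
if [ -n "${BUILD_PROXY:-}" ]; then
    export http_proxy="${BUILD_PROXY}"
    export https_proxy="${BUILD_PROXY}"
fi
```

Callers inside the corporate network can then run `BUILD_PROXY=agent.baidu.com:8188 bash ljd.sh`, while everyone else builds without a proxy.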
ShigureNyako
left a comment
I'm going to request changes on this.
The problem with this version is not that PaddleFleet forgot a macro; it is that the Paddle side has not exported "custom extensions must follow the main library's oneDNN ABI / header layout" as a stable interface. Hardcoding -DPADDLE_WITH_DNNL directly in Fleet has already turned CI red: Build Fleet whl fails in multiple places with fatal error: dnnl.hpp: No such file or directory.
I checked the main Paddle repository; the key points:
- paddle/phi/core/dense_tensor.h has already moved the large oneDNN object into a std::unique_ptr<StorageProperties>, with a comment explicitly stating this is done to keep the DenseTensor layout from drifting across build macros. So the claim in this PR's description that "the DenseTensor ABI layout differs under with_mkl/onednn" is not the most solid angle today.
- python/paddle/utils/cpp_extension/extension_utils.py currently only adds the Paddle/CUDA/HIP includes and flags uniformly; it does not export oneDNN includes/defines as part of the custom-extension interface.
- Paddle's own test/auto_parallel/custom_op/utils.py still has to manually add ${ONEDNN_INSTALL_DIR}/include when WITH_ONEDNN=ON, which shows that upstream also knows downstream repositories cannot stabilize the oneDNN headers just by guessing a macro.
- Also, Paddle's current packaging logic already installs oneDNN into third_party/install/onednn, yet the header install-path cleanup still keeps the old install/mkldnn/include/ pattern, which strongly suggests the header layout / exposure paths inside the wheel still carry legacy-compatibility issues.
So my conclusions:
- Do not hardcode -DPADDLE_WITH_DNNL in PaddleFleet alone.
- The right place to fix this is the main Paddle repository. It is more reasonable for Paddle to fix the cpp_extension / wheel header layout and export the oneDNN include paths and compile defines according to what the wheel actually ships; downstream should consume a stable interface rather than maintain internal macros.
- If Fleet must have a short-term fallback, it can only be "a conditional check based on Paddle's actual capability, plus conditionally added include paths", not a hardcoded assumption that the official wheel is always with_mkl=on.
I suggest not merging this PR in its current direction. Better to first add the upstream interface in Paddle and then have PaddleFleet consume it; otherwise this spot will keep carrying ABI drift, misplaced layering responsibilities, and ongoing maintenance cost.
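A capability-based fallback of the kind described above might look like this sketch. The probe function and its name are illustrative assumptions, not Paddle API; the idea is simply to enable the macro only when the installed wheel actually ships the header whose absence is currently breaking the build:

```python
import os

def probe_onednn(paddle_include_dirs):
    """Illustrative sketch: enable PADDLE_WITH_DNNL only when the installed
    Paddle wheel actually ships the oneDNN headers the macro implies.
    Returns (extra_include_dirs, extra_defines)."""
    for inc in paddle_include_dirs:
        # dnnl.hpp is the header the failing Fleet builds could not find;
        # only define the macro when it is really present on disk.
        if os.path.exists(os.path.join(inc, "dnnl.hpp")):
            return [inc], ["-DPADDLE_WITH_DNNL"]
    return [], []
```

setup.py would feed this the include directories reported by the installed Paddle and append the results to the compile arguments, instead of assuming the official wheel is always with_mkl=on.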
setup.py
Outdated
```python
    "-gencode=arch=compute_90a,code=sm_90a",
    "-gencode=arch=compute_100,code=sm_100",
    "-DNDEBUG",
    "-DPADDLE_WITH_DNNL",  # Assume Paddle is compiled with OneDNN for ABI compatibility
```
Hardcoding -DPADDLE_WITH_DNNL here pushes Paddle's internal build details down into a downstream repository, and it has already turned Build Fleet whl red: the compile stage fails in multiple places with fatal error: dnnl.hpp: No such file or directory. Checking the main Paddle repository, cpp_extension currently does not export the oneDNN includes/defines uniformly, and Paddle's own custom-op tests still manually add ONEDNN_INSTALL_DIR/include when WITH_ONEDNN=ON. This looks like an extension-interface / header-exposure problem on the Paddle side, not something to hardcode in Fleet alone.
Force-pushed cf232ea to 086e16d
sneaxiy
left a comment
Please fix the DNNL include path in setup.py.
Removing the constraint that the EB code must use a Paddle wheel built with with_mkl=off: that constraint makes packaging expensive, because the official Paddle wheel (with_mkl=on) and the EB business wheel (with_mkl=off) cannot be unified. Once this PR lands, the business can use the official wheel directly, which helps speed up release images and reduces packaging maintenance cost.
When Paddle is built with WITH_MKL=ON, the OneDNN library is enabled and the PADDLE_WITH_DNNL macro is defined. This affects the memory layout of DenseTensor; see the code at:
Add the -DPADDLE_WITH_DNNL macro definition to both nvcc_args and cxx_args in setup.py,