[Fix] Add PADDLE_WITH_DNNL macro detection in PaddleFleet build #691

Open
risemeup1 wants to merge 1 commit into PaddlePaddle:develop from risemeup1:fix_with_mkl_bug

Conversation

@risemeup1
Collaborator

@risemeup1 risemeup1 commented Mar 27, 2026

  • Feature

Removes the restriction that EB code must use a Paddle wheel built with with_mkl=off. That restriction makes packaging expensive: the official Paddle wheel (with_mkl=on) and the EB business wheel (with_mkl=off) cannot be unified. Once this PR is merged, the business can use the official wheel directly, which speeds up release images and reduces package-build maintenance cost.

  • Problem description

When Paddle is compiled with WITH_MKL=ON, the OneDNN library is enabled and the PADDLE_WITH_DNNL macro is defined. This affects the memory layout of DenseTensor; see:

// paddle/phi/core/storage_properties.h
#ifdef PADDLE_WITH_DNNL
struct OneDNNStorageProperties : public StorageProperties {
  dnnl::memory::format_tag format;  // 4 bytes
  dnnl::memory::desc mem_desc;       // 696 bytes
};
#endif
The object pointed to by DenseTensor::storage_properties_ has a different size depending on which side compiled the code:

Paddle side: may point to an OneDNNStorageProperties (708 bytes)
PaddleFleet side: only knows the base class StorageProperties (8 bytes)

This ABI mismatch means PaddleFleet custom operators may crash or hit undefined behavior when accessing a DenseTensor passed in from Paddle.
  • Fix

Add the -DPADDLE_WITH_DNNL macro definition to both nvcc_args and cxx_args in setup.py.
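A minimal sketch of the approach (illustrative only: the helper name below is hypothetical, while paddle.core.is_compiled_with_onednn() is the probe this PR's review says is used in setup.py):

```python
def maybe_add_dnnl_define(nvcc_args, cxx_args, compiled_with_onednn):
    """Append -DPADDLE_WITH_DNNL to both flag lists when the installed
    Paddle was built with OneDNN, so the extension and the main library
    agree on the DenseTensor-related layout assumptions."""
    if compiled_with_onednn:
        for args in (nvcc_args, cxx_args):
            if "-DPADDLE_WITH_DNNL" not in args:
                args.append("-DPADDLE_WITH_DNNL")
    return nvcc_args, cxx_args

# In setup.py the flag would come from the real probe, e.g.:
#   import paddle
#   compiled = paddle.core.is_compiled_with_onednn()
```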

Copilot AI review requested due to automatic review settings March 27, 2026 07:55
Contributor

Copilot AI left a comment


Pull request overview

This PR adds the PADDLE_WITH_DNNL macro automatically based on Paddle's build configuration, to avoid an ABI mismatch between extension operators and Paddle when OneDNN is enabled; it also adds two new scripts related to building and tool installation.

Changes:

  • setup.py detects paddle.core.is_compiled_with_onednn() and adds -DPADDLE_WITH_DNNL to the compile flags when needed
  • Adds ljd.sh, a one-shot script for building the NVSHMEM wheel
  • Adds install.sh, an installer script for uv

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File Description
setup.py Appends PADDLE_WITH_DNNL to the extension compile flags depending on whether Paddle was built with OneDNN
ljd.sh New script for building the NVSHMEM wheel from source (includes environment setup and packaging logic)
install.sh New third-party uv installer script (MIT licensed; downloads and installs the binary)

ljd.sh Outdated
Comment on lines +70 to +76
if [ -z "$cuda_home" ] || [ -z "$cuda_major_ver" ]; then
    err "Could not auto-detect the CUDA version. Install the CUDA Toolkit or set the CUDA_HOME environment variable manually"
fi

CUDA_CONFIGS=("${cuda_major_ver}:${cuda_home}")
log "Auto-detected CUDA ${cuda_full_ver} (major version: ${cuda_major_ver}), CUDA_HOME=${cuda_home}"
}

Copilot AI Mar 27, 2026


The comment says "the CUDA version is controlled via the CUDA_CONFIGS variable", but the script unconditionally calls detect_cuda at global scope and forcibly sets CUDA_CONFIGS=... inside the function, overwriting any CUDA_CONFIGS the user set in advance, and it can only ever produce a single CUDA configuration. Suggestion: only run detect_cuda/the assignment when CUDA_CONFIGS is empty, or support multiple user-supplied CUDA_CONFIGS entries.

ljd.sh Outdated
Comment on lines +325 to +327
echo "Install with:"
echo "  pip uninstall nvidia-nvshmem-cu12 -y"
echo "  pip install output/paddle_nvidia_nvshmem_cu12-*.whl"

Copilot AI Mar 27, 2026


The install instructions printed at the end of the script say pip uninstall nvidia-nvshmem-cu12 -y, but the package this script actually builds is named paddle-nvidia-nvshmem-cu${cuda_ver} (pkg_name). Following the printed commands would uninstall the wrong package or fail to install. Suggestion: make the printed commands match the actual package/wheel name and generate them dynamically from cuda_ver.

Suggested change
echo "Install with:"
echo "  pip uninstall nvidia-nvshmem-cu12 -y"
echo "  pip install output/paddle_nvidia_nvshmem_cu12-*.whl"
echo "Install with (examples per CUDA version):"
for cuda_config in "${CUDA_CONFIGS[@]}"; do
    local cuda_ver="${cuda_config%%:*}"
    local pkg_name="paddle-nvidia-nvshmem-cu${cuda_ver}"
    local wheel_prefix="${pkg_name//-/_}"
    echo "  # for CUDA ${cuda_ver}"
    echo "  pip uninstall ${pkg_name} -y"
    echo "  pip install output/${wheel_prefix}-*.whl"
    echo ""
done

install.sh Outdated
Comment on lines +1 to +26
#!/bin/sh
# shellcheck shell=dash
# shellcheck disable=SC2039 # local is non-POSIX
#
# Licensed under the MIT license
# <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
# option. This file may not be copied, modified, or distributed
# except according to those terms.

# This runs on Unix shells like bash/dash/ksh/zsh. It uses the common `local`
# extension. Note: Most shells limit `local` to 1 var per line, contra bash.

# Some versions of ksh have no `local` keyword. Alias it to `typeset`, but
# beware this makes variables global with f()-style function syntax in ksh93.
# mksh has this alias by default.
has_local() {
# shellcheck disable=SC2034 # deliberately unused
local _has_local
}

has_local 2>/dev/null || alias local=typeset

set -u

APP_NAME="uv"
APP_VERSION="0.10.2"

Copilot AI Mar 27, 2026


This install.sh is actually the third-party uv installer script (MIT licensed; it downloads and installs uv). It has no direct relation to the "PADDLE_WITH_DNNL macro detection" change in this PR's title/description, and nothing in the repository currently references it. Suggestion: if it is really needed, move it into third_party/ and document its use in the docs/CI; otherwise remove it from this PR to avoid pulling in an unrelated, sizeable script and the supply-chain risk it carries.

ljd.sh Outdated

# Prefer detecting via nvcc
if command -v nvcc &>/dev/null; then
cuda_full_ver=$(nvcc --version | grep -oP 'release \K[0-9]+\.[0-9]+')

Copilot AI Mar 27, 2026


detect_cuda parses the nvcc version with grep -oP, which depends on GNU grep's PCRE support; on some distributions/minimal images this is unavailable and version detection fails. Suggestion: switch to a more portable parsing approach (e.g. basic regular expressions with sed/awk).

Suggested change
cuda_full_ver=$(nvcc --version | grep -oP 'release \K[0-9]+\.[0-9]+')
# Parse the nvcc version with sed's basic regex, avoiding GNU grep's PCRE extension
cuda_full_ver=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1)

ljd.sh Outdated
Comment on lines +18 to +25
# build_all_wheels.sh
# One-shot build of the NVSHMEM wheel from scratch (paddle-nvidia-nvshmem-cu{CUDA_VER})
#
# Usage:
#   bash build_all_wheels.sh
#
# Supports:
#   - CUDA versions: 12, 13 (controlled via the CUDA_CONFIGS variable)

Copilot AI Mar 27, 2026


The file is named ljd.sh, but the script header/usage says build_all_wheels.sh (and tells the user to run bash build_all_wheels.sh). This will mislead users and future maintainers. Suggestion: make the script name consistent with the comments/usage, or write the actual file name in the comments.

Suggested change
# build_all_wheels.sh
# One-shot build of the NVSHMEM wheel from scratch (paddle-nvidia-nvshmem-cu{CUDA_VER})
#
# Usage:
#   bash build_all_wheels.sh
#
# Supports:
#   - CUDA versions: 12, 13 (controlled via the CUDA_CONFIGS variable)
# ljd.sh - One-shot build of the NVSHMEM wheel from scratch (paddle-nvidia-nvshmem-cu{CUDA_VER})
#
# Usage:
#   bash ljd.sh
#
# Supports:
#   - CUDA versions: 12, 13 (controlled via the CUDA_CONFIGS variable)

install.sh Outdated
Comment on lines +5 to +8
# Licensed under the MIT license
# <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
# option. This file may not be copied, modified, or distributed
# except according to those terms.

Copilot AI Mar 27, 2026


The script declares itself MIT licensed (and references "<LICENSE-MIT ...>"), but the repository root currently contains only an Apache 2.0 LICENSE file and no corresponding MIT license text or attribution. To satisfy the MIT distribution requirements, add the third-party license file (e.g. place LICENSE-MIT under third_party/uv/ and keep the copyright notice), or do not vendor this script in the repository at all.

Suggested change
# Licensed under the MIT license
# <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
# option. This file may not be copied, modified, or distributed
# except according to those terms.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

install.sh Outdated
Comment on lines +1 to +3
#!/bin/sh
# shellcheck shell=dash
# shellcheck disable=SC2039 # local is non-POSIX

Copilot AI Mar 27, 2026


A script named install.sh at the repository root is easily mistaken for PaddleFleet's own install entry point (when it is actually the uv installer). Suggestion: use a clearly named file (e.g. uv_install.sh), or put it under third_party/uv/ to avoid confusion with the project's own install/build scripts.

ljd.sh Outdated
Comment on lines +86 to +87
export PIP_INDEX_URL=${PIP_INDEX_URL:-"https://pypi.tuna.tsinghua.edu.cn/simple"}
export PIP_TRUSTED_HOST=${PIP_TRUSTED_HOST:-"pypi.tuna.tsinghua.edu.cn"}

Copilot AI Mar 27, 2026


The script defaults PIP_INDEX_URL/PIP_TRUSTED_HOST to the Tsinghua mirror. That is an environment/region-specific default and may hurt reproducibility for external users or CI (e.g. intranet or air-gapped environments). Suggestion: do not set a default mirror inside the script, or use one only when the user supplies it explicitly.

Suggested change
export PIP_INDEX_URL=${PIP_INDEX_URL:-"https://pypi.tuna.tsinghua.edu.cn/simple"}
export PIP_TRUSTED_HOST=${PIP_TRUSTED_HOST:-"pypi.tuna.tsinghua.edu.cn"}
# To use a custom pip index, set PIP_INDEX_URL/PIP_TRUSTED_HOST before running the script

ljd.sh Outdated
Comment on lines +34 to +35
export http_proxy=agent.baidu.com:8188
export https_proxy=agent.baidu.com:8188

Copilot AI Mar 27, 2026


The script hardcodes and exports http_proxy/https_proxy (agent.baidu.com:8188) inside the repository. This makes builds fail outright outside the corporate network/CI, and may leak internal network configuration. Suggestion: delete these two lines, or make them optional via environment variables (e.g. only pass them through when already set).

Suggested change
export http_proxy=agent.baidu.com:8188
export https_proxy=agent.baidu.com:8188

From00 previously approved these changes Mar 27, 2026
Collaborator

@From00 From00 left a comment


LGTM

Contributor

@ShigureNyako ShigureNyako left a comment


I'm going to request changes on this.

The problem with this version isn't that PaddleFleet is missing a macro; it's that the Paddle side does not export "custom extensions must follow the main library's oneDNN ABI / header layout" as a stable interface. Hardcoding -DPADDLE_WITH_DNNL directly in Fleet has already turned CI red: Build Fleet whl fails in multiple places with fatal error: dnnl.hpp: No such file or directory

I checked against the main Paddle repo; a few key points:

  1. paddle/phi/core/dense_tensor.h has already moved the large oneDNN objects behind std::unique_ptr<StorageProperties>, and its comments state explicitly that this is to prevent DenseTensor's layout from drifting under different compile macros. So this PR's framing of "DenseTensor's ABI layout diverges with with_mkl/onednn" is not the most solid angle today.
  2. python/paddle/utils/cpp_extension/extension_utils.py currently only adds the Paddle/CUDA/HIP includes/flags uniformly; it does not export oneDNN's include paths or defines as part of the custom-extension interface.
  3. Paddle's own test/auto_parallel/custom_op/utils.py still has to manually add ${ONEDNN_INSTALL_DIR}/include when WITH_ONEDNN=ON, which shows upstream also knows a downstream repo cannot keep the oneDNN headers working just by guessing a macro.
  4. Also, Paddle's current packaging logic already uses third_party/install/onednn as the oneDNN install directory, yet the header install-path cleanup still keeps the old install/mkldnn/include/ pattern, which suggests the wheel's header layout/exposure itself still carries historical compatibility issues.

So my conclusions:

  • Do not hardcode -DPADDLE_WITH_DNNL in PaddleFleet alone.
  • The right place to fix this is the main Paddle repo. It is more reasonable for Paddle to fix cpp_extension / the wheel header layout in one place, exporting the include paths and compile defines oneDNN needs according to what the wheel actually ships; ideally downstream consumes a stable interface instead of maintaining internal macros itself.
  • If Fleet absolutely must carry a short-term stopgap, it can only be "a conditional check based on what the installed Paddle actually provides, plus conditionally adding the include path"; it must not hardcode "the official wheel is always with_mkl=on".

I suggest not merging this PR in its current direction. Better to first add the upstream interface in Paddle and then have PaddleFleet consume it; otherwise this spot will keep carrying ABI drift, misplaced layering responsibilities, and ongoing maintenance cost.
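The conditional stopgap described above might be sketched as follows (a rough illustration only: the candidate directories and probing logic are assumptions, not a stable Paddle interface, and would need verifying against the actual wheel layout):

```python
import os

def dnnl_flags_if_available(paddle_root, compiled_with_onednn):
    """Return (extra_compile_args, extra_include_dirs).

    Only define PADDLE_WITH_DNNL when the installed Paddle both reports
    OneDNN support and actually ships dnnl.hpp, so the build never bakes
    in an assumption about how the official wheel was configured.
    The candidate locations below are illustrative guesses.
    """
    if not compiled_with_onednn:
        return [], []
    candidates = [
        os.path.join(paddle_root, "include"),
        os.path.join(paddle_root, "third_party", "install", "onednn", "include"),
    ]
    for inc in candidates:
        if os.path.exists(os.path.join(inc, "dnnl.hpp")):
            return ["-DPADDLE_WITH_DNNL"], [inc]
    return [], []  # headers missing: skip the define instead of breaking the build
```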

setup.py Outdated
"-gencode=arch=compute_90a,code=sm_90a",
"-gencode=arch=compute_100,code=sm_100",
"-DNDEBUG",
"-DPADDLE_WITH_DNNL", # Assume Paddle is compiled with OneDNN for ABI compatibility
Contributor


Hardcoding -DPADDLE_WITH_DNNL here pushes Paddle's internal build details down into a downstream repo, and it has already turned Build Fleet whl red: the compile stage fails in multiple places with fatal error: dnnl.hpp: No such file or directory. I checked the main Paddle repo: cpp_extension currently does not export oneDNN's include/define uniformly, and Paddle's own custom-op tests still manually add ONEDNN_INSTALL_DIR/include when WITH_ONEDNN=ON. This looks more like an extension-interface/header-exposure problem on the Paddle side, not something to hardcode in Fleet alone.

Collaborator

@sneaxiy sneaxiy left a comment


Please fix the DNNL include path in setup.py.



5 participants