[ROCm][Windows] Fixing undefined symbol linker error after exposing MIOpen symbols #2415

Open
wants to merge 33 commits into base: release/2.8

Conversation

tvukovic-amd

Changing the TORCH_CUDA_CPP_API macros to TORCH_HIP_CPP in the MIOpen header files, as in the closed PR in PyTorch upstream.

jithunnair-amd and others added 30 commits July 18, 2025 01:06
(cherry picked from commit e294d4d with
modifications for release/2.8)

Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date
Cherry-pick of #2130

Validation:
http://rocm-ci.amd.com/job/rocm-pytorch-manylinux-wheel-builder-lw/155/

---------

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 2c220b2)
…ersion

Since we need to rebuild magma for all supported architectures, we cannot
use the upstream magma tarball anyway
…for py3.9;

upgrade tensorboard to be compatible with numpy 2

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
From upstream PR pytorch#154900
Resolves: https://ontrack-internal.amd.com/browse/SWDEV-536994

After following the steps to reproduce in the container
**registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16231_ubuntu22.04_py3.10_pytorch_lw_release2.7_no_user_66a18277**:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# history
    1  cd /var/lib/jenkins/pytorch
    2  TEST_CONFIG=default CONTINUE_THROUGH_ERROR=True .ci/pytorch/test.sh

 Name:                    gfx90a
  Marketing Name:          AMD Instinct MI250X/MI250
+ MAYBE_ROCM=rocm/
+ [[ rocm == *xpu* ]]
+ [[ rocm != *-bazel-* ]]
+ pip_install ninja==1.10.2
+ pip_install_pkg='python3 -m pip install --progress-bar off'
+ python3 -m pip install --progress-bar off ninja==1.10.2
Collecting ninja==1.10.2
  Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB)
Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB)
Installing collected packages: ninja
  Attempting uninstall: ninja
    Found existing installation: ninja 1.11.1.4
    Uninstalling ninja-1.11.1.4:
      Successfully uninstalled ninja-1.11.1.4
Successfully installed ninja-1.10.2
+ export PATH=/root/.local/bin:/opt/venv/bin:/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/root/.local/bin:/opt/venv/bin:/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ [[ rocm == *aarch64* ]]
+ [[ rocm == *asan* ]]
+ [[ rocm == *-debug* ]]
+ [[ rocm != *-bazel-* ]]
+ echo 'We are not in debug mode: rocm. Expect the assertion to pas
```
http://rocm-ci.amd.com/job/mainline-pytorch2.7-manylinux-wheels/126/

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 0bd4030)
Cherry-pick of #2328

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit fe3d37a)
(cherry picked from commit ba1ba26)
(cherry picked from commit 4e3462e)
This PR enables NHWC batchnorm on MIOpen in the release/2.6 branch.

ROCm version >= 6.5 and the `PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1`
environment variable are required to enable NHWC batchnorm.
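
For illustration only (not part of this PR): a minimal sketch of exercising the NHWC batchnorm path, assuming a ROCm build of PyTorch with a visible GPU and the environment variable exported before the process starts.

```
# Minimal sketch (assumes PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1 is exported
# in the shell and this is a ROCm >= 6.5 build of PyTorch).
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64).cuda()
# channels_last memory format corresponds to the NHWC layout
x = torch.randn(8, 64, 32, 32, device="cuda").to(memory_format=torch.channels_last)
out = bn(x)
print(out.is_contiguous(memory_format=torch.channels_last))
```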

This PR branch for `release/2.6` was built and tested using docker
image:
`compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:15845_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_8190c80`.

New batchnorm tests introduced:
train:
```
test_batchnorm_train_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.040s)
test_batchnorm_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.007s)
test_batchnorm_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_train_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.089s)
test_batchnorm_train_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.020s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
```

inference:
```
test_batchnorm_inference_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.025s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.102s)
test_batchnorm_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.018s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
```

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 45896ac)
(cherry picked from commit 7010d60)
NHWC batchnorm is enabled by default if ROCm >= 7.0

(cherry picked from commit e0afc3a)
…m specific skips to generalized conditions (#2126)

Cherry-pick of #2100
Need to resolve conflicts

---------

Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>
(cherry picked from commit f0c1ce8)
…and batchnorm (#2232)

Cherry-pick of #2209

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit bf0079d)
Cherry-pick of #2214

Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit 5631e07)
…_rcpf(x) instead of 1.f/x (#1800)

Cherry-pick of #1688

Co-authored-by: Michael Halkenhäuser <michaelhalk@web.de>
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
(cherry picked from commit f8544af)
(cherry picked from commit ed48754)
(cherry picked from commit d62a39e)
Remove `xfail` from the `batch_norm_with_update` op in `test_grad` and
`test_vmap_autograd_grad`.

These tests have been passing since ROCm 6.4.

Fixes https://ontrack-internal.amd.com/browse/SWDEV-529820

(cherry picked from commit 99b0758)
(cherry picked from commit a7044a4)
(cherry picked from commit 3fc00a8)
Ported the mx fp8 part from #2046.

Current test stats (counting only blockwise scale tests):
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v

Ran 225 tests in 8.256s
FAILED (failures=1, skipped=150)
_74 tests pass_

**fp8 mx data type sample test case.**

test_blockwise_mxfp8_numerics_test_case_name_data_random_scales_one_fast_accum_True_512_128_256_cuda
(__main__.TestFP8MatmulCudaCUDA)
hipblaslt-bench --api_method c -m 256 -n 512 -k 128 --lda 128 --ldb 128
--ldc 256 --ldd 256 --stride_a 0 --stride_b 0 --stride_c 0 --stride_d 0
--alpha 1 --beta 0 --transA T --transB N --batch_count 1 --scaleA 3
--scaleB 3 --a_type f8_r --b_type f8_r --c_type bf16_r --d_type bf16_r
--compute_type f32_r --algo_method index --solution_index -2146957310
--rotating 0 --cold_iters 0 --iters 0

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
(cherry picked from commit d17e222)
Update CK for gfx950 (#49)

(cherry picked from commit 8ccfc47)
(cherry picked from commit b5d5987)
…ilure (#2204)

- Previously, expected values were calculated on the GPU using the same dtype as the result values
- Now, expected values are calculated on the CPU using the Float32 dtype (see the sketch below)
- This fixes a test failure observed on Navi48 where the difference between Eager mode (expected) and Inductor / Triton (result) did not meet the error tolerance when the sum was evaluated on an array of Float16 values
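
For illustration, a minimal sketch of the comparison pattern described above (hypothetical shapes and tolerances, not the actual test code):

```
# Sketch: compare a float16 GPU reduction against a float32 reference on CPU.
import torch

x = torch.randn(1 << 16, dtype=torch.float16, device="cuda")
result = x.sum()                  # result computed on the GPU in float16
expected = x.float().cpu().sum()  # expected value computed on the CPU in float32
torch.testing.assert_close(result.cpu().float(), expected, rtol=1e-2, atol=1e-1)
```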

Co-authored-by: pnikolic-amd <pnikolic@amd.com>
(cherry picked from commit 8fe3cdd)
(cherry picked from commit 34f3b3e)
…2259)

* A thread_work_size of 16 gives better performance with many workloads

(cherry picked from commit 7edf50c)
Related to
c7a1e32
Fixes https://ontrack-internal.amd.com/browse/SWDEV-537835

Not a Navi-specific failure:
```
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1412, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1671, in test_cuda_tensor_pow_scalar_tensor
    self._test_pow(base, exp)
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1482, in _test_pow
    self.assertEqual(actual, expected)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 4052, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```

Using `.to(actual)` without specifying dtype/device assumes `actual` is a
tensor or tensor-like, which may fail silently or promote. Fixed by
explicitly matching dtype and device, following
pytorch#107302
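
For illustration, a minimal sketch of the explicit dtype/device matching described above (hypothetical tensors, not the test's actual code):

```
import torch

actual = torch.tensor([2.0], dtype=torch.float32, device="cuda")
expected = torch.tensor([2.0], dtype=torch.float64)  # CPU float64 reference

# Instead of expected.to(actual), match dtype and device explicitly so the
# comparison neither silently promotes nor lands on the wrong device:
expected = expected.to(dtype=actual.dtype, device=actual.device)
assert expected.dtype == actual.dtype and expected.device == actual.device
```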
Fix:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# TEST_CONFIG=default HIP_VISIBLE_DEVICES=0 PYTORCH_TEST_WITH_ROCM=1 python test/test_binary_ufuncs.py TestBinaryUfuncsCUDA.test_cuda_tensor_pow_scalar_tensor_cuda
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/hypothesis/entry_points.py:23: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

Running tests...
----------------------------------------------------------------------
.
----------------------------------------------------------------------
Ran 1 test in 0.141s

OK

Generating XML reports...
root@ubb4-rack-22:/var/lib/jenkins/pytorch# pip list | grep numpy
numpy                   2.1.2

```

(cherry picked from commit a4d60fa)
Cherry-pick of #2319

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit e725e2e)
This PR fixes the unit test:

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on the gfx1101 arch.

The error comes from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which inflates
torch.cuda.max_memory_reserved() when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction is called
afterward. To avoid this, we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA memory state.
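
For illustration, a minimal sketch of the cleanup pattern described above, run before the affected test (assuming a CUDA/ROCm device is available):

```
import torch

# Release cached allocator blocks and clear peak-memory counters so state
# left over from a previous test does not skew max_memory_reserved().
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
```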

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
jeffdaily and others added 2 commits July 22, 2025 18:41
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.

- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs
  - size_t getCUDABlasLtWorkspaceSize()
  - void* getCUDABlasLtWorkspace()

Fixes #2286.

Pull Request resolved: pytorch#156495
Approved by: https://github.com/eqy

(cherry picked from commit 996206e)
…g torch and numpy tensors (#2362)

Cherry-pick of #2340

Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
(cherry picked from commit 22c98ea)

rocm-repo-management-api bot commented Jul 25, 2025

Jenkins build for 8b2361443564d9d9753124fe9e39e3d8d187e114 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

…IOpen symbols (pytorch#156479)

Fixes the undefined symbol linker error after [exposing MIOpen symbols](pytorch#154545).
This fix:

- Hipifies the `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` files
- Adds the hipified `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` source files to the `all_hip_cpp` list

Pull Request resolved: pytorch#156479
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>