[ROCm][Windows] Fixing undefined symbol linker error after exposing MIOpen symbols #2415

Open
wants to merge 33 commits into base: release/2.8

Conversation

tvukovic-amd

Changing the TORCH_CUDA_CPP_API macros to TORCH_HIP_CPP in the MIOpen header files, as in the closed PR in PyTorch upstream.

jithunnair-amd and others added 30 commits July 18, 2025 01:06
(cherry picked from commit e294d4d with
modifications for release/2.8)

Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date
Cherry-pick of #2130

Validation:
http://rocm-ci.amd.com/job/rocm-pytorch-manylinux-wheel-builder-lw/155/

---------

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 2c220b2)
…ersion

Since we need to rebuild magma for all supported architectures, we cannot
use the upstream magma tarball anyway
…for py3.9;

upgrade tensorboard to be compatible with numpy 2

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
From upstream PR pytorch#154900
Resolves: https://ontrack-internal.amd.com/browse/SWDEV-536994

After following the steps to reproduce in the container
**registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16231_ubuntu22.04_py3.10_pytorch_lw_release2.7_no_user_66a18277**:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# history
    1  cd /var/lib/jenkins/pytorch
    2  TEST_CONFIG=default CONTINUE_THROUGH_ERROR=True .ci/pytorch/test.sh

 Name:                    gfx90a
  Marketing Name:          AMD Instinct MI250X/MI250
+ MAYBE_ROCM=rocm/
+ [[ rocm == *xpu* ]]
+ [[ rocm != *-bazel-* ]]
+ pip_install ninja==1.10.2
+ pip_install_pkg='python3 -m pip install --progress-bar off'
+ python3 -m pip install --progress-bar off ninja==1.10.2
Collecting ninja==1.10.2
  Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB)
Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB)
Installing collected packages: ninja
  Attempting uninstall: ninja
    Found existing installation: ninja 1.11.1.4
    Uninstalling ninja-1.11.1.4:
      Successfully uninstalled ninja-1.11.1.4
Successfully installed ninja-1.10.2
+ export PATH=/root/.local/bin:/opt/venv/bin:/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/root/.local/bin:/opt/venv/bin:/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ [[ rocm == *aarch64* ]]
+ [[ rocm == *asan* ]]
+ [[ rocm == *-debug* ]]
+ [[ rocm != *-bazel-* ]]
+ echo 'We are not in debug mode: rocm. Expect the assertion to pas
```
http://rocm-ci.amd.com/job/mainline-pytorch2.7-manylinux-wheels/126/

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 0bd4030)
Cherry-pick of #2328

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit fe3d37a)
(cherry picked from commit ba1ba26)
(cherry picked from commit 4e3462e)
This PR enables NHWC batchnorm on MIOpen in the release/2.6 branch.

ROCm version >= 6.5 and the `PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1`
environment variable are required to enable NHWC batchnorm.
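
For illustration only (not part of this PR): a minimal sketch of exercising the NHWC batchnorm path, assuming a ROCm build of PyTorch with a visible GPU and the environment variable exported before the process starts.

```
# Minimal sketch (assumes PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1 is exported
# in the shell and this is a ROCm >= 6.5 build of PyTorch).
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64).cuda()
# channels_last memory format corresponds to the NHWC layout
x = torch.randn(8, 64, 32, 32, device="cuda").to(memory_format=torch.channels_last)
out = bn(x)
print(out.is_contiguous(memory_format=torch.channels_last))
```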

This PR branch for `release/2.6` was built and tested using docker
image:
`compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:15845_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_8190c80`.

New batchnorm tests introduced:
train:
```
test_batchnorm_train_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.040s)
test_batchnorm_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.007s)
test_batchnorm_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_train_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.089s)
test_batchnorm_train_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.020s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
```

inference:
```
test_batchnorm_inference_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.025s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.102s)
test_batchnorm_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.018s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
```

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 45896ac)
(cherry picked from commit 7010d60)
NHWC batchnorm is enabled by default if ROCm >= 7.0

(cherry picked from commit e0afc3a)
…m specific skips to generalized conditions (#2126)

Cherry-pick of #2100
Need to resolve conflicts

---------

Co-authored-by: iupaikov-amd <Iurii.Paikov@amd.com>
(cherry picked from commit f0c1ce8)
…and batchnorm (#2232)

Cherry-pick of #2209

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit bf0079d)
Cherry-pick of #2214

Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit 5631e07)
…_rcpf(x) instead of 1.f/x (#1800)

Cherry-pick of #1688

Co-authored-by: Michael Halkenhäuser <michaelhalk@web.de>
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
(cherry picked from commit f8544af)
(cherry picked from commit ed48754)
(cherry picked from commit d62a39e)
Remove `xfail` from the `batch_norm_with_update` op in `test_grad` and
`test_vmap_autograd_grad`.

These tests have been passing since ROCm 6.4.

Fixes https://ontrack-internal.amd.com/browse/SWDEV-529820

(cherry picked from commit 99b0758)
(cherry picked from commit a7044a4)
(cherry picked from commit 3fc00a8)
Ported the mx fp8 part from #2046.

Current test stats (counting only blockwise scale tests):
PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v

Ran 225 tests in 8.256s
FAILED (failures=1, skipped=150)
_74 tests pass_

**fp8 mx data type sample test case.**

test_blockwise_mxfp8_numerics_test_case_name_data_random_scales_one_fast_accum_True_512_128_256_cuda
(__main__.TestFP8MatmulCudaCUDA)
hipblaslt-bench --api_method c -m 256 -n 512 -k 128 --lda 128 --ldb 128
--ldc 256 --ldd 256 --stride_a 0 --stride_b 0 --stride_c 0 --stride_d 0
--alpha 1 --beta 0 --transA T --transB N --batch_count 1 --scaleA 3
--scaleB 3 --a_type f8_r --b_type f8_r --c_type bf16_r --d_type bf16_r
--compute_type f32_r --algo_method index --solution_index -2146957310
--rotating 0 --cold_iters 0 --iters 0

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
(cherry picked from commit d17e222)
Update CK for gfx950 (#49)

(cherry picked from commit 8ccfc47)
(cherry picked from commit b5d5987)
…ilure (#2204)

- Previously, expected values were calculated on the GPU using the same dtype as the result values
- Now, expected values are calculated on the CPU using the Float32 dtype (see the sketch below)
- This fixes a test failure observed on Navi48 where the difference between Eager mode (expected) and Inductor / Triton (result) did not meet the error tolerance when the sum was evaluated on an array of Float16 values
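
For illustration, a minimal sketch of the comparison pattern described above (hypothetical shapes and tolerances, not the actual test code):

```
# Sketch: compare a float16 GPU reduction against a float32 reference on CPU.
import torch

x = torch.randn(1 << 16, dtype=torch.float16, device="cuda")
result = x.sum()                  # result computed on the GPU in float16
expected = x.float().cpu().sum()  # expected value computed on the CPU in float32
torch.testing.assert_close(result.cpu().float(), expected, rtol=1e-2, atol=1e-1)
```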

Co-authored-by: pnikolic-amd <pnikolic@amd.com>
(cherry picked from commit 8fe3cdd)
(cherry picked from commit 34f3b3e)
…2259)

* A thread_work_size of 16 gives better performance with many workloads

(cherry picked from commit 7edf50c)
Related to
c7a1e32
Fixes https://ontrack-internal.amd.com/browse/SWDEV-537835

Not a Navi-specific failure:
```
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1412, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1671, in test_cuda_tensor_pow_scalar_tensor
    self._test_pow(base, exp)
  File "/var/lib/jenkins/pytorch/test/test_binary_ufuncs.py", line 1482, in _test_pow
    self.assertEqual(actual, expected)
  File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 4052, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: The values for attribute 'dtype' do not match: torch.float32 != torch.float64.
```

Using `.to(actual)` without specifying dtype/device assumes `actual` is a
tensor or tensor-like, which may fail silently or promote. Fixed by
explicitly matching dtype and device, following
pytorch#107302
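
For illustration, a minimal sketch of the explicit dtype/device matching described above (hypothetical tensors, not the test's actual code):

```
import torch

actual = torch.tensor([2.0], dtype=torch.float32, device="cuda")
expected = torch.tensor([2.0], dtype=torch.float64)  # CPU float64 reference

# Instead of expected.to(actual), match dtype and device explicitly so the
# comparison neither silently promotes nor lands on the wrong device:
expected = expected.to(dtype=actual.dtype, device=actual.device)
assert expected.dtype == actual.dtype and expected.device == actual.device
```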
Fix:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch# TEST_CONFIG=default HIP_VISIBLE_DEVICES=0 PYTORCH_TEST_WITH_ROCM=1 python test/test_binary_ufuncs.py TestBinaryUfuncsCUDA.test_cuda_tensor_pow_scalar_tensor_cuda
/opt/conda/envs/py_3.12/lib/python3.12/site-packages/hypothesis/entry_points.py:23: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources

Running tests...
----------------------------------------------------------------------
.
----------------------------------------------------------------------
Ran 1 test in 0.141s

OK

Generating XML reports...
root@ubb4-rack-22:/var/lib/jenkins/pytorch# pip list | grep numpy
numpy                   2.1.2

```

(cherry picked from commit a4d60fa)
Cherry-pick of #2319

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit e725e2e)
This PR fixes the unit test:

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on the gfx1101 arch.

The error comes from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which inflates
torch.cuda.max_memory_reserved() when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction is called
afterward. To avoid this, we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA memory state.
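
For illustration, a minimal sketch of the cleanup pattern described above, run before the affected test (assuming a CUDA/ROCm device is available):

```
import torch

# Release cached allocator blocks and clear peak-memory counters so state
# left over from a previous test does not skew max_memory_reserved().
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
```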

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
jeffdaily and others added 2 commits July 22, 2025 18:41
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.

- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs
  - size_t getCUDABlasLtWorkspaceSize()
  - void* getCUDABlasLtWorkspace()

Fixes #2286.

Pull Request resolved: pytorch#156495
Approved by: https://github.com/eqy

(cherry picked from commit 996206e)
…g torch and numpy tensors (#2362)

Cherry-pick of #2340

Co-authored-by: Dmitry Nikolaev <139769634+dnikolaev-amd@users.noreply.github.com>
(cherry picked from commit 22c98ea)

rocm-repo-management-api bot commented Jul 25, 2025

Jenkins build for 8b2361443564d9d9753124fe9e39e3d8d187e114 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

…IOpen symbols (pytorch#156479)

Fixes the undefined symbol linker error after [exposing MIOpen symbols](pytorch#154545).
This fix:

- Hipifies the `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` files
- Adds the hipified `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` source files to the `all_hip_cpp` list

Pull Request resolved: pytorch#156479
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>