
Conversation

@unverbraucht commented Nov 25, 2025

Based on the work by @zhang-hui-yulo for RDNA4 I attempted to backport the WMMA MMF support to RDNA3. Also ports the RDNA4 WMMA-MMQ improvements by @jiachengjason from PR #17156 to RDNA3.

The differences from RDNA4 are:

  • RDNA3 has no FP8 support in WMMA (INT8 is supported by the hardware)
  • RDNA3 has a different tile size

The results for granite 1b 400m look great:

| GPU | Model | Microbatch size | Test | t/s master | t/s ba25661 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RX 7900 XT | granitemoe ?B F16 | 1 | pp512 | 283.42 | 286.08 | 1.01 |
| RX 7900 XT | granitemoe ?B F16 | 2 | pp512 | 124.99 | 668.81 | 5.35 |
| RX 7900 XT | granitemoe ?B F16 | 4 | pp512 | 205.77 | 1224.45 | 5.95 |
| RX 7900 XT | granitemoe ?B F16 | 8 | pp512 | 377.29 | 1881.51 | 4.99 |
| RX 7900 XT | granitemoe ?B F16 | 16 | pp512 | 640.67 | 3181.89 | 4.97 |
| RX 7900 XT | granitemoe ?B F16 | 32 | pp512 | 1024.92 | 5654.28 | 5.52 |
| RX 7900 XT | granitemoe ?B F16 | 64 | pp512 | 2052.33 | 9817.10 | 4.78 |
| RX 7900 XT | granitemoe ?B F16 | 128 | pp512 | 3622.50 | 15972.81 | 4.41 |
| RX 7900 XT | granitemoe ?B F16 | 256 | pp512 | 6007.40 | 22525.58 | 3.75 |
| RX 7900 XT | granitemoe ?B F16 | 512 | pp512 | 9174.28 | 27815.62 | 3.03 |

EDIT: the performance regression for GPT OSS 20b has been fixed; we now see a moderate speed-up:

| GPU | Model | Microbatch size | Test | t/s 55ab25c | t/s 8aed111 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RX 7900 XT | gpt-oss 20B Q8_0 | 1 | pp512 | 184.01 | 181.51 | 0.99 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 2 | pp512 | 194.39 | 216.68 | 1.11 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 4 | pp512 | 331.38 | 386.97 | 1.17 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 8 | pp512 | 535.49 | 656.94 | 1.23 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 16 | pp512 | 683.39 | 772.00 | 1.13 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 32 | pp512 | 898.96 | 1049.12 | 1.17 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 64 | pp512 | 1089.26 | 1358.59 | 1.25 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 128 | pp512 | 1712.74 | 1935.43 | 1.13 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 256 | pp512 | 2552.29 | 2828.21 | 1.11 |
| RX 7900 XT | gpt-oss 20B Q8_0 | 512 | pp512 | 3298.97 | 3594.53 | 1.09 |

CC @jiachengjason

zhang hui and others added 20 commits November 7, 2025 21:22
  Key Changes Made:

  1. ggml/src/ggml-cuda/common.cuh:
     - Extended AMD_WMMA_AVAILABLE macro to include both RDNA3 and RDNA4
     - Updated amd_wmma_available() to return true for both architectures
  2. ggml/src/ggml-cuda/mma.cuh:
     - Tile structures: added RDNA3-specific tile sizes:
       - RDNA4: 4 half2 = 8 FP16 elements (compact layout)
       - RDNA3: 8 half2 = 16 FP16 elements (duplicated layout required by the hardware)
     - MMA operations: added RDNA3 intrinsics:
       - FP16: __builtin_amdgcn_wmma_f32_16x16x16_f16_w32 (no _gfx12 suffix)
       - BF16: __builtin_amdgcn_wmma_f32_16x16x16_bf16_w32
       - Uses halfx16_t/bf16x16_t for RDNA3 vs. halfx8_t/bf16x8_t for RDNA4
     - Load operations: added conditional handling for 32-byte RDNA3 tiles using two 16-byte copies
  3. ggml/src/ggml-cuda/mmf.cu:
     - Updated to use amd_wmma_available() for both RDNA3 and RDNA4
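
For context, here is a minimal sketch of how the two FP16 paths described above differ. The vector typedefs, the wrapper function, and the architecture guard names are assumptions for illustration only; the intrinsic names and per-lane element counts are the ones from the commit notes.

```cpp
// Sketch only: illustrates the RDNA3 vs. RDNA4 FP16 WMMA difference described above.
// The typedefs and the wrapper are assumptions, not the actual mma.cuh code.
typedef _Float16 halfx8_t  __attribute__((ext_vector_type(8)));
typedef _Float16 halfx16_t __attribute__((ext_vector_type(16)));
typedef float    floatx8_t __attribute__((ext_vector_type(8)));

static __device__ void wmma_f16_sketch(floatx8_t & acc, const void * A, const void * B) {
#if defined(RDNA4)
    // RDNA4 (gfx12): each lane holds 8 FP16 values per input tile (compact layout).
    const halfx8_t a = *(const halfx8_t *) A;
    const halfx8_t b = *(const halfx8_t *) B;
    acc = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12(a, b, acc);
#elif defined(RDNA3)
    // RDNA3 (gfx11): each lane holds 16 FP16 values (duplicated layout), intrinsic has no _gfx12 suffix.
    const halfx16_t a = *(const halfx16_t *) A;
    const halfx16_t b = *(const halfx16_t *) B;
    acc = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a, b, acc);
#endif // defined(RDNA4)
}
```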
@am17an (Collaborator) commented Nov 25, 2025

gpt-oss would not be using the MMF path (it uses MMQ); you might have some variation in your measurements.

@unverbraucht (Author)

@am17an you're right: since we don't have integer WMMA on RDNA3, this should not be using that code path. I might have other commits in my PR that aren't in my master build, or maybe my changes affect the FP16 code path.

I'll look into using the same master build, and also check with other FP16 models.

…use MMQ with integer WMMA operations (hardware-accelerated)
@hjc4869 (Contributor) commented Nov 25, 2025

RDNA3 does have the __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32 intrinsic (V_WMMA_I32_16X16X16_IU8), which is a little different from RDNA4's _gfx12 variant but has the same functionality. It's the same ops/cycle as F16/BF16, though, so it's probably only going to save some registers/bandwidth here and there.
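
For reference, a rough sketch of how the two int8 builtins differ in per-lane operand width (signed inputs, no clamp). The typedefs and exact builtin signatures are my reading of the ISA/compiler docs and should be verified; the key point is 4 dwords per lane on RDNA3 vs. 2 on RDNA4.

```cpp
// Sketch only: per-lane operand widths of the int8 WMMA builtins discussed above.
// Typedefs and signatures are assumptions; verify against the compiler headers / ISA docs.
typedef int int32x2_t __attribute__((ext_vector_type(2)));
typedef int int32x4_t __attribute__((ext_vector_type(4)));
typedef int int32x8_t __attribute__((ext_vector_type(8)));

static __device__ int32x8_t wmma_iu8_sketch(const void * A, const void * B, int32x8_t acc) {
#if defined(RDNA3)
    // V_WMMA_I32_16X16X16_IU8 (gfx11): each lane carries 16 packed int8 values = 4 dwords.
    const int32x4_t a = *(const int32x4_t *) A;
    const int32x4_t b = *(const int32x4_t *) B;
    return __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32(true, a, true, b, acc, false);
#else
    // gfx12 variant (RDNA4): each lane carries 8 packed int8 values = 2 dwords.
    const int32x2_t a = *(const int32x2_t *) A;
    const int32x2_t b = *(const int32x2_t *) B;
    return __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32_gfx12(true, a, true, b, acc, false);
#endif // defined(RDNA3)
}
```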

The github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Nov 25, 2025.
@jiachengjason (Contributor)

I am currently working on enabling __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32 for RDNA3 in a similar fashion to #17156.

@unverbraucht (Author)

@jiachengjason great, looking forward to it :)

Maybe you can also help me with this, since it touches on MMQ: I am trying to find the source of the GPT OSS 20b regression. It seems to me that RDNA3 no longer uses MMQ with DP4A instructions for batches < 512, which is the fast path for RDNA3. I'm trying to debug this in my latest commits right now.

@unverbraucht marked this pull request as draft November 25, 2025 16:27
@JohannesGaessler (Collaborator) left a comment:

  • According to the AMD ISA documentation RDNA3 supports integer tensor cores. Please change the comments to say that it's not implemented rather than not supported.
  • Please always add a comment for an #endif to indicate which #if/#ifdef it is closing.
  • The get_i and get_j methods are not going to work correctly if you mirror the data for RDNA3. Please either implement them correctly or replace them with NO_DEVICE_CODE for RDNA3.
  • The code in mma.cuh is currently in a bad state in terms of maintainability and is in dire need of a refactor. However, I consider this to be a job for me, the maintainer, rather than contributors. So no action from your side is necessary, for now it's fine to pile on hacky solutions. I just want to give you a heads-up that the code is subject to change once RDNA3, RDNA4, and CDNA are properly supported and I know what the requirements are.
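
As a small illustration of the second and third points above: a sketch of an annotated #endif and of a NO_DEVICE_CODE fallback for an index helper that is not yet implemented for RDNA3. The index expressions are placeholders, not the real layouts.

```cpp
// Sketch only: annotated #endif plus NO_DEVICE_CODE fallback for an index helper
// that is not implemented correctly on RDNA3 yet. Index math here is a placeholder.
static __device__ int get_i(const int l) {
#if defined(RDNA4)
    return threadIdx.x % 16;   // placeholder mapping for RDNA4
#elif defined(RDNA3)
    NO_DEVICE_CODE;            // mirrored RDNA3 layout not handled yet
    return -1;
#else
    return l;                  // placeholder for the NVIDIA path
#endif // defined(RDNA4)
}
```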

@unverbraucht (Author)

@JohannesGaessler thanks for the feedback.

RDNA3 indeed supports INT8 in WMMA, and I'll investigate that. It doesn't support FP8, and sparse WMMA is also missing. Looking into get_i and get_j.

Regarding your new code in #17505: does it even make sense to investigate this code here further, or should I wait for that PR to be merged and then attempt to add this to the new MMA kernel?

@JohannesGaessler (Collaborator)

RDNA3 support should be largely independent of the changes I'm making to mma.cuh as long as you're only working on the kernel in mmf.cuh. For the kernel in fattn-mma-f16.cuh my PR should very much be merged first and then correct implementations for get_i and get_j will be 100% required.

Kevin Read added 2 commits November 27, 2025 11:35
Details
1. Separated RDNA3 and RDNA4 integer WMMA implementations in mma.cuh:
   - RDNA4: uses __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32_gfx12 with int32x2_t (original path preserved)
   - RDNA3: uses __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32 with int32x4_t (new path added)

2. Both architectures now share:
   - The AMD_WMMA_INT_AVAILABLE macro (enables shared optimizations)
   - Memory layout settings (mmq_x_max=128, granularity=16, nwarps=8)
   - The tile<16, 4, int> optimizations in mmq.cuh

3. RDNA4-exclusive features remain untouched:
   - FP8/BF8 WMMA operations
   - Specific RDNA4 optimizations behind #if defined(RDNA4) guards
@unverbraucht (Author)

@JohannesGaessler I've updated the code to make use of int8 WMMA. get_i and get_j are working now, and the #endifs are annotated with comments.

@jiachengjason as far as I can tell, my changes cover the uses of __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32 that I can see. This also fixes the GPT-OSS 20b regression. Please have a look at this draft.

I will hold off on resolving the merge conflicts since we want to merge #17505 first.

@unverbraucht marked this pull request as ready for review November 27, 2025 11:54
@jiachengjason (Contributor)

Hi @unverbraucht, I don't believe the int8 implementation is correct (checked by running HIP_VISIBLE_DEVICES=0 ./build/bin/test-backend-ops test -o MUL_MAT > output.txt). The quantization cases are failing for n>8. I think the mapping for loading the data into the registers is incorrect for RDNA3 (load_generic).

I have attached the output of the backend-ops test:
output.txt

@zhang-hui-yulo (Contributor) commented Nov 28, 2025

> Hi @unverbraucht, I don't believe the int8 implementation is correct (checked by running HIP_VISIBLE_DEVICES=0 ./build/bin/test-backend-ops test -o MUL_MAT > output.txt). The quantization cases are failing for n>8. I think the mapping for loading the data into the registers is incorrect for RDNA3 (load_generic).
>
> I have attached the output of the backend-ops test: output.txt

I think @unverbraucht uses the original tile<I, J, T> for RDNA3 int8 WMMA and doesn't handle the data loading well in load_generic. I wonder if it's possible to move matrices A and B out of tile<I, J, T>, since it covers both row- and column-major matrices for AMD.

Also, I would suggest cleaning up get_i and get_j in tile<I, J, half2> and tile<I, J, nv_bfloat162>, as the iteration based on ne will not be correct.

My suggestion would be tile<I, J, T, transposed = true> for matrices A and B for RDNA int8 and tile<I, J, T> for matrix C. Then load_generic can use ggml_cuda_memcpy_1 at position get_i(0) + get_j(0) for all RDNA WMMA layouts, as they all have contiguous data; just remove the ugly int64_t copy.
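
A sketch of what that simplified load_generic could look like. The helper name is hypothetical; it assumes ggml_cuda_memcpy_1<nbytes>(dst, src) from common.cuh does a plain nbytes copy and that each lane's elements really do start contiguously at get_i(0)/get_j(0).

```cpp
// Sketch only: contiguous per-lane load for RDNA WMMA tiles, following the suggestion above.
// Assumes tile<I, J, T> exposes x[], get_i(), get_j() and that ggml_cuda_memcpy_1<nbytes>
// copies nbytes from src to dst as in common.cuh.
template <int I, int J, typename T>
static __device__ void load_contiguous(tile<I, J, T> & t, const T * __restrict__ xs0, const int stride) {
    // All elements owned by this lane are contiguous in memory, so one copy
    // starting at the lane's first element is enough.
    ggml_cuda_memcpy_1<sizeof(t.x)>(t.x, xs0 + t.get_i(0)*stride + t.get_j(0));
}
```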

@JohannesGaessler (Collaborator)

> My suggestion would be tile<I, J, T, transposed = true> for matrices A and B for RDNA int8 and tile<I, J, T> for matrix C.

Please only do that if the data is actually transposed. As I said, the current state of mma.cuh is very messy. I intend to do a refactor with a better design later, but I first need to figure out what the exact requirements are. As of right now, if you just need to specify a different data layout for C/D and A/B, preferably do this via the data_split enum (it may make sense to rename this to data_layout).

@zhang-hui-yulo (Contributor)

> My suggestion would be tile<I, J, T, transposed = true> for matrices A and B for RDNA int8 and tile<I, J, T> for matrix C.

> Please only do that if the data is actually transposed. As I said, the current state of mma.cuh is very messy. I intend to do a refactor with a better design later, but I first need to figure out what the exact requirements are. As of right now, if you just need to specify a different data layout for C/D and A/B, preferably do this via the data_split enum (it may make sense to rename this to data_layout).

I would prefer a data_layout enum including row-major and col-major; all tiles for NV would be row-major. For AMD, matrices A and B would be row-major (following NV's pattern) and matrix C would be col-major (I and J transposed). This makes more sense and is friendlier to load_generic.
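
A rough sketch of that idea; all names here are placeholders rather than the actual mma.cuh refactor.

```cpp
// Sketch only: a data_layout enum replacing data_split, as discussed above.
enum class data_layout {
    row_major,   // NVIDIA tiles, and AMD input tiles A/B
    col_major,   // AMD accumulator tile C (I and J transposed)
};

template <int I_, int J_, typename T, data_layout layout = data_layout::row_major>
struct tile_sketch {
    static constexpr bool transposed = layout == data_layout::col_major;
    // Dimensions as seen by the index helpers after applying the layout.
    static constexpr int I = transposed ? J_ : I_;
    static constexpr int J = transposed ? I_ : J_;
};
```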

@JohannesGaessler (Collaborator)

Agreed. The reason it currently looks like this is that I previously had a different design for Volta, but the performance turned out to be bad so I cut it again.

…pects tile duplication

  1. ✅ Added support for tile<16, 4> and tile<16, 8> in supported()
  2. ✅ Implemented get_i/get_j for these tile sizes
  3. ✅ Set ne = I*J/16 for RDNA3 input tiles (to match 4 VGPR requirement)
  4. ✅ Set ne = I*J/32 for RDNA3 accumulator tiles (same as RDNA4)
  5. ✅ Configured get_j to handle duplication (threads 16-31 load same as 0-15)
  6. ✅ Updated half2/bfloat162 get_j to properly iterate
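
A condensed sketch of what those tile constants and the duplication amount to. The ne expression follows the commit above; whether the %16 lands in get_i or get_j depends on the tile orientation, so this is just one way the "lanes 16..31 mirror lanes 0..15" rule can be expressed, not verified against the ISA.

```cpp
// Sketch only: RDNA3 tile sizing and lane duplication as described in the commit above.
#if defined(RDNA3)
    // 16x16 accumulator tiles: all 32 lanes hold distinct values -> I*J/32 per lane.
    // Input tiles: only 16 distinct lanes, duplicated across the wave -> I*J/16 per lane.
    static constexpr int ne = (I == 16 && J == 16) ? (I * J / 32) : (I * J / 16);

    static __device__ int get_i(const int /*l*/) {
        return threadIdx.x % 16;   // lanes 16..31 mirror lanes 0..15
    }
    static __device__ int get_j(const int l) {
        return l;                  // each lane walks its elements along the J/K dimension
    }
#endif // defined(RDNA3)
```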
@unverbraucht (Author)

Thanks for all the feedback. I think I fixed the issue with get_i/get_j with regard to the ne count for bf16/f16. I can reproduce the issues with the MUL_MAT test for int8 but have not been able to fix them: all quantized types FAIL for n>8 with an NMSE of ~1-3.

@unverbraucht (Author)

@zhang-hui-yulo @jiachengjason could one of you test my PR on RDNA4 with the MUL_MAT backend-ops test, to see whether something also broke for RDNA4 and I should look outside of the RDNA3-specific sections for the issue?

@unverbraucht (Author)

I tried disabling tile<16,8> for RDNA3 in the supported() function, but that doesn't actually change the code path since the typedefs are fixed at compile time. As far as I can tell, both threads in the RDNA3 Wave32 duplication pattern (threads 0 and 16) load identical data, as expected. Calling the WMMA intrinsic twice (like RDNA4) didn't fix it either.

I assume that a larger refactoring is needed so that other tile sizes are supported?

   I could not get the WMMA intrinsics for RDNA3
   (__builtin_amdgcn_wmma_i32_16x16x16_iu8_w32) to work reliably
   with regard to register layout and data handling.

   This change restricts AMD_WMMA_AVAILABLE to RDNA4 only, allowing
   RDNA3 to use the proven fallback MMQ/MFMA code path.

   Removes the AMD_WMMA_INT_AVAILABLE macro as it's no longer needed.
@zhang-hui-yulo (Contributor) commented Nov 30, 2025

@unverbraucht I just ran a quick test based on your repo; it goes into NO_DEVICE_CODE at mma.cuh:883. The last successful case is "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1)".

I think the case right after the last successful one is the failing one.

@unverbraucht (Author)

@zhang-hui-yulo I have disabled WMMA for the RDNA3 int8 code path because I couldn't get it to work reliably, so that's probably what you're seeing. I'm quite happy with the performance for FP16 currently. If you want to look into the int8 issues, please go back to commit 9c9b0ea.

@JohannesGaessler (Collaborator) left a comment:

Having to deal with AMD_WMMA_AVAILABLE vs. AMD_WMMA_INT8_AVAILABLE would incur a non-negligible maintenance burden. If it comes to that I would rather not merge this PR and wait for me or someone else to make a PR that properly supports the whole feature set. If I remember correctly you previously had a version that worked correctly and changed only the FP16 and BF16 code paths without touching the int8 code path. A version like that would be acceptable to merge.

Comment on lines +227 to +228
// WMMA is only properly supported on RDNA4, not RDNA3
// RDNA3 has incomplete int8 WMMA support and should use fallback path

As I said before, use the term "not implemented" instead of "not supported" when talking about RDNA3 int8 WMMA in the context of llama.cpp/ggml.

#else
// RDNA3: Accumulator uses same ne, but input tiles need more VGPRs
static constexpr int ne = (I == 16 && J == 16) ? (I * J / 32) : (I * J / 16);
#endif

Suggested change:
- #endif
+ #endif // defined(RDNA4)

Please annotate all #endif statements with comments like this to indicate which #if/#ifdef they're closing.

T x[ne] = {0};

static constexpr __device__ bool supported() {
// Integer and FP16 WMMA are supported on RDNA3/RDNA4

Suggested change:
- // Integer and FP16 WMMA are supported on RDNA3/RDNA4

Comment on lines -440 to +483
- if constexpr (I == 16 && J == 4) {
-     int64_t * xi = (int64_t *) t.x;
-     const int64_t * xs = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 2 * (threadIdx.x / t.I));
-     xi[0] = xs[0];
- }else if constexpr (I == 16 && J == 8) {
-     int64_t * xi = (int64_t *) t.x;
-     const int64_t * xs = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 4 * (threadIdx.x / t.I));
-     xi[0] = xs[0];
-
-     const int64_t * xs1 = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 4 * (threadIdx.x / t.I) + 2);
-     xi[1] = xs1[0];
- }else{
-     NO_DEVICE_CODE;
+ // Use generic load path for all AMD WMMA to ensure correct register mapping
+ #pragma unroll
+ for (int l = 0; l < t.ne; ++l) {
+     t.x[l] = xs0[t.get_i(l)*stride + t.get_j(l)];

Don't just remove the preexisting code.

t.x[l] = xs0[t.get_i(l)*stride + t.get_j(l)];
}
#else
#else

Suggested change
#else
#else

int32x2_t * a_vec = (int32x2_t *) A.x;
int32x2_t * b_vec = (int32x2_t *) B.x;

#elif defined(AMD_WMMA_INT_AVAILABLE)

You forgot to add #define AMD_WMMA_INT_AVAILABLE in common.cuh, so your PR is essentially disabling int8 WMMA for all GPUs. Also, if your int8 WMMA implementation is broken, please remove it again. The same goes for the tile sizes that you added: please keep only those that are actually being used and confirmed to work correctly.

#if defined(GGML_USE_HIP)
static int mmq_get_nwarps_host(const int cc, const int warp_size) {
-     return amd_mfma_available(cc) ? 8 : 256/warp_size;
+     return (amd_mfma_available(cc) || amd_wmma_available(cc)) ? 8 : 256/warp_size;

Why are you changing this?

@jiachengjason (Contributor)

> @zhang-hui-yulo I have disabled WMMA for the RDNA3 int8 code path because I couldn't get it to work reliably, so that's probably what you're seeing. I'm quite happy with the performance for FP16 currently. If you want to look into the int8 issues, please go back to commit 9c9b0ea.

Hi @unverbraucht and @JohannesGaessler, I have figured out the int8 path in my PR here: #17576

@zhang-hui-yulo (Contributor) commented Dec 2, 2025

> @zhang-hui-yulo I have disabled WMMA for the RDNA3 int8 code path because I couldn't get it to work reliably, so that's probably what you're seeing. I'm quite happy with the performance for FP16 currently. If you want to look into the int8 issues, please go back to commit 9c9b0ea.

> Hi @unverbraucht and @JohannesGaessler, I have figured out the int8 path in my PR here: #17576

@unverbraucht I might suggest moving this PR forward based on @jiachengjason's submission; it seems this PR has run into a lot of issues. Then you wouldn't need to worry about int8.

I would also suggest using two ggml_cuda_memcpy_1 calls for the RDNA3 half2 tiles; then load_generic will be much easier.
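
A sketch of that two-copy load for the 32-byte RDNA3 half2 fragments. The helper name is hypothetical, and the group size of 4 half2 per copy as well as the use of the index helpers are assumptions.

```cpp
// Sketch only: load the 8 half2 values a lane owns on RDNA3 as two 16-byte groups,
// per the suggestion above. Offsets come from the tile's own index helpers.
template <int I, int J>
static __device__ void load_rdna3_half2(tile<I, J, half2> & t, const half2 * __restrict__ xs0, const int stride) {
    // First group of 4 half2 values (16 bytes) ...
    ggml_cuda_memcpy_1<16>(t.x,     xs0 + t.get_i(0)*stride + t.get_j(0));
    // ... and the second group, starting where the index helpers place element 4.
    ggml_cuda_memcpy_1<16>(t.x + 4, xs0 + t.get_i(4)*stride + t.get_j(4));
}
```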
