Conversation


@jiachengjason jiachengjason commented Nov 28, 2025

Enabled WMMA-MMQ INT kernels for the RDNA 3 architecture on AMD GPUs.

Following a similar approach to #17156.

The performance results below were collected with ./build/bin/llama-bench, on ggml/llama.cpp master at commit ab49f09.

Build command used for these results:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build \
    -DGGML_HIP=ON -DGGML_CUDA_FORCE_MMQ=OFF -DGGML_HIP_UMA=OFF \
    -DGGML_HIP_ROCWMMA_FATTN=OFF -DGPU_TARGETS="gfx1100" \
    -DGGML_HIP_GRAPHS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --config Release -- -j 32

[Performance result screenshots]

@github-actions bot added labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Nov 28, 2025
@jiachengjason marked this pull request as ready for review on December 1, 2025 21:36
static constexpr int ne = I * J / 32;
#elif defined(RDNA3)
static constexpr int ne = (I == 16 && J == 16) ? I * J / 32 : I * J / 16;
#endif
Collaborator
Suggested change
#endif
#endif // defined(RDNA4)

Please add comments indicating which #if/#ifdef each #endif is closing.

Comment on lines +310 to 312
if (GGML_CUDA_CC_IS_RDNA4(cc) || GGML_CUDA_CC_IS_RDNA3(cc)) {
return true;
}
Collaborator

Suggested change
if (GGML_CUDA_CC_IS_RDNA4(cc) || GGML_CUDA_CC_IS_RDNA3(cc)) {
return true;
}
return true;

Comment on lines 1545 to +1548
A1.x[0] = 0x01010101;
A1.x[1] = 0x01010101;
A1.x[2] = 0x01010101;
A1.x[3] = 0x01010101;
Collaborator

Suggested change
A1.x[0] = 0x01010101;
A1.x[1] = 0x01010101;
A1.x[2] = 0x01010101;
A1.x[3] = 0x01010101;
#pragma unroll
for (int l = 0; l < tile_A::ne; ++l) {
A1.x[l] = 0x01010101;
}

To my understanding tile_A has 4 elements for RDNA3 but for RDNA4 it only has 2 elements. So as it is this would result in out-of-bounds writes and potential memory trampling for RDNA4.
