HIP: enable WMMA-MMQ INT kernels for RDNA 3 #17576
base: master
```cpp
    static constexpr int ne = I * J / 32;
#elif defined(RDNA3)
    static constexpr int ne = (I == 16 && J == 16) ? I * J / 32 : I * J / 16;
#endif
```
Suggested change:
```diff
-#endif
+#endif // defined(RDNA4)
```
Please add comments indicating which `#if`/`#ifdef` each `#endif` closes.
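A minimal sketch of the requested convention, using nested conditionals where the labels help most. `RDNA3` is defined by hand only so the snippet compiles on its own (in a real build the architecture macros depend on the GPU target), and `rdna_gen` / `wmma_int_available` are hypothetical names, not identifiers from ggml.

```cpp
// Defined manually for this stand-alone sketch; normally set per GPU target.
#define RDNA3

#if defined(RDNA4) || defined(RDNA3)
#  if defined(RDNA4)
constexpr int rdna_gen = 4;
#  else  // !defined(RDNA4)
constexpr int rdna_gen = 3;
#  endif // defined(RDNA4)
// Hypothetical flag: WMMA INT path enabled on both RDNA generations.
constexpr bool wmma_int_available = true;
#else  // !(defined(RDNA4) || defined(RDNA3))
constexpr int rdna_gen = 0;
constexpr bool wmma_int_available = false;
#endif // defined(RDNA4) || defined(RDNA3)
```

Each `#else` and `#endif` repeats the condition of the `#if` it belongs to, so the inner and outer `#endif` can be told apart without counting nesting levels.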
```cpp
if (GGML_CUDA_CC_IS_RDNA4(cc) || GGML_CUDA_CC_IS_RDNA3(cc)) {
    return true;
}
```
Suggested change:
```diff
-if (GGML_CUDA_CC_IS_RDNA4(cc) || GGML_CUDA_CC_IS_RDNA3(cc)) {
-    return true;
-}
+return true;
```
```cpp
A1.x[0] = 0x01010101;
A1.x[1] = 0x01010101;
A1.x[2] = 0x01010101;
A1.x[3] = 0x01010101;
```
Suggested change:
```diff
-A1.x[0] = 0x01010101;
-A1.x[1] = 0x01010101;
-A1.x[2] = 0x01010101;
-A1.x[3] = 0x01010101;
+#pragma unroll
+for (int l = 0; l < tile_A::ne; ++l) {
+    A1.x[l] = 0x01010101;
+}
```
To my understanding, `tile_A` has 4 elements on RDNA3 but only 2 on RDNA4, so as written this would cause out-of-bounds writes and potentially trample memory on RDNA4.
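A host-side C++ sketch of the arithmetic behind this comment, not code from the PR. It reuses the `ne` formulas from the excerpt above and assumes `tile_A` is a 16×4 tile of 32-bit ints; that shape is an assumption, chosen only because it reproduces the 4-vs-2 element counts (the real shape is not shown here). `Arch`, `tile_ne`, `tile<NE>`, and `make_ones` are hypothetical stand-ins for the real tile types.

```cpp
#include <cstdint>

enum class Arch { RDNA3, RDNA4 };

// Per-lane element count, mirroring the #if defined(RDNA4) / #elif defined(RDNA3) excerpt.
constexpr int tile_ne(Arch arch, int I, int J) {
    return arch == Arch::RDNA4
        ? I * J / 32
        : ((I == 16 && J == 16) ? I * J / 32 : I * J / 16);
}

// Assumed tile_A shape: 16 x 4.
constexpr int tile_A_ne_rdna3 = tile_ne(Arch::RDNA3, 16, 4); // 64 / 16 = 4
constexpr int tile_A_ne_rdna4 = tile_ne(Arch::RDNA4, 16, 4); // 64 / 32 = 2

// Minimal tile stand-in: per-lane storage sized by NE.
template <int NE>
struct tile {
    static constexpr int ne = NE;
    int32_t x[NE];
};

// Fill every element with packed int8 ones (0x01 in each byte).
// Bounding the loop by ne keeps every write inside x[0 .. ne - 1].
template <int NE>
constexpr tile<NE> make_ones() {
    tile<NE> t{};
    for (int l = 0; l < tile<NE>::ne; ++l) {
        t.x[l] = 0x01010101;
    }
    return t;
}
```

With the loop bounded by `ne`, the same source writes 4 elements on RDNA3 and 2 on RDNA4, whereas the hard-coded `x[0]` to `x[3]` version writes 4 on both. The `#pragma unroll` from the suggestion is left out because it is a GPU compiler hint rather than standard C++.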
Enables the WMMA-MMQ INT kernels for the RDNA 3 architecture on AMD GPUs, following an approach similar to #17156.
The performance results below were collected with `./build/bin/llama-bench` on ggml/llama.cpp master up to and including commit ab49f09.
Build command used for these results:
```sh
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_MMQ=OFF -DGGML_HIP_UMA=OFF \
  -DGGML_HIP_ROCWMMA_FATTN=OFF -DGPU_TARGETS="gfx1100" -DGGML_HIP_GRAPHS=OFF \
  -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --config Release -- -j 32
```