High-performance GEMM implementation on Nvidia A100 (internal feishu doc).
Implement a high-performance GEMM (General Matrix Multiply) function with CUDA on an Nvidia A100 for the float32 and float16 data types.
The implementation should achieve at least 90% of the performance of cuBLAS under the given benchmarking structure.
We provide a convenient bash script task1.sh that offers the same operations as the previous Makefile:

```bash
# Show all available commands
./task1.sh help
# 1) Build code with specified FLOAT type and VERSION
./task1.sh build --float f32 --ver 1
# 2) Build and run code, automatically save logs
./task1.sh run --float f16 --ver 1
# 3) Build with debug symbols (RelWithDebInfo)
./task1.sh debug --float f32 --ver 2
# 4) Profile with Nsight Compute, save reports
./task1.sh profile --float f16 --ver 2
# Clean build files
./task1.sh clean
# Clean log files
./task1.sh clean-logs
```

- Automatic Logging: run results are saved to the logs/ directory with timestamp and version info
- TFLOPS and Error Tracking: performance metrics and error rates are captured automatically
- Nsight Compute Integration: profile reports are saved to logs/profiles/ with timestamp
- Version-Specific File Inclusion: only source files for the current version are included, to avoid conflicts
Build example matmul with the following commands (v0 -> cblas; v1 -> cublas):

```bash
# Build gemm implemented with CBLAS (CPU) under float32:
bash scripts/build-task1.sh -f32 -v0
# Build gemm implemented with CBLAS (CPU) under float16:
bash scripts/build-task1.sh -f16 -v0
# Build gemm implemented with cublas (CUDA) under float32:
bash scripts/build-task1.sh -f32 -v1
# Build gemm implemented with cublas (CUDA) under float16:
bash scripts/build-task1.sh -f16 -v1
```

Or, equivalently, with task1.sh:

```bash
# Build gemm implemented with CBLAS (CPU) under float32:
./task1.sh build --float f32 --ver 0
# Build gemm implemented with CBLAS (CPU) under float16:
./task1.sh build --float f16 --ver 0
# Build gemm implemented with cublas (CUDA) under float32:
./task1.sh build --float f32 --ver 1
# Build gemm implemented with cublas (CUDA) under float16:
./task1.sh build --float f16 --ver 1
```

For more compile options, see "./scripts/build-task1.sh" or run ./task1.sh help.
💡Note:
- Please install the following extensions in VSCode:
  - llvm-vs-code-extensions.vscode-clangd
  - twxs.cmake
  - josetr.cmake-language-support-vscode
- It is suggested to restart the clangd server after building (to avoid some code-analysis errors).
  To restart the clangd server, press Ctrl+Shift+P in VSCode and select clangd: Restart language server.
Run the binaries in the "./build/src" directory to get the benchmark results.
You can set m, n, k, n_warmup and n_test by passing arguments to the binaries built in this task. Use -h to print the help messages:

```bash
# Run the binary, showing help messages only
./build/src/task1_float16_v0 -h
```

Or build and run with automatic logging:

```bash
./task1.sh run --float f16 --ver 0
# Run with different configurations
./task1.sh run --float f32 --ver 1
./task1.sh run --float f16 --ver 1
```

The run results will be automatically saved to the logs/ directory with timestamp and version information.
Create a .cu file under the directory "./task-1/src" with any name you like, and implement a matmul function with the macro PLAYGROUND_MATMUL_DEC.
For example, add the following lines in "./task-1/src/xxx/xxx/f16-v2.cu" to provide the definition for the function matmul<float16_t, 2>:

```cpp
// @file: ./task-1/src/xxx/xxx/f16-v2.cu
#include "playground/matmul.hpp"

namespace playground {
// Implement the matmul function with DType=float16_t and Version=2
PLAYGROUND_MATMUL_DEC(float16_t, 2, A, B, C, M, N, K)
{
    // ......
}
}  // namespace playground
```

💡Note:
Do not use versions 0 and 1, because they are reserved for cblas and cublas respectively.
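For orientation, the macro body can simply launch a naive CUDA kernel (one thread per output element). This is only a sketch under assumptions: it presumes the macro parameters are device pointers A, B (inputs) and C (output) in row-major layout with sizes M x K, K x N and M x N, and that float16_t converts to and from float. The kernel name and launch shape here are illustrative, not part of the playground API.

```cuda
// Hypothetical naive kernel: each thread computes one element of C.
// Assumes row-major A (M x K), B (K x N), C (M x N).
__global__ void naiveMatmulKernel(const float16_t* A, const float16_t* B,
                                  float16_t* C, size_t M, size_t N, size_t K)
{
    const size_t row = blockIdx.y * blockDim.y + threadIdx.y;
    const size_t col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;  // accumulate in fp32 to limit rounding error
    for (size_t k = 0; k < K; ++k) {
        acc += float(A[row * K + k]) * float(B[k * N + col]);
    }
    C[row * N + col] = float16_t(acc);
}

PLAYGROUND_MATMUL_DEC(float16_t, 2, A, B, C, M, N, K)
{
    const dim3 block(16, 16);
    const dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    naiveMatmulKernel<<<grid, block>>>(A, B, C, M, N, K);
}
```

This version does one global load per multiply and will land far below the targets in the tables later in this doc; it is only a correctness baseline to iterate from.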
Now you are able to build a new binary task1_float16_v2 with the following commands:

```bash
# Build the test binary with DType=float16 and Version=2:
bash ./scripts/build-task1.sh -v2 -f16
# Run the test binary
./build/src/task1_float16_v2
```

Or build and run with automatic logging:

```bash
./task1.sh run --float f16 --ver 2
```

Use "scripts/nsight-profile.sh" to profile a binary which contains a self-defined CUDA kernel.
First build with the RelWithDebInfo (or RD) flag.
For example, to build the matmul kernel with DType=float16, Version=2 and the RD flag:

```bash
# `RD` is the same as `RelWithDebInfo`
bash ./scripts/build-task1.sh RD -f16 -v2
```

Then you can profile the binary with ncu via a tool script:

```bash
bash ./scripts/nsight-profile.sh -t build/src/task1_float16_v2
```

Or build with debug symbols and profile automatically:

```bash
./task1.sh profile --float f16 --ver 2
```

A .ncu-rep file will be generated in the logs/profiles/ directory with timestamp. Download it to your local machine and open it with the Nsight Compute GUI.
Float32 results:

| Version | v0 | v1 | v2 | v3 | v4 | cuBLAS | Theory Peak |
|---|---|---|---|---|---|---|---|
| Average error | 0.0115 | 0.0115 | 0.0115 | 0.0116 | 0.0116 | / | / |
| TFLOPS | 2.41 | 3.85 | 9.24 | 15.15 | 17.16 | 18.38 | 19.5 |

Float16 results:

| Version | v0 | v1 | v2 | v3 | v4 | cuBLAS | Theory Peak |
|---|---|---|---|---|---|---|---|
| Average error | 0.0117 | 0.0117 | 0.0117 | 0.0117 | 0.0019 | 0.0153 | / |
| TFLOPS | 18.09 | 53.05 | 103.05 | 159.35 | 213.12 | 222.11 | 312 |
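The TFLOPS figures above follow the standard GEMM operation count of 2·M·N·K (one multiply and one add per inner-product term). A minimal sketch of the metric, assuming the benchmark times a single matmul in seconds (the function name is illustrative, not the benchmark's actual API):

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS for one M x N x K GEMM: 2*M*N*K floating-point ops."""
    return (2.0 * m * n * k) / seconds / 1e12

# e.g. a 4096^3 GEMM finishing in 1 ms sustains ~137.4 TFLOPS
print(round(gemm_tflops(4096, 4096, 4096, 1e-3), 1))  # → 137.4
```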
💡Note:
Some cards can reach above 250 TFLOPS using cuBLAS fp16. The target is 90% of cuBLAS on the same card.
See also the feishu doc "cuda学习资料" (CUDA study materials):

- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapters 2-3
- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapters 4-5
- CUDA编程入门及优化, 1.2 Thread Block Tile: using Shared Memory to reduce redundant memory accesses
- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapter 6, Section 6.3 Thread coarsening
- how-to-optimize-gemm, MMult_cuda_4 & MMult_cuda_5
- CUDA 矩阵乘法终极优化指南, analysis of the naive implementation: where exactly is the gap?
- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapter 6, Sections 6.1 Memory coalescing and 6.2 Hiding memory latency
- how-to-optimize-gemm, MMult_cuda_9
- CUDA 矩阵乘法终极优化指南, extreme memory-access optimization
- CUDA编程入门及优化, 1.3 Warp Tile and Thread Tile: using registers to eliminate the Shared Memory bottleneck
- how-to-optimize-gemm, MMult_cuda_12
- CUDA编程入门及优化, 1.4 Double Buffer: pipelining the GEMM
- cuda学习: learning nvcuda::wmma to implement an efficient GEMM, simple version
- cuda学习: learning nvcuda::wmma to implement an efficient GEMM, sample version with detailed annotations
- Nvidia Tensor Core-CUDA HGEMM优化进阶, 4.5 improving the L2 Cache hit rate
- 一步步优化 GEMM by Tensorcore, adjusting which compute position each thread block is assigned (swizzle)
- src/wmma/wmma_async_stage3.cu: 3-stage pipeline with the WMMA API

Asynchronous data copy:
- Data Movement and Conversion Instructions: cp.async, for the usage of cp.async instructions
- Performance Guidance for memcpy_async, for the usage of asynchronous data copy
- Nvidia Tensor Core-CUDA HGEMM优化进阶, 5 Pipeline optimization, 5.2 Stage
- 一步步优化 GEMM by Tensorcore, using data prefetch
- cuda(cutlass)编程之swizzle, a more detailed video explanation of swizzle based on CUTLASS
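The first optimization step the reading list above covers, the thread block tile, can be sketched as follows. This is an illustrative fp32 kernel under assumptions (square tiles, row-major layout, hypothetical kernel name), not one of the document's v2-v4 implementations: each block stages TILE x TILE sub-tiles of A and B in shared memory, so each global element is loaded once per tile rather than once per output element.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;

// Hedged sketch of shared-memory tiling: C = A * B, row-major,
// A is M x K, B is K x N, C is M x N.
__global__ void tiledMatmul(const float* A, const float* B, float* C,
                            int M, int N, int K)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively stage one tile of A and B, guarding the edges.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The later items on the list (thread tiles, double buffering, wmma, cp.async, swizzle) each build on this structure by increasing work per thread and overlapping the staging loads with compute.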

