High-performance GEMM implementation on Nvidia A100 (internal feishu doc).
Implement a high-performance GEMM (General Matrix Multiply) function with CUDA on an Nvidia A100 for the float32 and float16 data types.
The implementation should achieve at least 90% of the performance of cuBLAS under the given benchmarking structure.
We provide a convenient bash script task1.sh that offers the same operations as the previous Makefile:

```bash
# Show all available commands
./task1.sh help
# 1) Build code with specified FLOAT type and VERSION
./task1.sh build --float f32 --ver 1
# 2) Build and run code, automatically save logs
./task1.sh run --float f16 --ver 1
# 3) Build with debug symbols (RelWithDebInfo)
./task1.sh debug --float f32 --ver 2
# 4) Profile with Nsight Compute, save reports
./task1.sh profile --float f16 --ver 2
# Clean build files
./task1.sh clean
# Clean log files
./task1.sh clean-logs
```

- Automatic Logging: run results are saved to the logs/ directory with timestamp and version info
- TFLOPS and Error Tracking: performance metrics and error rates are captured automatically
- Nsight Compute Integration: profile reports are saved to logs/profiles/ with timestamp
- Version-Specific File Inclusion: only source files for the current version are included, to avoid conflicts
Build example matmul with the following commands (v0 -> cblas; v1 -> cublas):

```bash
# Build gemm implemented with CBLAS (CPU) under float32:
bash scripts/build-task1.sh -f32 -v0
# Build gemm implemented with CBLAS (CPU) under float16:
bash scripts/build-task1.sh -f16 -v0
# Build gemm implemented with cublas (CUDA) under float32:
bash scripts/build-task1.sh -f32 -v1
# Build gemm implemented with cublas (CUDA) under float16:
bash scripts/build-task1.sh -f16 -v1
```

Or, equivalently, with task1.sh:

```bash
# Build gemm implemented with CBLAS (CPU) under float32:
./task1.sh build --float f32 --ver 0
# Build gemm implemented with CBLAS (CPU) under float16:
./task1.sh build --float f16 --ver 0
# Build gemm implemented with cublas (CUDA) under float32:
./task1.sh build --float f32 --ver 1
# Build gemm implemented with cublas (CUDA) under float16:
./task1.sh build --float f16 --ver 1
```

For more compile options, see "./scripts/build-task1.sh" or run ./task1.sh help.
💡Note:
- Please install the following extensions in VSCode:
  - llvm-vs-code-extensions.vscode-clangd
  - twxs.cmake
  - josetr.cmake-language-support-vscode
- It is suggested to restart the clangd server after building (to avoid some code-analysis errors).
  To restart the clangd server, press Ctrl+Shift+P in VSCode and select clangd: Restart language server.
Run the binaries in the "./build/src" directory to get the benchmark results.
You can set m, n, k, n_warmup and n_test by passing arguments to the binaries built in this task. Use -h to print the help messages:

```bash
# Run the binary, showing help messages only
./build/src/task1_float16_v0 -h
```

Or build and run with automatic logging:

```bash
./task1.sh run --float f16 --ver 0
# Run with different configurations
./task1.sh run --float f32 --ver 1
./task1.sh run --float f16 --ver 1
```

The run results will be automatically saved to the logs/ directory with timestamp and version information.
Create a .cu file under the directory "./task-1/src" with any name you like, and implement a matmul function with the macro PLAYGROUND_MATMUL_DEC.
For example, add the following lines in "./task-1/src/xxx/xxx/f16-v2.cu" to provide the definition for the function matmul<float16_t, 2>:

```cpp
// @file: ./task-1/src/xxx/xxx/f16-v2.cu
#include "playground/matmul.hpp"

namespace playground {
// Implement the matmul function with DType=float16_t and Version=2
PLAYGROUND_MATMUL_DEC(float16_t, 2, A, B, C, M, N, K)
{
    // ......
}
}  // namespace playground
```

💡Note:
Do not use versions 0 and 1, because they are reserved for cblas and cublas respectively.
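For orientation, the macro body can simply launch a naive CUDA kernel (one thread per output element). This is only a sketch under assumptions: it presumes the macro parameters are device pointers A, B (inputs) and C (output) in row-major layout with sizes M x K, K x N and M x N, and that float16_t converts to and from float. The kernel name and launch shape here are illustrative, not part of the playground API.

```cuda
// Hypothetical naive kernel: each thread computes one element of C.
// Assumes row-major A (M x K), B (K x N), C (M x N).
__global__ void naiveMatmulKernel(const float16_t* A, const float16_t* B,
                                  float16_t* C, size_t M, size_t N, size_t K)
{
    const size_t row = blockIdx.y * blockDim.y + threadIdx.y;
    const size_t col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;  // accumulate in fp32 to limit rounding error
    for (size_t k = 0; k < K; ++k) {
        acc += float(A[row * K + k]) * float(B[k * N + col]);
    }
    C[row * N + col] = float16_t(acc);
}

PLAYGROUND_MATMUL_DEC(float16_t, 2, A, B, C, M, N, K)
{
    const dim3 block(16, 16);
    const dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    naiveMatmulKernel<<<grid, block>>>(A, B, C, M, N, K);
}
```

This version does one global load per multiply and will land far below the targets in the tables later in this doc; it is only a correctness baseline to iterate from.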
Now you are able to build a new binary task1_float16_v2 with the following commands:

```bash
# Build the test binary with DType=float16 and Version=2:
bash ./scripts/build-task1.sh -v2 -f16
# Run the test binary
./build/src/task1_float16_v2
```

Or build and run with automatic logging:

```bash
./task1.sh run --float f16 --ver 2
```

Use "scripts/nsight-profile.sh" to profile a binary which contains a self-defined CUDA kernel.
First build with the RelWithDebInfo (or RD) flag.
For example, to build the matmul kernel with DType=float16, Version=2 and the RD flag:

```bash
# `RD` is the same as `RelWithDebInfo`
bash ./scripts/build-task1.sh RD -f16 -v2
```

Then you can profile the binary with ncu via a tool script:

```bash
bash ./scripts/nsight-profile.sh -t build/src/task1_float16_v2
```

Or build with debug symbols and profile automatically:

```bash
./task1.sh profile --float f16 --ver 2
```

A .ncu-rep file will be generated in the logs/profiles/ directory with timestamp. Download it to your local machine and open it with the Nsight Compute GUI.
Float32 results:

| Version | v0 | v1 | v2 | v3 | v4 | cuBLAS | Theory Peak |
|---|---|---|---|---|---|---|---|
| Average error | 0.0115 | 0.0115 | 0.0115 | 0.0116 | 0.0116 | / | / |
| TFLOPS | 2.41 | 3.85 | 9.24 | 15.15 | 17.16 | 18.38 | 19.5 |

Float16 results:

| Version | v0 | v1 | v2 | v3 | v4 | cuBLAS | Theory Peak |
|---|---|---|---|---|---|---|---|
| Average error | 0.0117 | 0.0117 | 0.0117 | 0.0117 | 0.0019 | 0.0153 | / |
| TFLOPS | 18.09 | 53.05 | 103.05 | 159.35 | 213.12 | 222.11 | 312 |
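The TFLOPS figures above follow the standard GEMM operation count of 2·M·N·K (one multiply and one add per inner-product term). A minimal sketch of the metric, assuming the benchmark times a single matmul in seconds (the function name is illustrative, not the benchmark's actual API):

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS for one M x N x K GEMM: 2*M*N*K floating-point ops."""
    return (2.0 * m * n * k) / seconds / 1e12

# e.g. a 4096^3 GEMM finishing in 1 ms sustains ~137.4 TFLOPS
print(round(gemm_tflops(4096, 4096, 4096, 1e-3), 1))  # → 137.4
```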
💡Note:
Some cards can reach above 250 TFLOPS using cuBLAS fp16. The target is 90% of cuBLAS on the same card.
See also the feishu doc "cuda学习资料" (CUDA study materials):

- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapters 2-3
- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapters 4-5
- CUDA编程入门及优化, 1.2 Thread Block Tile: using Shared Memory to reduce redundant memory accesses
- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapter 6, Section 6.3 Thread coarsening
- how-to-optimize-gemm, MMult_cuda_4 & MMult_cuda_5
- CUDA 矩阵乘法终极优化指南, analysis of the naive implementation: where exactly is the gap?
- "Programming Massively Parallel Processors: A Hands-on Approach" (Fourth Edition), Chapter 6, Sections 6.1 Memory coalescing and 6.2 Hiding memory latency
- how-to-optimize-gemm, MMult_cuda_9
- CUDA 矩阵乘法终极优化指南, extreme memory-access optimization
- CUDA编程入门及优化, 1.3 Warp Tile and Thread Tile: using registers to eliminate the Shared Memory bottleneck
- how-to-optimize-gemm, MMult_cuda_12
- CUDA编程入门及优化, 1.4 Double Buffer: pipelining the GEMM
- cuda学习: learning nvcuda::wmma to implement an efficient GEMM, simple version
- cuda学习: learning nvcuda::wmma to implement an efficient GEMM, sample version with detailed annotations
- Nvidia Tensor Core-CUDA HGEMM优化进阶, 4.5 improving the L2 Cache hit rate
- 一步步优化 GEMM by Tensorcore, adjusting which compute position each thread block is assigned (swizzle)
- src/wmma/wmma_async_stage3.cu: 3-stage pipeline with the WMMA API

Asynchronous data copy:
- Data Movement and Conversion Instructions: cp.async, for the usage of cp.async instructions
- Performance Guidance for memcpy_async, for the usage of asynchronous data copy
- Nvidia Tensor Core-CUDA HGEMM优化进阶, 5 Pipeline optimization, 5.2 Stage
- 一步步优化 GEMM by Tensorcore, using data prefetch
- cuda(cutlass)编程之swizzle, a more detailed video explanation of swizzle based on CUTLASS
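The first optimization step the reading list above covers, the thread block tile, can be sketched as follows. This is an illustrative fp32 kernel under assumptions (square tiles, row-major layout, hypothetical kernel name), not one of the document's v2-v4 implementations: each block stages TILE x TILE sub-tiles of A and B in shared memory, so each global element is loaded once per tile rather than once per output element.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;

// Hedged sketch of shared-memory tiling: C = A * B, row-major,
// A is M x K, B is K x N, C is M x N.
__global__ void tiledMatmul(const float* A, const float* B, float* C,
                            int M, int N, int K)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively stage one tile of A and B, guarding the edges.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The later items on the list (thread tiles, double buffering, wmma, cp.async, swizzle) each build on this structure by increasing work per thread and overlapping the staging loads with compute.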

