Task 1: CUDA Programming

A high-performance GEMM implementation on an NVIDIA A100 (internal Feishu doc).

1. Target

Implement a high-performance GEMM (General Matrix Multiply) function with CUDA on an NVIDIA A100 for the float32 and float16 data types.

The implementation should achieve at least 90% of the performance of cuBLAS under the given benchmarking structure.

2. Quick Start with Bash Script

We provide a convenient bash script, task1.sh, that offers the same operations as the previous Makefile:

# Show all available commands
./task1.sh help

# 1) Build code with specified FLOAT type and VERSION
./task1.sh build --float f32 --ver 1

# 2) Build and run code, automatically save logs
./task1.sh run --float f16 --ver 1

# 3) Build with debug symbols (RelWithDebInfo)
./task1.sh debug --float f32 --ver 2

# 4) Profile with nsight compute, save reports
./task1.sh profile --float f16 --ver 2

# Clean build files
./task1.sh clean

# Clean log files  
./task1.sh clean-logs

Key Features:

  • Automatic Logging: Run results are saved to logs/ directory with timestamp and version info
  • TFLOPS and Error Tracking: Captures performance metrics and error rates automatically
  • Nsight Compute Integration: Profile reports saved to logs/profiles/ with timestamp
  • Version-Specific File Inclusion: Only includes source files for the current version to avoid conflicts

3. Benchmark CBLAS and cuBLAS

Build the example matmul with the following commands (v0 -> CBLAS; v1 -> cuBLAS):

Using the build script directly:

# Build gemm implemented with CBLAS (CPU) under float32:
bash scripts/build-task1.sh -f32 -v0
# Build gemm implemented with CBLAS (CPU) under float16:
bash scripts/build-task1.sh -f16 -v0
# Build gemm implemented with cublas (CUDA) under float32:
bash scripts/build-task1.sh -f32 -v1
# Build gemm implemented with cublas (CUDA) under float16:
bash scripts/build-task1.sh -f16 -v1

Using the task1.sh script (recommended):

# Build gemm implemented with CBLAS (CPU) under float32:
./task1.sh build --float f32 --ver 0
# Build gemm implemented with CBLAS (CPU) under float16:
./task1.sh build --float f16 --ver 0
# Build gemm implemented with cublas (CUDA) under float32:
./task1.sh build --float f32 --ver 1
# Build gemm implemented with cublas (CUDA) under float16:
./task1.sh build --float f16 --ver 1

For more compile options, see ./scripts/build-task1.sh or run ./task1.sh help.

💡Note:

  1. Please install the following extensions in VSCode:
    • llvm-vs-code-extensions.vscode-clangd
    • twxs.cmake
    • josetr.cmake-language-support-vscode
  2. It is recommended to restart the clangd server after building (to avoid stale code-analysis errors).
    To restart it, press Ctrl+Shift+P in VSCode and select clangd: Restart language server.

Run the binaries in the ./build/src directory to get the benchmark results.

Running directly:

You can set m, n, k, n_warmup and n_test by passing arguments to the binaries built in this task. Use -h to print the help message:

# Run the binary, showing the help message only
./build/src/task1_float16_v0 -h

Running with the task1.sh script (recommended):

# Build and run with automatic logging
./task1.sh run --float f16 --ver 0

# Run with different configurations
./task1.sh run --float f32 --ver 1
./task1.sh run --float f16 --ver 1

The run results are automatically saved to the logs/ directory with timestamp and version information.

4. Add Your Own Implementation

Create a .cu file under the directory ./task-1/src with any name you like, and implement a matmul function with the macro PLAYGROUND_MATMUL_DEC.

For example, add the following lines in "./task-1/src/xxx/xxx/f16-v2.cu" to define the function matmul<float16_t, 2>:

// @file: ./task-1/src/xxx/xxx/f16-v2.cu

#include "playground/matmul.hpp"

namespace playground {
// Implement the matmul function with DType=float16_t and Version=2
PLAYGROUND_MATMUL_DEC(float16_t, 2, A, B, C, M, N, K)
{
    // ......
}
}  // namespace playground

💡Note:
Do not use versions 0 and 1; they are reserved for CBLAS and cuBLAS respectively.
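Before optimizing, it helps to validate a new kernel against a plain CPU reference. The benchmark reports an average error (see the results tables below), but its exact metric is not documented here, so the mean relative error in this sketch is an assumption, as are the helper names:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain row-major CPU reference: C = A * B with A (MxK), B (KxN), C (MxN).
void matmul_ref(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t N, std::size_t K)
{
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0F;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

// Assumed error metric: mean relative error against the reference output
// (the harness may define "average error" differently).
float mean_rel_error(const std::vector<float>& out, const std::vector<float>& ref)
{
    float sum = 0.0F;
    for (std::size_t i = 0; i < out.size(); ++i)
        sum += std::fabs(out[i] - ref[i]) / (std::fabs(ref[i]) + 1e-6F);
    return sum / static_cast<float>(out.size());
}
```

Copy your kernel's output back to the host and compare it to the reference; for FP16, average errors on the order of 0.01 (as in the tables) are expected from half-precision rounding.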

Now you can build a new binary, task1_float16_v2, with the following commands:

Using the build script directly:

# Build the test binary with DType=float16 and Version=2:
bash ./scripts/build-task1.sh -v2 -f16
# Run the test binary
./build/src/task1_float16_v2

Using the task1.sh script (recommended):

# Build and run with automatic logging
./task1.sh run --float f16 --ver 2

5. Profile Your Kernel with Nsight Compute

Use "scripts/nsight-profile.sh" to profile a binary that contains a self-defined CUDA kernel.

⚠️ The profiled binary must be built with the RelWithDebInfo (RD) flag.

Using the build script directly:

For example, to build the matmul kernel with DType=float16, Version=2, and the RD flag:

# `RD` is the same as `RelWithDebInfo`
bash ./scripts/build-task1.sh RD -f16 -v2 

Then you can profile the binary with ncu via the tool script:

bash ./scripts/nsight-profile.sh -t build/src/task1_float16_v2

Using the task1.sh script (recommended):

# Build with debug symbols and profile automatically
./task1.sh profile --float f16 --ver 2

A .ncu-rep file will be generated in the logs/profiles/ directory, named with a timestamp. Download it to your local machine and open it with the Nsight Compute GUI.


6. Example Target Results

CUDA Core (FP32)

Version        v0      v1      v2      v3      v4      cuBLAS  Theory Peak
Average error  0.0115  0.0115  0.0115  0.0116  0.0116  /       /
TFLOPS         2.41    3.85    9.24    15.15   17.16   18.38   19.5

Tensor Core (FP16)

Version        v0      v1      v2      v3      v4      cuBLAS  Theory Peak
Average error  0.0117  0.0117  0.0117  0.0117  0.0019  0.0153  /
TFLOPS         18.09   53.05   103.05  159.35  213.12  222.11  312

💡Note:
Some cards can exceed 250 TFLOPS with cuBLAS FP16. The target is 90% of cuBLAS measured on the same card.

7. References

See also the Feishu doc: CUDA learning materials.

  • CUDA Core
  • Tensor Core
  • Source code:
    • Asynchronous data copy
    • Multi-buffer with prefetching
    • Permute to use memory coalescing and avoid bank conflicts

For Further Study