ipengx1029/cutlass-notes


CUTLASS Notes

The CUTLASS notes series begins with a minimal GEMM implementation, gradually expands to cover CuTe, various CUTLASS components, and features of newer architectures such as Hopper and Blackwell, and ultimately builds a high-performance fused GEMM operator.

Usage

git clone https://github.com/ArthurinRUC/cutlass-notes.git
cd cutlass-notes
# fetch the CUTLASS submodule
git submodule update --init --recursive

Run sample code

Each example in this repository can be compiled and run by executing its Python script. For example:

cd 01-minimal-gemm
python minimal_gemm.py

Note list

Note: 00-Intro
Summary: Brief introduction to CUTLASS
Link: intro

Note: 01-minimal-gemm
Summary:
  • Introduces CuTe fundamentals
  • Implements a 16x8x8 GEMM kernel from scratch using a single MMA instruction
  • Python kernel invocation, precision validation, and performance benchmarking
  • Profiling with Nsight Compute (ncu)
Link: minimal-gemm

Note: 02-mixed-precision-gemm
Summary:
  • Implements mixed-precision GEMM supporting varying input, output, and accumulation precisions
  • Explores technical details of numerical precision conversion within kernels
  • Demonstrates a custom FP8 GEMM kernel implemented via PTX instructions (for MMA ops that CUTLASS does not support)
Link: mixed-precision-gemm

Note: 03-tiled-mma
Summary:
  • Introduces the key conceptual model of the GEMM operator: Three-Level Tiling
  • Details the implementation of Tiled MMA operations in CUTLASS CuTe
  • Explains the usage and semantics of the parameters in the Tiled MMA API
  • Extends the GEMM kernel from a single instruction to a single tile operation
Link: tiled-mma

Note: 04-tiled-copy
Summary: Coming soon
Link: Stay tuned
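
To give a rough feel for how the 16x8x8 instruction tile from 01-minimal-gemm sits inside the three-level tiling covered in 03-tiled-mma, here is a minimal pure-Python sketch. It is not CUTLASS or CuTe code. It assumes the three levels are thread block tiles, warp tiles, and MMA instruction tiles, which the notes themselves may define differently. The function names and the block and warp tile sizes are hypothetical and chosen only for illustration.

```python
# Illustrative model of three-level tiling in plain Python (not CUTLASS/CuTe code).
# Tile sizes for the block and warp levels are hypothetical.
MMA_M, MMA_N, MMA_K = 16, 8, 8  # instruction-level tile, matching the 16x8x8 MMA shape


def mma_16x8x8(A, B, C, m0, n0, k0):
    """Accumulate one 16x8x8 tile product into C (stands in for a single MMA instruction)."""
    for i in range(MMA_M):
        for j in range(MMA_N):
            acc = C[m0 + i][n0 + j]
            for k in range(MMA_K):
                acc += A[m0 + i][k0 + k] * B[k0 + k][n0 + j]
            C[m0 + i][n0 + j] = acc


def tiled_gemm(A, B, M, N, K, block=(32, 16), warp=(16, 8)):
    """Compute C = A @ B by nesting block, warp, and instruction tiles."""
    bm, bn = block
    wm, wn = warp
    # Each level must divide evenly into the one above it.
    assert M % bm == 0 and N % bn == 0 and K % MMA_K == 0
    assert bm % wm == 0 and bn % wn == 0
    assert wm % MMA_M == 0 and wn % MMA_N == 0

    C = [[0] * N for _ in range(M)]
    for block_m in range(0, M, bm):                          # level 1: thread block tiles
        for block_n in range(0, N, bn):
            for warp_m in range(block_m, block_m + bm, wm):  # level 2: warp tiles
                for warp_n in range(block_n, block_n + bn, wn):
                    for m0 in range(warp_m, warp_m + wm, MMA_M):  # level 3: MMA tiles
                        for n0 in range(warp_n, warp_n + wn, MMA_N):
                            for k0 in range(0, K, MMA_K):    # march along K
                                mma_16x8x8(A, B, C, m0, n0, k0)
    return C


def reference_gemm(A, B, M, N, K):
    """Naive triple-loop GEMM used to validate the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]
```

On a GPU the outer loops are distributed across thread blocks and warps rather than executed sequentially. The sketch only shows how the index space is partitioned, and the comparison against `reference_gemm` mirrors the kind of validation step the notes describe.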

License

This project is licensed under the MIT License - see the LICENSE file for details.
