[v0.2.7] Multi-SM CUTLASS kernels with runtime dispatch #68

@m96-chan

Description

Summary

Implement SM-specific CUTLASS kernel variants with runtime dispatch for optimal performance across GPU architectures.

Motivation

  • SM 80 (A100), SM 86 (RTX 30xx), SM 89 (RTX 40xx), SM 90 (H100) have different optimal configurations
  • Single kernel compiled for lowest SM leaves performance on the table
  • Runtime dispatch allows shipping one wheel that works optimally on all GPUs

Proposed Architecture

Wheel Structure

pygpukit/
 ├─ core/                  # Rust core
 ├─ native/
 │   ├─ ops/
 │   │   ├─ matmul_cutlass_sm80.cu   # Ampere (A100)
 │   │   ├─ matmul_cutlass_sm86.cu   # Ampere (RTX 30xx)
 │   │   ├─ matmul_cutlass_sm90.cu   # Hopper (H100)
 │   │   └─ matmul_fallback.cu       # SIMT fallback

Runtime Selector (Rust)

match device.sm {
    90.. => kernel_sm90,   // Hopper+
    86.. => kernel_sm86,   // RTX 30xx, RTX 40xx
    80.. => kernel_sm80,   // A100
    _    => fallback,      // Pre-Ampere (not officially supported)
}
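A minimal runnable sketch of the selector above, assuming the flat `sm` value (e.g. 86 for compute capability 8.6); `KernelVariant` and `select_kernel` are illustrative names, not the actual pygpukit API:

```rust
// Hypothetical stand-in for the per-SM kernel entry points.
#[derive(Debug, PartialEq)]
enum KernelVariant {
    Sm90,     // Hopper (H100)
    Sm86,     // Ampere/Ada consumer (RTX 30xx, RTX 40xx)
    Sm80,     // Ampere datacenter (A100)
    Fallback, // SIMT path, pre-Ampere
}

/// Map a device's flattened compute capability to a kernel variant.
/// Match arms are tried top-down, so the highest matching SM wins;
/// half-open range patterns like `90..` are stable Rust.
fn select_kernel(sm: u32) -> KernelVariant {
    match sm {
        90.. => KernelVariant::Sm90,
        86.. => KernelVariant::Sm86,
        80.. => KernelVariant::Sm80,
        _    => KernelVariant::Fallback,
    }
}

fn main() {
    // SM 89 (RTX 40xx) deliberately routes to the SM 86 build,
    // since no dedicated sm89 variant is shipped in this proposal.
    println!("{:?}", select_kernel(89));
}
```

Note that ordering matters: because `86..` is an open-ended range, it would also swallow SM 90 if it appeared first, so the arms must be listed from highest to lowest.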

Benefits

  1. Optimal performance - each SM variant gets tuned tile sizes and pipeline depth
  2. Single wheel - No need for separate builds per architecture
  3. Future-proof - Easy to add SM 100+ when available

Implementation Notes

  • Each SM variant compiles with -arch=sm_XX
  • Fallback uses SIMT (no TensorCore) for unsupported architectures
  • Runtime detection via cudaGetDeviceProperties
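To connect detection to dispatch: `cudaGetDeviceProperties` reports the compute capability as a `(major, minor)` pair, which flattens into the single SM number the selector compares against. A sketch (the FFI call itself is omitted; `compute_capability_to_sm` is a hypothetical helper, not part of pygpukit):

```rust
/// Flatten the (major, minor) compute capability reported by
/// cudaGetDeviceProperties into the SM number used by the dispatcher,
/// e.g. (8, 6) on an RTX 3090 becomes 86.
fn compute_capability_to_sm(major: u32, minor: u32) -> u32 {
    major * 10 + minor
}

fn main() {
    // Values as reported by cudaDeviceProp::major / ::minor.
    println!("A100 -> SM {}", compute_capability_to_sm(8, 0));
    println!("H100 -> SM {}", compute_capability_to_sm(9, 0));
}
```

This lookup needs to happen only once per device at initialization, so the cost of the `cudaGetDeviceProperties` call is negligible.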
