
PyGPUkit v0.2.3 — TF32 TensorCore GEMM Implementation (22–30 TFLOPS Target) #41

@m96-chan


🎯 Summary

This milestone introduces TF32 TensorCore acceleration for Ampere+ GPUs (RTX 30-series, A100, H100).
The goal is to surpass the current FP32 FMA kernel (18.3 TFLOPS) and reach:

22–30 TFLOPS on RTX 3090 Ti

40–75 TFLOPS on A100

90–150 TFLOPS on H100 (TF32x3)

This work prepares the foundation for FP16/BF16 and full LLM inference support.

📌 Goals

| Item | Target |
| --- | --- |
| Implement TF32 TensorCore GEMM | ✔ |
| Integrate into PyGPUkit kernel dispatcher | ✔ |
| Achieve 22–30 TFLOPS on 3090 Ti | ✔ (target) |
| Maintain deterministic correctness | ✔ |
| Provide Nsight Compute profiles | Optional |
| A100/H100-scalable design | ✔ |
🧠 Technical Requirements
TF32 TensorCore (`mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32`)

The WMMA API is not used; we issue mma.sync directly via inline PTX

Tile base: BM=128, BN=128, BK=32

Warp tile: 16×16 output per warp

Warp groups: 8 warps per block

Fragment A/B: TF32 inputs, accumulated in FP32

Shared memory swizzling required
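The warp-level mma.sync requirement above can be sketched as follows. This is an illustrative fragment, not the final kernel: the helper names (f32_to_tf32, mma_tf32_16x8x8) are ours, though the PTX mnemonics and fragment sizes (4 A registers, 2 B registers, 4 FP32 accumulators per thread) follow the PTX ISA for the m16n8k8 TF32 MMA shape.

```cuda
#include <cstdint>

// Convert an FP32 value to TF32 (round-to-nearest-even into the
// 10-bit TF32 mantissa), as required before feeding A/B fragments.
__device__ inline uint32_t f32_to_tf32(float x) {
    uint32_t r;
    asm volatile("cvt.rna.tf32.f32 %0, %1;\n" : "=r"(r) : "f"(x));
    return r;
}

// One warp-level m16n8k8 TF32 MMA: C += A * B, accumulated in FP32.
// a[4]: this thread's 4 TF32 elements of the 16x8 A tile (row-major),
// b[2]: this thread's 2 TF32 elements of the 8x8 B tile (col-major),
// c[4]: this thread's 4 FP32 accumulators, updated in place.
__device__ inline void mma_tf32_16x8x8(const uint32_t (&a)[4],
                                       const uint32_t (&b)[2],
                                       float (&c)[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));
}
```

The 16×16 warp tile in the requirements would be built from several of these 16×8 MMAs per warp, with ldmatrix.sync filling the a/b registers from swizzled shared memory.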

🧱 Tasks

  1. Kernel Design (TF32 MMA Kernel)

File: native/ops/matmul_f32_tf32.cuh

Implement warp-level mma.sync microkernel

Implement ldmatrix.sync for A/B fragment loading

Implement 4-stage cp.async pipeline

Shared memory swizzling layout for conflict-free loads

BK=32 tuning for Ampere SMs

128×128 block decomposition (8 warps/block)
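As a sketch of the swizzling task, assuming a BK=32 float tile and 16-byte load groups, one common conflict-free layout XORs the 16-byte group index with the low bits of the row; the function name and the exact scheme here are illustrative, not the committed design:

```cpp
#include <cstddef>

// Map a logical (row, col) element of a BM x BK float tile to a swizzled
// shared-memory offset. Elements move in 4-float (16-byte) groups; XOR-ing
// the group index with (row & 7) spreads the 8 rows touched by one
// ldmatrix across all 8 bank groups, so the loads are conflict-free.
constexpr int BK = 32;    // floats per tile row (128 bytes = 32 banks)
constexpr int GROUP = 4;  // floats per 16-byte group

inline std::size_t swizzled_offset(int row, int col) {
    int group    = col / GROUP;        // which 16B group in the row (0..7)
    int in_group = col % GROUP;
    int sw_group = group ^ (row & 7);  // XOR swizzle
    return static_cast<std::size_t>(row) * BK + sw_group * GROUP + in_group;
}
```

Because XOR with a constant is a bijection on 0..7, each row still occupies exactly BK distinct slots, while any fixed group read across 8 consecutive rows lands in 8 distinct bank groups.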

  2. Kernel Dispatcher Integration

File: native/ops/matmul_dispatch.cpp

Add selection logic:

```cpp
if (gpu_arch >= Ampere && dtype == float32 && allow_tensorcore) {
    if (M >= 1024 && N >= 1024 && K >= 1024) return TensorCoreTF32;
}
```

Add runtime flag PYGPUKIT_ALLOW_TF32=1

Add Python API: matmul(..., use_tf32=True)

Fallback to FMA kernel for small matrices
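A host-side sketch of the full selection logic, including the small-matrix fallback; the type and function names here are hypothetical, not the actual matmul_dispatch.cpp API:

```cpp
#include <cstdint>

enum class MatmulKernel { FmaF32, TensorCoreTF32 };

struct DispatchCtx {
    int  sm_major;       // compute capability major (Ampere = 8)
    bool dtype_is_f32;
    bool allow_tf32;     // PYGPUKIT_ALLOW_TF32=1 or use_tf32=True
};

// Take the TF32 TensorCore path only on Ampere+ for large FP32 problems;
// everything else falls back to the existing FMA kernel.
inline MatmulKernel select_matmul_kernel(const DispatchCtx& ctx,
                                         std::int64_t M, std::int64_t N,
                                         std::int64_t K) {
    bool large = M >= 1024 && N >= 1024 && K >= 1024;
    if (ctx.sm_major >= 8 && ctx.dtype_is_f32 && ctx.allow_tf32 && large)
        return MatmulKernel::TensorCoreTF32;
    return MatmulKernel::FmaF32;  // small or unsupported: FMA fallback
}
```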

  3. Rust Backend Updates

Files:
rust/pygpukit-core/src/ops/matmul.rs
rust/pygpukit-python/src/matmul.rs

Add TF32 kernel ID

Add DeviceCapabilities { tensorcore: bool }

Add execution path for tensorcore kernels

Pass alignment & tiling info to native layer

  4. Testing & Validation

File: tests/test_tf32.py

Correctness test (relative error < 1e-2)

Performance test thresholds:

| Size | Minimum TFLOPS |
| --- | --- |
| 2048² | 15 |
| 4096² | 22 |
| 8192² | 28 |

Compare against PyTorch/cuBLAS

Stress test non-square matrices
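The thresholds above assume the usual 2·M·N·K FLOP count for a GEMM; a small helper (name is ours) makes the conversion from measured time explicit:

```cpp
#include <cstdint>

// Effective TFLOPS for an M x N x K GEMM that took `ms` milliseconds.
// A GEMM performs 2*M*N*K floating-point operations: one multiply and
// one add per inner-product term.
inline double gemm_tflops(std::int64_t M, std::int64_t N, std::int64_t K,
                          double ms) {
    double flops = 2.0 * double(M) * double(N) * double(K);
    return flops / (ms * 1e-3) / 1e12;
}
```

For example, a 4096² TF32 matmul must finish in roughly 6.2 ms or less to clear the 22 TFLOPS bar.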

  5. A100/H100 Scaling Support

Query SM count, tensor core per SM

Tune warp-tile count per block

Optional: H100 TF32x3 mode (3× throughput)

Provide benchmark scripts for each GPU family
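A minimal sketch of grid planning from the SM count, for tuning across GPU families; the GridPlan type and function are ours, while the SM counts in the comment are the published figures for each part:

```cpp
#include <cstdint>

// Given the device SM count and the problem size, compute how many
// 128x128 output blocks the grid launches and a simple "waves" estimate.
// Illustrative SM counts: 84 (RTX 3090 Ti), 108 (A100), 132 (H100 SXM).
struct GridPlan { std::int64_t blocks; double waves; };

inline GridPlan plan_grid(std::int64_t M, std::int64_t N, int sm_count,
                          int blocks_per_sm = 1) {
    const std::int64_t BM = 128, BN = 128;
    std::int64_t blocks = ((M + BM - 1) / BM) * ((N + BN - 1) / BN);
    double waves = double(blocks) / double(sm_count * blocks_per_sm);
    return {blocks, waves};  // a near-integer wave count wastes fewer SMs
}
```

A tail wave that fills only a fraction of the SMs is one reason the same kernel needs per-family tuning of warp-tile count and blocks per SM.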

  6. Documentation

File: docs/tf32_tensorcore_design.md

Include:

Ampere Tensor Core architecture

mma.sync instruction format

tiling diagrams

shared memory layout

performance tuning strategy

🛰️ Milestone Deliverables

✔ TF32 TensorCore GEMM kernel

✔ Unified matmul dispatcher

✔ Performance guarantees:

22+ TFLOPS on 3090Ti

40+ TFLOPS on A100

90+ TFLOPS on H100

✔ Benchmarks & Nsight Compute screenshots

✔ Documentation

✔ Python API exposure

🧭 Stretch Goals

Hopper TF32x3 mode

CUTLASS-like epilogue fusion (bias + GELU)

Auto-tuner for BM/BN/BK

Kernel cache for compiled PTX

🏁 Acceptance Criteria

The milestone is considered done when:

✔ All correctness tests pass

✔ 4096×4096 TF32 matmul > 22 TFLOPS

✔ No correctness drift observed over 500 iterations

✔ Kernel dispatcher selects TF32 path appropriately

✔ Documentation is complete

⚡ Remarks

TF32 TensorCore is the fastest path to exceed cuBLAS FP32 performance and a necessary foundation for LLM inference (FP16/BF16 matmul).
This milestone unlocks the architecture for:

FlashAttention

Transformer block kernels

Fused ops (bias+activation)

QKV projection kernels

After this milestone, PyGPUkit becomes a real inference-class GPU engine.
