🎯 Summary
This milestone introduces TF32 TensorCore acceleration for Ampere+ GPUs (RTX 30-series, A100, H100).
The goal is to surpass the current FP32 FMA kernel (18.3 TFLOPS) and reach:
- 22–30 TFLOPS on RTX 3090 Ti
- 40–75 TFLOPS on A100
- 90–150 TFLOPS on H100 (TF32x3)
This work prepares the foundation for FP16/BF16 and full LLM inference support.
📌 Goals
| Item | Target |
| --- | --- |
| Implement TF32 TensorCore GEMM | ✔ |
| Integrate into PyGPUkit kernel dispatcher | ✔ |
| Achieve 22–30 TFLOPS on 3090 Ti | ✔ (target) |
| Maintain deterministic correctness | ✔ |
| Provide Nsight Compute profiles | Optional |
| A100/H100-scalable design | ✔ |
🧠 Technical Requirements
TF32 TensorCore MMA (mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32)
WMMA API not used → we use inline PTX with mma.sync (see the sketch after this list)
Tile base: BM=128, BN=128, BK=32
Warp tile: 16×16 output per warp
Warp groups: 8 warps per block
Fragment A/B: TF32 inputs, accumulated in FP32
Shared memory swizzling required
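
Since WMMA is bypassed, the warp-level microkernel reduces to a single inline-PTX fragment op. A minimal sketch, assuming the A/B registers already hold TF32 bit patterns (e.g. produced by cvt.rna.tf32.f32); the helper name is ours:

```cuda
#include <cstdint>

// One warp-level TF32 MMA: D(16x8, f32) = A(16x8, tf32) * B(8x8, tf32) + C.
// Per the PTX ISA register mapping, each thread in the warp carries
// 4 A registers, 2 B registers, and 4 FP32 C/D accumulators.
__device__ inline void mma_tf32_m16n8k8(float d[4], const uint32_t a[4],
                                        const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```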
🧱 Tasks
- Kernel Design (TF32 MMA Kernel)
File: native/ops/matmul_f32_tf32.cuh
Implement warp-level mma.sync microkernel
Implement ldmatrix.sync for A/B fragment loading
Implement 4-stage cp.async pipeline (building blocks sketched after this task)
Shared memory swizzling layout for conflict-free loads
BK=32 tuning for Ampere SMs
128×128 block decomposition (8 warps/block)
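
The cp.async stage of the pipeline boils down to three PTX wrappers. A minimal sketch, assuming 16-byte-aligned tiles (helper names are illustrative):

```cuda
#include <cstdint>

// Issue one 16-byte async copy from global to shared memory (Ampere+;
// the .cg variant requires exactly 16 bytes).
__device__ inline void cp_async_16B(void* smem_dst, const void* gmem_src) {
    uint32_t dst = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmem_src));
}

// Close the current copy group after issuing all copies for one stage.
__device__ inline void cp_async_commit() {
    asm volatile("cp.async.commit_group;\n");
}

// Block until at most N copy groups remain in flight; with 4 stages the
// consumer typically waits with N = 2 before reading the oldest stage.
template <int N>
__device__ inline void cp_async_wait() {
    asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
}
```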
- Kernel Dispatcher Integration
File: native/ops/matmul_dispatch.cpp
Add selection logic (a fuller sketch follows this list):
if (GPU >= Ampere && dtype == float32 && allow_tensorcore) {
    if (M >= 1024 && N >= 1024 && K >= 1024) return TensorCoreTF32;
}
Add runtime flag PYGPUKIT_ALLOW_TF32=1
Add Python API: matmul(..., use_tf32=True)
Fallback to FMA kernel for small matrices
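
A fuller host-side sketch of this policy; KernelId, select_matmul_kernel, and tf32_allowed are hypothetical names, and the real dispatcher in matmul_dispatch.cpp may differ:

```cuda
#include <algorithm>
#include <cstdlib>

enum class KernelId { FmaF32, TensorCoreTF32 };

// Runtime opt-in: PYGPUKIT_ALLOW_TF32=1 enables the tensor-core path.
static bool tf32_allowed() {
    const char* flag = std::getenv("PYGPUKIT_ALLOW_TF32");
    return flag != nullptr && flag[0] == '1';
}

KernelId select_matmul_kernel(int sm_major, int M, int N, int K) {
    const bool ampere_plus = sm_major >= 8;          // SM 8.0 = Ampere
    const bool large = std::min({M, N, K}) >= 1024;  // small sizes stay on FMA
    if (ampere_plus && tf32_allowed() && large)
        return KernelId::TensorCoreTF32;
    return KernelId::FmaF32;
}
```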
- Rust Backend Updates
Files:
rust/pygpukit-core/src/ops/matmul.rs
rust/pygpukit-python/src/matmul.rs
Add TF32 kernel ID
Add DeviceCapabilities { tensorcore: bool } (queried as sketched after this list)
Add execution path for tensorcore kernels
Pass alignment & tiling info to native layer
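
The tensorcore capability bit can come from a plain runtime query on the native side. A minimal sketch, assuming the Rust DeviceCapabilities struct mirrors this C ABI (the exported function name is hypothetical):

```cuda
#include <cuda_runtime.h>

struct DeviceCapabilities { bool tensorcore; };

extern "C" DeviceCapabilities pygpukit_query_capabilities(int device) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    // TF32 tensor cores require compute capability >= 8.0 (Ampere).
    return DeviceCapabilities{prop.major >= 8};
}
```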
- Testing & Validation
File: tests/test_tf32.py
Correctness test (max relative error < 1e-2 vs. FP32 reference; see the sketch after this task)
Performance test thresholds:
| Size | Minimum TFLOPS |
| --- | --- |
| 2048² | 15 |
| 4096² | 22 |
| 8192² | 28 |
Compare against PyTorch/cuBLAS
Stress test non-square matrices
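
For reference, the pass criterion in code: TF32 keeps ~10 mantissa bits, so a max relative error under 1e-2 against an FP32 reference is the bar. A host-side sketch (the actual test lives in tests/test_tf32.py):

```cuda
#include <algorithm>
#include <cmath>
#include <cstddef>

// Pass iff the max relative error vs. the FP32 reference stays below 1e-2.
bool check_rel_error(const float* ref, const float* out, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float denom = std::max(std::fabs(ref[i]), 1e-6f);
        worst = std::max(worst, std::fabs(out[i] - ref[i]) / denom);
    }
    return worst < 1e-2f;
}
```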
- A100/H100 Scaling Support
Query SM count and tensor cores per SM (see the sketch after this list)
Tune warp-tile count per block
Optional: H100 TF32x3 mode (3× throughput)
Provide benchmark scripts for each GPU family
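
Grid shape should come from a device query rather than constants, since SM counts vary across the target families (84 on a 3090 Ti, 108 on A100, 132 on H100 SXM). A sketch, with matmul_grid as an illustrative name:

```cuda
#include <cuda_runtime.h>

// One 128x128 output tile per block. sm_count fixes the size of a full
// wave of blocks, which is the per-family knob for warp-tile tuning.
dim3 matmul_grid(int M, int N, int device, int& blocks_per_wave) {
    int sm_count = 0;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);
    blocks_per_wave = sm_count;  // 84 (3090 Ti), 108 (A100), 132 (H100 SXM)
    return dim3((N + 127) / 128, (M + 127) / 128);
}
```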
- Documentation
File: docs/tf32_tensorcore_design.md
Include:
Ampere Tensor Core architecture
mma.sync instruction format
tiling diagrams
shared memory layout
performance tuning strategy
🛰️ Milestone Deliverables
✔ TF32 TensorCore GEMM kernel
✔ Unified matmul dispatcher
✔ Performance guarantees:
  - 22+ TFLOPS on 3090 Ti
  - 40+ TFLOPS on A100
  - 90+ TFLOPS on H100
✔ Benchmarks & Nsight Compute screenshots
✔ Documentation
✔ Python API exposure
🧭 Stretch Goals
Hopper TF32x3 mode
CUTLASS-like epilogue fusion (bias + GELU)
Auto-tuner for BM/BN/BK
Kernel cache for compiled PTX
🏁 Acceptance Criteria
The milestone is considered done when:
✔ All correctness tests pass
✔ 4096×4096 TF32 matmul > 22 TFLOPS
✔ No correctness drift observed over 500 iterations
✔ Kernel dispatcher selects TF32 path appropriately
✔ Documentation is complete
⚡ Remarks
TF32 TensorCore is the fastest path to exceed cuBLAS FP32 performance and a necessary foundation for LLM inference (FP16/BF16 matmul).
This milestone unlocks the architecture for:
- FlashAttention
- Transformer block kernels
- Fused ops (bias + activation)
- QKV projection kernels
PyGPUkit becomes a real inference-class GPU engine after this.