🎯 Summary
This milestone introduces TF32 TensorCore acceleration for Ampere+ GPUs (RTX 30-series, A100, H100).
The goal is to surpass the current FP32 FMA kernel (18.3 TFLOPS) and reach:
- 22–30 TFLOPS on RTX 3090 Ti
- 40–75 TFLOPS on A100
- 90–150 TFLOPS on H100 (TF32x3)
This work prepares the foundation for FP16/BF16 and full LLM inference support.
📌 Goals
| Item | Target |
| --- | --- |
| Implement TF32 TensorCore GEMM | ✔ |
| Integrate into PyGPUkit kernel dispatcher | ✔ |
| Achieve 22–30 TFLOPS on 3090 Ti | ✔ (target) |
| Maintain deterministic correctness | ✔ |
| Provide Nsight Compute profiles | Optional |
| A100/H100-scalable design | ✔ |
🧠 Technical Requirements
TF32 TensorCore MMA (mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32)
WMMA API not used → we use inline PTX with mma.sync (see the sketch after this list)
Tile base: BM=128, BN=128, BK=32
Warp tile: 16×16 output per warp
Warp groups: 8 warps per block
Fragment A/B: TF32 inputs, accumulated in FP32
Shared memory swizzling required
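
Since WMMA is bypassed, the warp-level microkernel reduces to a single inline-PTX fragment op. A minimal sketch, assuming the A/B registers already hold TF32 bit patterns (e.g. produced by cvt.rna.tf32.f32); the helper name is ours:

```cuda
#include <cstdint>

// One warp-level TF32 MMA: D(16x8, f32) = A(16x8, tf32) * B(8x8, tf32) + C.
// Per the PTX ISA register mapping, each thread in the warp carries
// 4 A registers, 2 B registers, and 4 FP32 C/D accumulators.
__device__ inline void mma_tf32_m16n8k8(float d[4], const uint32_t a[4],
                                        const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```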
🧱 Tasks
- Kernel Design (TF32 MMA Kernel)
File: native/ops/matmul_f32_tf32.cuh
Implement warp-level mma.sync microkernel
Implement ldmatrix.sync for A/B fragment loading
Implement 4-stage cp.async pipeline (building blocks sketched after this task)
Shared memory swizzling layout for conflict-free loads
BK=32 tuning for Ampere SMs
128×128 block decomposition (8 warps/block)
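
The cp.async stage of the pipeline boils down to three PTX wrappers. A minimal sketch, assuming 16-byte-aligned tiles (helper names are illustrative):

```cuda
#include <cstdint>

// Issue one 16-byte async copy from global to shared memory (Ampere+;
// the .cg variant requires exactly 16 bytes).
__device__ inline void cp_async_16B(void* smem_dst, const void* gmem_src) {
    uint32_t dst = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmem_src));
}

// Close the current copy group after issuing all copies for one stage.
__device__ inline void cp_async_commit() {
    asm volatile("cp.async.commit_group;\n");
}

// Block until at most N copy groups remain in flight; with 4 stages the
// consumer typically waits with N = 2 before reading the oldest stage.
template <int N>
__device__ inline void cp_async_wait() {
    asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
}
```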
- Kernel Dispatcher Integration
File: native/ops/matmul_dispatch.cpp
Add selection logic (a fuller sketch follows this list):
if (GPU >= Ampere && dtype == float32 && allow_tensorcore) {
    if (M >= 1024 && N >= 1024 && K >= 1024) return TensorCoreTF32;
}
Add runtime flag PYGPUKIT_ALLOW_TF32=1
Add Python API: matmul(..., use_tf32=True)
Fallback to FMA kernel for small matrices
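
A fuller host-side sketch of this policy; KernelId, select_matmul_kernel, and tf32_allowed are hypothetical names, and the real dispatcher in matmul_dispatch.cpp may differ:

```cuda
#include <algorithm>
#include <cstdlib>

enum class KernelId { FmaF32, TensorCoreTF32 };

// Runtime opt-in: PYGPUKIT_ALLOW_TF32=1 enables the tensor-core path.
static bool tf32_allowed() {
    const char* flag = std::getenv("PYGPUKIT_ALLOW_TF32");
    return flag != nullptr && flag[0] == '1';
}

KernelId select_matmul_kernel(int sm_major, int M, int N, int K) {
    const bool ampere_plus = sm_major >= 8;          // SM 8.0 = Ampere
    const bool large = std::min({M, N, K}) >= 1024;  // small sizes stay on FMA
    if (ampere_plus && tf32_allowed() && large)
        return KernelId::TensorCoreTF32;
    return KernelId::FmaF32;
}
```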
- Rust Backend Updates
Files:
rust/pygpukit-core/src/ops/matmul.rs
rust/pygpukit-python/src/matmul.rs
Add TF32 kernel ID
Add DeviceCapabilities { tensorcore: bool } (queried as sketched after this list)
Add execution path for tensorcore kernels
Pass alignment & tiling info to native layer
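
The tensorcore capability bit can come from a plain runtime query on the native side. A minimal sketch, assuming the Rust DeviceCapabilities struct mirrors this C ABI (the exported function name is hypothetical):

```cuda
#include <cuda_runtime.h>

struct DeviceCapabilities { bool tensorcore; };

extern "C" DeviceCapabilities pygpukit_query_capabilities(int device) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    // TF32 tensor cores require compute capability >= 8.0 (Ampere).
    return DeviceCapabilities{prop.major >= 8};
}
```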
- Testing & Validation
File: tests/test_tf32.py
Correctness test (max relative error < 1e-2 vs. FP32 reference; see the sketch after this task)
Performance test thresholds:
| Size | Minimum TFLOPS |
| --- | --- |
| 2048² | 15 |
| 4096² | 22 |
| 8192² | 28 |
Compare against PyTorch/cuBLAS
Stress test non-square matrices
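
For reference, the pass criterion in code: TF32 keeps ~10 mantissa bits, so a max relative error under 1e-2 against an FP32 reference is the bar. A host-side sketch (the actual test lives in tests/test_tf32.py):

```cuda
#include <algorithm>
#include <cmath>
#include <cstddef>

// Pass iff the max relative error vs. the FP32 reference stays below 1e-2.
bool check_rel_error(const float* ref, const float* out, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float denom = std::max(std::fabs(ref[i]), 1e-6f);
        worst = std::max(worst, std::fabs(out[i] - ref[i]) / denom);
    }
    return worst < 1e-2f;
}
```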
- A100/H100 Scaling Support
Query SM count and tensor cores per SM (see the sketch after this list)
Tune warp-tile count per block
Optional: H100 TF32x3 mode (3× throughput)
Provide benchmark scripts for each GPU family
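
Grid shape should come from a device query rather than constants, since SM counts vary across the target families (84 on a 3090 Ti, 108 on A100, 132 on H100 SXM). A sketch, with matmul_grid as an illustrative name:

```cuda
#include <cuda_runtime.h>

// One 128x128 output tile per block. sm_count fixes the size of a full
// wave of blocks, which is the per-family knob for warp-tile tuning.
dim3 matmul_grid(int M, int N, int device, int& blocks_per_wave) {
    int sm_count = 0;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);
    blocks_per_wave = sm_count;  // 84 (3090 Ti), 108 (A100), 132 (H100 SXM)
    return dim3((N + 127) / 128, (M + 127) / 128);
}
```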
- Documentation
File: docs/tf32_tensorcore_design.md
Include:
Ampere Tensor Core architecture
mma.sync instruction format
tiling diagrams
shared memory layout
performance tuning strategy
🛰️ Milestone Deliverables
✔ TF32 TensorCore GEMM kernel
✔ Unified matmul dispatcher
✔ Performance guarantees:
  - 22+ TFLOPS on 3090 Ti
  - 40+ TFLOPS on A100
  - 90+ TFLOPS on H100
✔ Benchmarks & Nsight Compute screenshots
✔ Documentation
✔ Python API exposure
🧭 Stretch Goals
Hopper TF32x3 mode
CUTLASS-like epilogue fusion (bias + GELU)
Auto-tuner for BM/BN/BK
Kernel cache for compiled PTX
🏁 Acceptance Criteria
The milestone is considered done when:
✔ All correctness tests pass
✔ 4096×4096 TF32 matmul > 22 TFLOPS
✔ No correctness drift observed over 500 iterations
✔ Kernel dispatcher selects TF32 path appropriately
✔ Documentation is complete
⚡ Remarks
TF32 TensorCore is the fastest path to exceed cuBLAS FP32 performance and a necessary foundation for LLM inference (FP16/BF16 matmul).
This milestone unlocks the architecture for:
- FlashAttention
- Transformer block kernels
- Fused ops (bias + activation)
- QKV projection kernels
PyGPUkit becomes a real inference-class GPU engine after this.