Overview
Continue TF32 TensorCore GEMM optimization beyond the v0.2.3 baseline (27.38 TFLOPS), using NVIDIA CUTLASS as a reference.
Current Status
- Achieved: 27.38 TFLOPS on RTX 3090 Ti (8192×8192)
- cuBLAS reference: ~59 TFLOPS (we reach ~46% of cuBLAS; ~54% gap remaining)
Remaining Optimization Opportunities
High Priority
- ldmatrix instruction for fragment loading
Medium Priority
- m16n8k4 instruction variant
Future / Architecture-Specific
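One high-priority opportunity is using the ldmatrix instruction for fragment loading. A hedged sketch (not from the codebase) of an inline-PTX wrapper: ldmatrix.sync.aligned.x4.m8n8.b16 loads four 8×8 16-bit tiles from shared memory, one 32-bit register per thread per tile. Since TF32 values are stored as 32 bits, the shared-memory layout must be arranged so the b16 tiles map onto the TF32 fragment layout, which is how CUTLASS employs it.

```cuda
// Hypothetical wrapper, assuming sm_80+ and a 16-byte-aligned shared-memory
// source. Loads four 8x8 b16 tiles; each thread receives one 32-bit
// register per tile.
__device__ inline void ldmatrix_x4(unsigned &r0, unsigned &r1,
                                   unsigned &r2, unsigned &r3,
                                   const void *smem_ptr) {
    // convert generic pointer to a shared-memory address for PTX
    unsigned addr =
        static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%0,%1,%2,%3}, [%4];\n"
        : "=r"(r0), "=r"(r1), "=r"(r2), "=r"(r3)
        : "r"(addr));
}
```

Compared with per-element lds loads, one ldmatrix.x4 issues a single instruction per warp for 512 bytes, reducing instruction count and address arithmetic in the inner loop.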
Key Observations from v0.2.3
- Register pressure is the primary limiter: the acc[2][8][4] accumulator alone is 2×8×4 = 64 registers per thread
- Shared memory limits occupancy: 37 KB/block → ~1 block/SM ≈ 16.7% occupancy
- High run-to-run variance (±1-2 TFLOPS) due to GPU boost clocks
Failed Attempts (Reference)
| Attempt | Result | Cause |
| --- | --- | --- |
| 3-stage pipeline | -28% | 50% more smem reduced occupancy |
| 512-thread config | -17% | Changed access patterns |
| BM=64 smaller tiles | -28% on 8192 | Reduced parallelism |
| Manual kk unroll | -6% | Increased register pressure |
| BK=8 for occupancy | -7% on 4096 | Doubled K iterations |
Target
- RTX 3090 Ti: 35+ TFLOPS (cuBLAS parity)
- A100: 60+ TFLOPS
- H100: 90+ TFLOPS (TF32x3)
Related Issues