Summary
Current TensorCore GEMM kernels (TF32/FP16/BF16) write results directly to global memory without any epilogue fusion. Common operations like bias addition and activation functions require separate kernel launches, causing unnecessary global memory round-trips.
Current Behavior
matmul(A, B) → write to C (GlobalMem)
add(C, bias) → read C, write C (GlobalMem)
relu(C) → read C, write C (GlobalMem)
3 kernel launches, 3 global memory write operations
Proposed Behavior
matmul_fused(A, B, bias, activation='relu') → write to C (GlobalMem)
1 kernel launch, 1 global memory write operation
Implementation Plan
Phase 1: Template-Based Epilogue Selection
Add template-based epilogue selection to existing kernels:
// In epilogue section
float val = acc[wm][wn][0];
val = epilogue_bias ? val + bias[col] : val; // optional bias
val = epilogue_relu ? fmaxf(val, 0.0f) : val; // optional ReLU
C[...] = __float2half(val);
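As a rough sketch of how the compile-time selection could look (assuming the kernels gain two bool template parameters; the names EpilogueBias, EpilogueRelu, and store_epilogue are placeholders for this example, and if constexpr requires building with -std=c++17):
#include <cuda_fp16.h>
// Each kernel instantiation bakes in exactly the epilogue it needs,
// so disabled paths cost nothing at runtime.
template <bool EpilogueBias, bool EpilogueRelu>
__device__ __forceinline__ void store_epilogue(__half* C, const float* bias,
                                               float acc_val, int row, int col, int ldc) {
    float val = acc_val;
    if constexpr (EpilogueBias) val += bias[col];        // optional bias, removed when false
    if constexpr (EpilogueRelu) val = fmaxf(val, 0.0f);  // optional ReLU, removed when false
    C[row * ldc + col] = __float2half(val);
}
// The TensorCore kernels would carry the same flags, e.g.
//   template <bool EpilogueBias, bool EpilogueRelu> __global__ void matmul_f16_tc(...);
// and the dispatch in basic.cu would select one of the four instantiations.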
Phase 2: Common Epilogues
- LinearEpilogue: C = alpha * A @ B + beta * C
- BiasEpilogue: C = A @ B + bias
- BiasReluEpilogue: C = relu(A @ B + bias)
- BiasGeluEpilogue: C = gelu(A @ B + bias)
Phase 3: Python API
# Option A: Explicit fused op
C = gpk.matmul_fused(A, B, bias=bias, activation='relu')
# Option B: Lazy fusion (JIT detects pattern)
C = gpk.relu(A @ B + bias) # Auto-fused at JIT level
Expected Performance Gain
For a typical transformer layer (matmul + bias + activation):
- Global memory traffic on C: 1 write instead of 3 writes plus 2 re-reads (roughly 3-5x less C traffic)
- Kernel launch overhead: ~3x reduction
- Estimated speedup: 10-30% for memory-bound cases
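As a rough worked example (assuming a 4096x4096 FP16 output, i.e. 32 MiB): the unfused sequence above touches C five times (3 writes + 2 reads, about 160 MiB of traffic), while the fused kernel writes it once (32 MiB). The end-to-end speedup is smaller than that ratio because the A/B loads and the GEMM math itself are unchanged, which is why 10-30% is a reasonable expectation for memory-bound shapes.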
Affected Files
native/ops/matmul_f32_tf32.cuh
native/ops/matmul_f16_bf16_tc.cuh
native/ops/basic.cu (dispatch logic)
src/pygpukit/ops/ (Python API)