[v0.2.7] Epilogue Fusion for TensorCore GEMM (bias + activation) #62

@m96-chan

Summary

Current TensorCore GEMM kernels (TF32/FP16/BF16) write results directly to global memory without any epilogue fusion. Common operations like bias addition and activation functions require separate kernel launches, causing unnecessary global memory round-trips.

Current Behavior

matmul(A, B) → write to C (GlobalMem)
add(C, bias) → read C, write C (GlobalMem)
relu(C)      → read C, write C (GlobalMem)

3 kernel launches, 3 global memory writes of C (plus 2 reads of C by the add and relu kernels)

Proposed Behavior

matmul_fused(A, B, bias, activation='relu') → write to C (GlobalMem)

1 kernel launch, 1 global memory write operation

Implementation Plan

Phase 1: Epilogue Function Pointers

Add template-based epilogue selection to existing kernels:

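// In the epilogue section; epilogue_bias / epilogue_relu are compile-time template flags
float val = acc[wm][wn][0];
val = epilogue_bias ? val + bias[col] : val;      // optional bias (one value per output column)
val = epilogue_relu ? fmaxf(val, 0.0f) : val;     // optional ReLU
C[...] = __float2half(val);                       // FP16 path; BF16 would use __float2bfloat16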

Phase 2: Common Epilogues

  • LinearEpilogue: C = alpha * A @ B + beta * C
  • BiasEpilogue: C = A @ B + bias
  • BiasReluEpilogue: C = relu(A @ B + bias)
  • BiasGeluEpilogue: C = gelu(A @ B + bias)
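As a correctness oracle for these epilogues, a slow pure-Python reference could look like the sketch below. It is not part of the proposed kernels: the function names are hypothetical, the bias is assumed to hold one value per output column (matching `bias[col]` in Phase 1), and GELU is assumed to be the exact erf-based form (the issue does not say whether a tanh approximation is intended).

```python
import math

def matmul(A, B):
    """Naive reference matmul on row-major lists of lists."""
    K, N = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(K)) for j in range(N)]
            for row in A]

def gelu(x):
    """Exact erf-based GELU (assumption: the issue does not fix the variant)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear_epilogue(A, B, C, alpha=1.0, beta=0.0):
    """LinearEpilogue: C = alpha * A @ B + beta * C."""
    acc = matmul(A, B)
    return [[alpha * acc[i][j] + beta * C[i][j] for j in range(len(acc[i]))]
            for i in range(len(acc))]

def bias_epilogue(A, B, bias, activation=None):
    """Bias / BiasRelu / BiasGelu: C = act(A @ B + bias), bias per output column."""
    acts = {None: lambda v: v, "relu": lambda v: max(v, 0.0), "gelu": gelu}
    act = acts[activation]  # KeyError on an unsupported activation name
    return [[act(v + bias[j]) for j, v in enumerate(row)]
            for row in matmul(A, B)]
```

A fused kernel's output can then be compared element-wise against `bias_epilogue(A, B, bias, "relu")` on small inputs, with a tolerance appropriate for TF32/FP16/BF16 rounding.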

Phase 3: Python API

# Option A: Explicit fused op
C = gpk.matmul_fused(A, B, bias=bias, activation='relu')

# Option B: Lazy fusion (JIT detects pattern)
C = gpk.relu(A @ B + bias)  # Auto-fused at JIT level
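To make Option B more concrete, here is a minimal sketch of how a lazy expression graph might recognize the `relu(A @ B + bias)` pattern and rewrite it into a single fused node. Everything in it (`Node`, `leaf`, `fuse`, the `"matmul_fused"` op name) is hypothetical and only illustrates the pattern match; it says nothing about how the actual JIT represents expressions.

```python
class Node:
    """Hypothetical lazy-expression node: an op name, inputs, and attributes."""
    def __init__(self, op, *inputs, **attrs):
        self.op = op
        self.inputs = inputs
        self.attrs = attrs

    def __matmul__(self, other):
        return Node("matmul", self, other)

    def __add__(self, other):
        return Node("add", self, other)

def leaf(name):
    """An input tensor placeholder."""
    return Node("input", name=name)

def relu(x):
    return Node("relu", x)

def fuse(node):
    """Rewrite relu(matmul(A, B) + bias) into one 'matmul_fused' node.

    Any other expression is returned unchanged.
    """
    if node.op != "relu":
        return node
    (inner,) = node.inputs
    if inner.op != "add":
        return node
    lhs, rhs = inner.inputs
    if lhs.op != "matmul":
        return node
    a, b = lhs.inputs
    return Node("matmul_fused", a, b, rhs, activation="relu")

A, B, bias = leaf("A"), leaf("B"), leaf("bias")
fused = fuse(relu(A @ B + bias))
print(fused.op, fused.attrs)
```

Option A avoids this matching step entirely, at the cost of callers having to opt in explicitly; a real matcher would also need to handle `bias + A @ B` and check that the intermediate result has no other consumers before fusing.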

Expected Performance Gain

For a typical transformer layer (matmul + bias + activation):

  • Global memory writes of C: 3x fewer (1 write vs 3 writes); the 2 intermediate reads of C are eliminated as well
  • Kernel launches: 3 reduced to 1
  • Estimated speedup: 10-30% for memory-bound cases
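The traffic saving on C can be worked out from the Current Behavior sequence. The sketch below counts only passes over the output matrix (reads of A, B, and the small bias vector are ignored), and the 4096 x 4096 FP16 shape is an arbitrary illustrative choice, not a measured workload.

```python
def c_traffic_bytes(M, N, elem_bytes, fused):
    """Bytes moved to/from global memory for the output matrix C only.

    Unfused: matmul writes C, add reads + writes C, relu reads + writes C,
    i.e. 5 passes over C. Fused: a single write.
    """
    passes = 1 if fused else 5
    return passes * M * N * elem_bytes

M, N, FP16_BYTES = 4096, 4096, 2  # illustrative shape, FP16 output
unfused = c_traffic_bytes(M, N, FP16_BYTES, fused=False)
fused = c_traffic_bytes(M, N, FP16_BYTES, fused=True)
print(unfused // 2**20, "MiB vs", fused // 2**20, "MiB")  # 160 MiB vs 32 MiB
```

Counting reads as well as writes, traffic on C drops by 5x rather than 3x. The end-to-end gain is smaller because the reads of A and B are unchanged, which is in line with the more modest overall speedup estimate.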

Affected Files

  • native/ops/matmul_f32_tf32.cuh
  • native/ops/matmul_f16_bf16_tc.cuh
  • native/ops/basic.cu (dispatch logic)
  • src/pygpukit/ops/ (Python API)
