Summary
Current TensorCore GEMM kernels (TF32/FP16/BF16) write results directly to global memory without any epilogue fusion. Common operations like bias addition and activation functions require separate kernel launches, causing unnecessary global memory round-trips.
Current Behavior
matmul(A, B) → write to C (GlobalMem)
add(C, bias) → read C, write C (GlobalMem)
relu(C) → read C, write C (GlobalMem)
3 kernel launches, 3 global memory write operations
Proposed Behavior
matmul_fused(A, B, bias, activation='relu') → write to C (GlobalMem)
1 kernel launch, 1 global memory write operation
Implementation Plan
Phase 1: Template-Based Epilogue Selection
Add template-based epilogue selection to existing kernels:
// In epilogue section
float val = acc[wm][wn][0];
val = epilogue_bias ? val + bias[col] : val; // optional bias
val = epilogue_relu ? fmaxf(val, 0.0f) : val; // optional ReLU
C[...] = __float2half(val);
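As a rough sketch of how the compile-time selection could look (assuming the kernels gain two bool template parameters; the names EpilogueBias, EpilogueRelu, and store_epilogue are placeholders for this example, and if constexpr requires building with -std=c++17):
#include <cuda_fp16.h>
// Each kernel instantiation bakes in exactly the epilogue it needs,
// so disabled paths cost nothing at runtime.
template <bool EpilogueBias, bool EpilogueRelu>
__device__ __forceinline__ void store_epilogue(__half* C, const float* bias,
                                               float acc_val, int row, int col, int ldc) {
    float val = acc_val;
    if constexpr (EpilogueBias) val += bias[col];        // optional bias, removed when false
    if constexpr (EpilogueRelu) val = fmaxf(val, 0.0f);  // optional ReLU, removed when false
    C[row * ldc + col] = __float2half(val);
}
// The TensorCore kernels would carry the same flags, e.g.
//   template <bool EpilogueBias, bool EpilogueRelu> __global__ void matmul_f16_tc(...);
// and the dispatch in basic.cu would select one of the four instantiations.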
Phase 2: Common Epilogues
- LinearEpilogue: C = alpha * A @ B + beta * C
- BiasEpilogue: C = A @ B + bias
- BiasReluEpilogue: C = relu(A @ B + bias)
- BiasGeluEpilogue: C = gelu(A @ B + bias)
Phase 3: Python API
# Option A: Explicit fused op
C = gpk.matmul_fused(A, B, bias=bias, activation='relu')
# Option B: Lazy fusion (JIT detects pattern)
C = gpk.relu(A @ B + bias) # Auto-fused at JIT level
Expected Performance Gain
For a typical transformer layer (matmul + bias + activation):
- Global memory traffic on C: 1 write instead of 3 writes plus 2 re-reads (roughly 3-5x less C traffic)
- Kernel launch overhead: ~3x reduction
- Estimated speedup: 10-30% for memory-bound cases
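As a rough worked example (assuming a 4096x4096 FP16 output, i.e. 32 MiB): the unfused sequence above touches C five times (3 writes + 2 reads, about 160 MiB of traffic), while the fused kernel writes it once (32 MiB). The end-to-end speedup is smaller than that ratio because the A/B loads and the GEMM math itself are unchanged, which is why 10-30% is a reasonable expectation for memory-bound shapes.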
Affected Files
native/ops/matmul_f32_tf32.cuh
native/ops/matmul_f16_bf16_tc.cuh
native/ops/basic.cu (dispatch logic)
src/pygpukit/ops/ (Python API)