Overview
Continue TF32 TensorCore GEMM optimization beyond the v0.2.3 baseline (27.38 TFLOPS), using NVIDIA CUTLASS as a reference.
Current Status
- Achieved: 27.38 TFLOPS on RTX 3090 Ti (8192×8192)
- cuBLAS reference: ~59 TFLOPS (we reach ~46% of cuBLAS; ~54% gap remaining)
Remaining Optimization Opportunities
High Priority
- ldmatrix instruction for fragment loading
Medium Priority
- m16n8k4 instruction variant
Future / Architecture-Specific
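One high-priority opportunity is using the ldmatrix instruction for fragment loading. A hedged sketch (not from the codebase) of an inline-PTX wrapper: ldmatrix.sync.aligned.x4.m8n8.b16 loads four 8×8 16-bit tiles from shared memory, one 32-bit register per thread per tile. Since TF32 values are stored as 32 bits, the shared-memory layout must be arranged so the b16 tiles map onto the TF32 fragment layout, which is how CUTLASS employs it.

```cuda
// Hypothetical wrapper, assuming sm_80+ and a 16-byte-aligned shared-memory
// source. Loads four 8x8 b16 tiles; each thread receives one 32-bit
// register per tile.
__device__ inline void ldmatrix_x4(unsigned &r0, unsigned &r1,
                                   unsigned &r2, unsigned &r3,
                                   const void *smem_ptr) {
    // convert generic pointer to a shared-memory address for PTX
    unsigned addr =
        static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%0,%1,%2,%3}, [%4];\n"
        : "=r"(r0), "=r"(r1), "=r"(r2), "=r"(r3)
        : "r"(addr));
}
```

Compared with per-element lds loads, one ldmatrix.x4 issues a single instruction per warp for 512 bytes, reducing instruction count and address arithmetic in the inner loop.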
Key Observations from v0.2.3
- Register pressure is the primary limiter: the acc[2][8][4] accumulator alone is 2×8×4 = 64 registers per thread
- Shared memory limits occupancy: 37 KB/block → ~1 block/SM ≈ 16.7% occupancy
- High run-to-run variance (±1-2 TFLOPS) due to GPU boost clocks
Failed Attempts (Reference)
| Attempt | Result | Cause |
| --- | --- | --- |
| 3-stage pipeline | -28% | 50% more smem reduced occupancy |
| 512-thread config | -17% | Changed access patterns |
| BM=64 smaller tiles | -28% on 8192 | Reduced parallelism |
| Manual kk unroll | -6% | Increased register pressure |
| BK=8 for occupancy | -7% on 4096 | Doubled K iterations |
Target
- RTX 3090 Ti: 35+ TFLOPS (cuBLAS parity)
- A100: 60+ TFLOPS
- H100: 90+ TFLOPS (TF32x3)
Related Issues