Merged

Commits (24)
7da4d10
feat(jit): add JIT compiler stabilization (#55)
m96-chan Dec 14, 2025
b7c5224
feat(cache): implement persistent kernel cache with LRU eviction (#54)
m96-chan Dec 14, 2025
8b0f8d2
feat(ops): add sub, div, exp, log, relu elementwise operations (#59)
m96-chan Dec 14, 2025
1e816b6
docs(claude): add TF32 optimization research for Issue #53
m96-chan Dec 14, 2025
e27f1fa
wip(tf32): add v2 kernel baseline (#53)
m96-chan Dec 14, 2025
4e158b1
wip(tf32-v2): WMMA API version - regression
m96-chan Dec 14, 2025
6c66943
wip(tf32-v2): PTX m16n8k8 with BK=32 - correct but slower
m96-chan Dec 14, 2025
d9c2d73
wip(tf32-v2): WMMA BK=16 with dynamic smem
m96-chan Dec 14, 2025
465358e
wip(tf32-v2): 3-stage pipeline - regression due to smem
m96-chan Dec 14, 2025
dd8842a
wip(tf32-v2): v1 baseline copy with float2 stores
m96-chan Dec 14, 2025
0b77172
wip(tf32-v2): various configurations tested
m96-chan Dec 14, 2025
b26a876
wip(tf32): preload A fragments optimization
m96-chan Dec 14, 2025
0f8524b
wip(tf32): v2 optimization attempts - ~29 TFLOPS achieved
m96-chan Dec 14, 2025
990a691
wip(#58): add FP16/BF16 support and reduction ops
m96-chan Dec 15, 2025
0264538
feat(#58): add operator overloads and astype method to GPUArray
m96-chan Dec 15, 2025
2a9b6d8
docs: add v0.2.5 demo and update README with FP16/BF16 benchmarks
m96-chan Dec 15, 2025
556c9c8
feat: add comprehensive benchmark script (benchmark_all.py)
m96-chan Dec 15, 2025
1a0df08
bench: comprehensive benchmark script with TF32 v1/v2 support
m96-chan Dec 15, 2025
ececb65
refactor: consolidate benchmark files into benchmark.py
m96-chan Dec 15, 2025
82afa7e
docs: update for v0.2.5 release
m96-chan Dec 15, 2025
4ba0716
fix: lint errors in benchmark.py
m96-chan Dec 15, 2025
07d6162
fix: lint errors across codebase
m96-chan Dec 15, 2025
739f18a
docs: add mandatory lint check rule to CLAUDE.md
m96-chan Dec 15, 2025
ec2a40e
fix: lint/mypy errors and add PR checklist to CLAUDE.md
m96-chan Dec 15, 2025
165 changes: 161 additions & 4 deletions CLAUDE.md
@@ -362,6 +362,99 @@ store_matrix_sync(C, c_frag, N, mem_row_major);

---

## TF32 Optimization Research (Issue #53)

### Current Performance Status

| Metric | Value |
|--------|-------|
| Current | **27.38 TFLOPS** (8192×8192) |
| RTX 3090 Ti TF32 Theoretical | ~40 TFLOPS |
| cuBLAS Reference | ~59 TFLOPS |
| Gap to cuBLAS | **~54%** (27.38 / ~59 ≈ 46% of cuBLAS) |
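
For reference, the TFLOPS figures follow the standard GEMM operation count of 2·N³ for an N×N×N matmul. A quick sanity check (the elapsed time below is illustrative, chosen to reproduce the current figure):

```python
# Standard GEMM FLOP count: 2 * N^3 operations for an N x N x N matmul.
N = 8192
elapsed_s = 0.0402  # illustrative timing that yields ~27.4 TFLOPS
tflops = 2 * N**3 / elapsed_s / 1e12
print(f"{tflops:.2f} TFLOPS")  # ~27.35
```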

### Current Implementation Parameters

```
Block Tile: BM=128, BN=128, BK=16
Warp Tile: WARP_TILES_M=2, WARP_TILES_N=8 (32×64 per warp)
MMA Instruction: mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32
Pipeline: 2-stage double buffering
Thread Block: 256 threads (8 warps)
Shared Memory: ~37KB/block → occupancy ~16.7%
```

### CUTLASS Optimization Techniques

#### 1. Swizzled Shared Memory Layout (High Priority)

Current implementation uses simple padding (`A_PAD=4, B_PAD=4`) but bank conflicts are not fully eliminated.

**CUTLASS Approach:**
```cpp
// XOR-based swizzle pattern
int store_column = (lane_id % 8) ^ (lane_id / 8);
```

- Store and Load phases use transposed index relationship
- XOR operation applied per 8×8 block unit
- Combined with `ldmatrix` for fully bank conflict-free access

**Key Insight:**
> "the indexing in the 'Loading from Shared Memory to Registers' slide is transposed from the indexing in 'Load from Global/Store to Shared' slide."
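
As an illustration only (function name and tile geometry are assumptions, not the actual kernel code), an XOR swizzle for a tile laid out as 32-float rows accessed in float4 vectors looks like this; the same mapping must be applied on both the store and the load path, which is exactly where the "transposed" indexing relationship comes from:

```cuda
// Illustrative XOR swizzle: the shared tile is viewed as rows of 32 floats
// (128 B = one full sweep of the 32 banks), accessed as float4 (16 B) vectors.
// Each row holds 8 vectors; XOR-ing the vector index with (row % 8) permutes
// them so a warp's accesses hit disjoint bank groups in both phases.
__device__ __forceinline__ int swizzled_float_offset(int row, int vec /*0..7*/) {
    int swizzled_vec = vec ^ (row & 7);   // XOR applied per 8x8 vector block
    return row * 32 + swizzled_vec * 4;   // offset in floats within the tile
}
// Both global->shared stores and shared->register loads must index through
// swizzled_float_offset(); only then are both phases bank-conflict free.
```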

#### 2. ldmatrix Instruction (High Priority)

Current implementation manually loads from shared memory to registers:
```cpp
// Current implementation
float a0 = smA[curr][tile_m + a_row_base][kk + a_col_base];
```

**CUTLASS Approach:**
- Uses `ldmatrix.sync.aligned.m8n8.x4.shared.b16`
- Single instruction loads four 8×8 matrices (entire warp)

**TF32 Limitation:**
> "ldmatrix cannot transpose 32-bit data. CUTLASS uses 32-bit shared memory load to load data from shared memory to the registers to do the transpose right before calling tf32 tensor core."
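
For 16-bit element tiles, a warp-wide `ldmatrix` load can be issued from inline PTX roughly as follows (a sketch with illustrative names; per the limitation above, the TF32 path uses ordinary 32-bit shared-memory loads instead):

```cuda
#include <cuda_fp16.h>

// Sketch: one warp loads four 8x8 b16 matrices with a single instruction.
// Each lane passes the shared-memory address of one 8-element row (lanes 0-7
// address matrix 0, lanes 8-15 matrix 1, and so on for .x4).
__device__ __forceinline__ void ldmatrix_x4_b16(unsigned frag[4],
                                                const __half* lane_ptr) {
    unsigned smem_addr =
        static_cast<unsigned>(__cvta_generic_to_shared(lane_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
        : "r"(smem_addr));
}
```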

#### 3. Multi-stage Pipeline (Medium-High Priority)

Current: 2-stage → CUTLASS default: **4-stage**

**Past Failed Attempt:**
> "3-stage pipeline: -28% (50% more smem reduced occupancy)"

**Considerations:**
- Trade-off between shared memory usage and occupancy
- RTX 3090 Ti: 100KB/SM available
- Current 37KB → 4-stage at ~74KB should fit
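
A sketch of the shared-memory budget and the launch-side opt-in a 4-stage variant would need (kernel name and launch shape are hypothetical; `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` is the standard CUDA API for exceeding the 48 KB default):

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the pipelined TF32 GEMM kernel.
__global__ void tf32_gemm_4stage(const float* A, const float* B, float* C, int N) {
    extern __shared__ float smem[];  // 4 stages of A- and B-tiles live here
    // ... pipelined mainloop omitted ...
}

void launch_4stage(const float* A, const float* B, float* C, int N) {
    // Per stage: (BM*BK + BK*BN) * 4 B = (128*16 + 16*128) * 4 B = 16 KB.
    // Four stages: ~64 KB before padding (~74 KB with), above the 48 KB
    // default, so the kernel must opt in to larger dynamic shared memory.
    size_t smem_bytes = 4 * (128 * 16 + 16 * 128) * sizeof(float);
    cudaFuncSetAttribute(tf32_gemm_4stage,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_bytes));
    dim3 grid((N + 127) / 128, (N + 127) / 128);
    tf32_gemm_4stage<<<grid, 256, smem_bytes>>>(A, B, C, N);
}
```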

### Recommended Implementation Order

| Priority | Optimization | Expected Gain | Difficulty |
|----------|-------------|---------------|------------|
| 1 | Swizzled shared memory layout | +10-15% | Medium |
| 2 | 4-stage pipeline (proper smem sizing) | +5-10% | Medium |
| 3 | Warp tile tuning (BM/BN/BK re-tuning) | +5-10% | Low |
| 4 | Epilogue fusion (bias + activation) | Memory reduction | Medium |

### Path to 35 TFLOPS

- Current: 27.38 TFLOPS (~78% of the 35 TFLOPS target)
- Swizzle + 4-stage: 32-34 TFLOPS expected
- Fine-tuning: 35+ TFLOPS

### Reference Materials

- [CUTLASS TF32 GEMM Example](https://github.com/NVIDIA/cutlass/blob/main/examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu)
- [CUTLASS Efficient GEMM Documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/efficient_gemm.html)
- [CUTLASS Swizzled Layouts Discussion](https://github.com/NVIDIA/cutlass/discussions/1130)
- [Understanding CUTLASS Permuted Shared Memory](https://forums.developer.nvidia.com/t/understanding-cutlass-permuted-shared-memory-layout/303697)
- [Dissecting Tensor Cores (Academic Paper)](https://arxiv.org/pdf/2206.02874)

---

## Development Workflow

### Kernel Development Cycle
@@ -372,12 +465,48 @@ Edit → Build → Validate → Benchmark → Commit

**Always commit after validation and benchmark, regardless of results.**

### Pre-Commit Checks (MANDATORY)

**Before EVERY commit, run these checks:**

```bash
# 1. Ruff lint check (auto-fix and format)
git ls-files "*.py" | xargs python -m ruff check --fix
git ls-files "*.py" | xargs python -m ruff format

# 2. Mypy type check
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined
```

**NEVER commit without passing ALL checks.** CI will reject PRs with lint/type errors.

### PR Checklist (MANDATORY before `gh pr create`)

Before creating a PR, verify ALL of the following:

```bash
# 1. Lint passes
git ls-files "*.py" | xargs python -m ruff check

# 2. Mypy passes
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined

# 3. Tests pass
python -m pytest tests/ -v

# 4. Benchmark runs (optional but recommended)
python benchmark.py --quick
```

**DO NOT create PR until all checks pass locally.**

### Commit Rules

1. **Run lint check before commit** (see above)
2. Commit after every validation/benchmark completion, regardless of outcome
3. Include benchmark results in commit message
4. Never proceed to next kernel edit until commit is complete
5. Never overwrite a working kernel without committing first

### Commit Message Format

@@ -410,6 +539,34 @@ If performance or correctness degrades:
- Track performance changes over time
- Preserve trial-and-error history

### Benchmarking

**Always use `benchmark.py` for performance measurement.**

```bash
# Full benchmark (all dtypes, all sizes)
python benchmark.py

# Quick mode (fewer warmup/iterations)
python benchmark.py --quick

# Specific sizes
python benchmark.py --sizes 4096 8192

# TF32 kernel version selection
python benchmark.py --tf32-version v1 # WMMA API
python benchmark.py --tf32-version v2 # PTX mma.sync (default)
```

**Output includes:**
- Kernel-only timing (no D2H copy overhead)
- Correctness verification (relative error)
- README.md-ready table format

**Environment Variables:**
- `PYGPUKIT_ALLOW_TF32=1` - Enable TF32 TensorCore
- `PYGPUKIT_TF32_V2=1` - Use PTX mma.sync kernel (default when TF32 enabled)
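
For example (an illustrative invocation combining these variables with the benchmark script):

```bash
# Force the TF32 TensorCore path with the v2 (PTX mma.sync) kernel
PYGPUKIT_ALLOW_TF32=1 PYGPUKIT_TF32_V2=1 python benchmark.py --sizes 8192
```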

---

## Design Principles
67 changes: 57 additions & 10 deletions README.md
@@ -21,6 +21,51 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea

---

## What's New in v0.2.5

### FP16 / BF16 Support
| Feature | Description |
|---------|-------------|
| **FP16 (float16)** | Half-precision floating point |
| **BF16 (bfloat16)** | Brain floating point (better dynamic range) |
| **FP32 Accumulation** | Numerical stability via FP32 intermediate |
| **Type Conversion** | `astype()` for seamless dtype conversion |

```python
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16 # BF16 matmul
result = c_bf16.astype(gpk.float32) # Convert back to FP32
```

### Reduction Operations
| Operation | Description |
|-----------|-------------|
| `gpk.sum(a)` | Sum of all elements |
| `gpk.mean(a)` | Mean of all elements |
| `gpk.max(a)` | Maximum element |
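
A minimal usage sketch of the reductions above (illustrative, reusing the `from_numpy` pattern from the earlier example):

```python
import numpy as np
import pygpukit as gpk

x = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float32))
total = gpk.sum(x)   # sum of all elements
avg = gpk.mean(x)    # mean of all elements
peak = gpk.max(x)    # maximum element
```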

### Operator Overloads
```python
c = a + b # Element-wise add
c = a - b # Element-wise subtract
c = a * b # Element-wise multiply
c = a / b # Element-wise divide
c = a @ b # Matrix multiplication
```

---

## What's New in v0.2.4

### Single-Binary Distribution
@@ -58,18 +103,19 @@ print(f"NVRTC Path: {gp.get_nvrtc_path()}") # Path to NVRTC DLL (if available)
|---------|------|------|--------------|
| **NumPy** (OpenBLAS) | ~0.8 TFLOPS | — | CPU only |
| **cuBLAS** | ~21 TFLOPS | ~59 TFLOPS | CUDA Toolkit |
| **PyGPUkit** | 16.7 TFLOPS | 29.7 TFLOPS | GPU drivers only |

> Built-in matmul kernels are pre-compiled. Driver-Only and Full (JIT) modes have identical matmul performance. JIT is only needed for custom kernels.

### PyGPUkit Performance by Matrix Size

| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|-------------|------|------|------|------|
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |

> **Note:** FP16/BF16 matmul uses simple kernels with FP32 accumulation. TensorCore optimization planned for future releases (see [Issue #60](https://github.com/m96-chan/PyGPUkit/issues/60)).

---

@@ -218,13 +264,14 @@ PyGPUkit/
| **v0.2.2** | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| **v0.2.3** | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| **v0.2.4** | **Single-binary distribution**, dynamic NVRTC, driver-only mode |
| **v0.2.5** | **FP16/BF16 support**, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |

### Planned

| Version | Goals |
|---------|-------|
| **v0.2.6** | FP16/BF16 TensorCore optimization, Multi-GPU detection |
| **v0.2.7** | Full API review, documentation, backward compatibility |
| **v0.3** | Triton backend, advanced ops (softmax, layernorm), MPS/MIG |

---