Merged

Commits (24)
7da4d10
feat(jit): add JIT compiler stabilization (#55)
m96-chan Dec 14, 2025
b7c5224
feat(cache): implement persistent kernel cache with LRU eviction (#54)
m96-chan Dec 14, 2025
8b0f8d2
feat(ops): add sub, div, exp, log, relu elementwise operations (#59)
m96-chan Dec 14, 2025
1e816b6
docs(claude): add TF32 optimization research for Issue #53
m96-chan Dec 14, 2025
e27f1fa
wip(tf32): add v2 kernel baseline (#53)
m96-chan Dec 14, 2025
4e158b1
wip(tf32-v2): WMMA API version - regression
m96-chan Dec 14, 2025
6c66943
wip(tf32-v2): PTX m16n8k8 with BK=32 - correct but slower
m96-chan Dec 14, 2025
d9c2d73
wip(tf32-v2): WMMA BK=16 with dynamic smem
m96-chan Dec 14, 2025
465358e
wip(tf32-v2): 3-stage pipeline - regression due to smem
m96-chan Dec 14, 2025
dd8842a
wip(tf32-v2): v1 baseline copy with float2 stores
m96-chan Dec 14, 2025
0b77172
wip(tf32-v2): various configurations tested
m96-chan Dec 14, 2025
b26a876
wip(tf32): preload A fragments optimization
m96-chan Dec 14, 2025
0f8524b
wip(tf32): v2 optimization attempts - ~29 TFLOPS achieved
m96-chan Dec 14, 2025
990a691
wip(#58): add FP16/BF16 support and reduction ops
m96-chan Dec 15, 2025
0264538
feat(#58): add operator overloads and astype method to GPUArray
m96-chan Dec 15, 2025
2a9b6d8
docs: add v0.2.5 demo and update README with FP16/BF16 benchmarks
m96-chan Dec 15, 2025
556c9c8
feat: add comprehensive benchmark script (benchmark_all.py)
m96-chan Dec 15, 2025
1a0df08
bench: comprehensive benchmark script with TF32 v1/v2 support
m96-chan Dec 15, 2025
ececb65
refactor: consolidate benchmark files into benchmark.py
m96-chan Dec 15, 2025
82afa7e
docs: update for v0.2.5 release
m96-chan Dec 15, 2025
4ba0716
fix: lint errors in benchmark.py
m96-chan Dec 15, 2025
07d6162
fix: lint errors across codebase
m96-chan Dec 15, 2025
739f18a
docs: add mandatory lint check rule to CLAUDE.md
m96-chan Dec 15, 2025
ec2a40e
fix: lint/mypy errors and add PR checklist to CLAUDE.md
m96-chan Dec 15, 2025
165 changes: 161 additions & 4 deletions CLAUDE.md
@@ -362,6 +362,99 @@ store_matrix_sync(C, c_frag, N, mem_row_major);

---

## TF32 Optimization Research (Issue #53)

### Current Performance Status

| Metric | Value |
|--------|-------|
| Current | **27.38 TFLOPS** (8192×8192) |
| RTX 3090 Ti TF32 Theoretical | ~40 TFLOPS |
| cuBLAS Reference | ~59 TFLOPS |
| Gap to cuBLAS | **~54%** (27.38 / ~59 ≈ 46% of cuBLAS) |
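
For reference, the TFLOPS figures follow the standard GEMM operation count of 2·N³ for an N×N×N matmul. A quick sanity check (the elapsed time below is illustrative, chosen to reproduce the current figure):

```python
# Standard GEMM FLOP count: 2 * N^3 operations for an N x N x N matmul.
N = 8192
elapsed_s = 0.0402  # illustrative timing that yields ~27.4 TFLOPS
tflops = 2 * N**3 / elapsed_s / 1e12
print(f"{tflops:.2f} TFLOPS")  # ~27.35
```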

### Current Implementation Parameters

```
Block Tile: BM=128, BN=128, BK=16
Warp Tile: WARP_TILES_M=2, WARP_TILES_N=8 (32×64 per warp)
MMA Instruction: mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32
Pipeline: 2-stage double buffering
Thread Block: 256 threads (8 warps)
Shared Memory: ~37KB/block → occupancy ~16.7%
```

### CUTLASS Optimization Techniques

#### 1. Swizzled Shared Memory Layout (High Priority)

Current implementation uses simple padding (`A_PAD=4, B_PAD=4`) but bank conflicts are not fully eliminated.

**CUTLASS Approach:**
```cpp
// XOR-based swizzle pattern
int store_column = (lane_id % 8) ^ (lane_id / 8);
```

- Store and Load phases use transposed index relationship
- XOR operation applied per 8×8 block unit
- Combined with `ldmatrix` for fully bank conflict-free access

**Key Insight:**
> "the indexing in the 'Loading from Shared Memory to Registers' slide is transposed from the indexing in 'Load from Global/Store to Shared' slide."
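
As an illustration only (function name and tile geometry are assumptions, not the actual kernel code), an XOR swizzle for a tile laid out as 32-float rows accessed in float4 vectors looks like this; the same mapping must be applied on both the store and the load path, which is exactly where the "transposed" indexing relationship comes from:

```cuda
// Illustrative XOR swizzle: the shared tile is viewed as rows of 32 floats
// (128 B = one full sweep of the 32 banks), accessed as float4 (16 B) vectors.
// Each row holds 8 vectors; XOR-ing the vector index with (row % 8) permutes
// them so a warp's accesses hit disjoint bank groups in both phases.
__device__ __forceinline__ int swizzled_float_offset(int row, int vec /*0..7*/) {
    int swizzled_vec = vec ^ (row & 7);   // XOR applied per 8x8 vector block
    return row * 32 + swizzled_vec * 4;   // offset in floats within the tile
}
// Both global->shared stores and shared->register loads must index through
// swizzled_float_offset(); only then are both phases bank-conflict free.
```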

#### 2. ldmatrix Instruction (High Priority)

Current implementation manually loads from shared memory to registers:
```cpp
// Current implementation
float a0 = smA[curr][tile_m + a_row_base][kk + a_col_base];
```

**CUTLASS Approach:**
- Uses `ldmatrix.sync.aligned.m8n8.x4.shared.b16`
- Single instruction loads four 8×8 matrices (entire warp)

**TF32 Limitation:**
> "ldmatrix cannot transpose 32-bit data. CUTLASS uses 32-bit shared memory load to load data from shared memory to the registers to do the transpose right before calling tf32 tensor core."
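
For 16-bit element tiles, a warp-wide `ldmatrix` load can be issued from inline PTX roughly as follows (a sketch with illustrative names; per the limitation above, the TF32 path uses ordinary 32-bit shared-memory loads instead):

```cuda
#include <cuda_fp16.h>

// Sketch: one warp loads four 8x8 b16 matrices with a single instruction.
// Each lane passes the shared-memory address of one 8-element row (lanes 0-7
// address matrix 0, lanes 8-15 matrix 1, and so on for .x4).
__device__ __forceinline__ void ldmatrix_x4_b16(unsigned frag[4],
                                                const __half* lane_ptr) {
    unsigned smem_addr =
        static_cast<unsigned>(__cvta_generic_to_shared(lane_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
        : "r"(smem_addr));
}
```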

#### 3. Multi-stage Pipeline (Medium-High Priority)

Current: 2-stage → CUTLASS default: **4-stage**

**Past Failed Attempt:**
> "3-stage pipeline: -28% (50% more smem reduced occupancy)"

**Considerations:**
- Trade-off between shared memory usage and occupancy
- RTX 3090 Ti: 100KB/SM available
- Current 37KB → 4-stage at ~74KB should fit
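
A sketch of the shared-memory budget and the launch-side opt-in a 4-stage variant would need (kernel name and launch shape are hypothetical; `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` is the standard CUDA API for exceeding the 48 KB default):

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the pipelined TF32 GEMM kernel.
__global__ void tf32_gemm_4stage(const float* A, const float* B, float* C, int N) {
    extern __shared__ float smem[];  // 4 stages of A- and B-tiles live here
    // ... pipelined mainloop omitted ...
}

void launch_4stage(const float* A, const float* B, float* C, int N) {
    // Per stage: (BM*BK + BK*BN) * 4 B = (128*16 + 16*128) * 4 B = 16 KB.
    // Four stages: ~64 KB before padding (~74 KB with), above the 48 KB
    // default, so the kernel must opt in to larger dynamic shared memory.
    size_t smem_bytes = 4 * (128 * 16 + 16 * 128) * sizeof(float);
    cudaFuncSetAttribute(tf32_gemm_4stage,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_bytes));
    dim3 grid((N + 127) / 128, (N + 127) / 128);
    tf32_gemm_4stage<<<grid, 256, smem_bytes>>>(A, B, C, N);
}
```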

### Recommended Implementation Order

| Priority | Optimization | Expected Gain | Difficulty |
|----------|-------------|---------------|------------|
| 1 | Swizzled shared memory layout | +10-15% | Medium |
| 2 | 4-stage pipeline (proper smem sizing) | +5-10% | Medium |
| 3 | Warp tile tuning (BM/BN/BK re-tuning) | +5-10% | Low |
| 4 | Epilogue fusion (bias + activation) | Memory reduction | Medium |

### Path to 35 TFLOPS

- Current: 27.38 TFLOPS (~78% of the 35 TFLOPS target)
- Swizzle + 4-stage: 32-34 TFLOPS expected
- Fine-tuning: 35+ TFLOPS

### Reference Materials

- [CUTLASS TF32 GEMM Example](https://github.com/NVIDIA/cutlass/blob/main/examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu)
- [CUTLASS Efficient GEMM Documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/efficient_gemm.html)
- [CUTLASS Swizzled Layouts Discussion](https://github.com/NVIDIA/cutlass/discussions/1130)
- [Understanding CUTLASS Permuted Shared Memory](https://forums.developer.nvidia.com/t/understanding-cutlass-permuted-shared-memory-layout/303697)
- [Dissecting Tensor Cores (Academic Paper)](https://arxiv.org/pdf/2206.02874)

---

## Development Workflow

### Kernel Development Cycle
@@ -372,12 +465,48 @@ Edit → Build → Validate → Benchmark → Commit

**Always commit after validation and benchmark, regardless of results.**

### Pre-Commit Checks (MANDATORY)

**Before EVERY commit, run these checks:**

```bash
# 1. Ruff lint check (auto-fix and format)
git ls-files "*.py" | xargs python -m ruff check --fix
git ls-files "*.py" | xargs python -m ruff format

# 2. Mypy type check
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined
```

**NEVER commit without passing ALL checks.** CI will reject PRs with lint/type errors.

### PR Checklist (MANDATORY before `gh pr create`)

Before creating a PR, verify ALL of the following:

```bash
# 1. Lint passes
git ls-files "*.py" | xargs python -m ruff check

# 2. Mypy passes
python -m mypy src/ --ignore-missing-imports --disable-error-code=union-attr --disable-error-code=no-redef --disable-error-code=no-any-return --disable-error-code=attr-defined

# 3. Tests pass
python -m pytest tests/ -v

# 4. Benchmark runs (optional but recommended)
python benchmark.py --quick
```

**DO NOT create PR until all checks pass locally.**

### Commit Rules

1. **Run lint check before commit** (see above)
2. Commit after every validation/benchmark completion, regardless of outcome
3. Include benchmark results in commit message
4. Never proceed to next kernel edit until commit is complete
5. Never overwrite a working kernel without committing first

### Commit Message Format

@@ -410,6 +539,34 @@ If performance or correctness degrades:
- Track performance changes over time
- Preserve trial-and-error history

### Benchmarking

**Always use `benchmark.py` for performance measurement.**

```bash
# Full benchmark (all dtypes, all sizes)
python benchmark.py

# Quick mode (fewer warmup/iterations)
python benchmark.py --quick

# Specific sizes
python benchmark.py --sizes 4096 8192

# TF32 kernel version selection
python benchmark.py --tf32-version v1 # WMMA API
python benchmark.py --tf32-version v2 # PTX mma.sync (default)
```

**Output includes:**
- Kernel-only timing (no D2H copy overhead)
- Correctness verification (relative error)
- README.md-ready table format

**Environment Variables:**
- `PYGPUKIT_ALLOW_TF32=1` - Enable TF32 TensorCore
- `PYGPUKIT_TF32_V2=1` - Use PTX mma.sync kernel (default when TF32 enabled)
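
For example (an illustrative invocation combining these variables with the benchmark script):

```bash
# Force the TF32 TensorCore path with the v2 (PTX mma.sync) kernel
PYGPUKIT_ALLOW_TF32=1 PYGPUKIT_TF32_V2=1 python benchmark.py --sizes 8192
```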

---

## Design Principles
67 changes: 57 additions & 10 deletions README.md
@@ -21,6 +21,51 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea

---

## What's New in v0.2.5

### FP16 / BF16 Support
| Feature | Description |
|---------|-------------|
| **FP16 (float16)** | Half-precision floating point |
| **BF16 (bfloat16)** | Brain floating point (better dynamic range) |
| **FP32 Accumulation** | Numerical stability via FP32 intermediate |
| **Type Conversion** | `astype()` for seamless dtype conversion |

```python
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16 # BF16 matmul
result = c_bf16.astype(gpk.float32) # Convert back to FP32
```

### Reduction Operations
| Operation | Description |
|-----------|-------------|
| `gpk.sum(a)` | Sum of all elements |
| `gpk.mean(a)` | Mean of all elements |
| `gpk.max(a)` | Maximum element |
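
A minimal usage sketch of the reductions above (illustrative, reusing the `from_numpy` pattern from the earlier example):

```python
import numpy as np
import pygpukit as gpk

x = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float32))
total = gpk.sum(x)   # sum of all elements
avg = gpk.mean(x)    # mean of all elements
peak = gpk.max(x)    # maximum element
```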

### Operator Overloads
```python
c = a + b # Element-wise add
c = a - b # Element-wise subtract
c = a * b # Element-wise multiply
c = a / b # Element-wise divide
c = a @ b # Matrix multiplication
```

---

## What's New in v0.2.4

### Single-Binary Distribution
@@ -58,18 +103,19 @@ print(f"NVRTC Path: {gp.get_nvrtc_path()}") # Path to NVRTC DLL (if available)
|---------|------|------|--------------|
| **NumPy** (OpenBLAS) | ~0.8 TFLOPS | — | CPU only |
| **cuBLAS** | ~21 TFLOPS | ~59 TFLOPS | CUDA Toolkit |
| **PyGPUkit** | 16.7 TFLOPS | 29.7 TFLOPS | GPU drivers only |

> Built-in matmul kernels are pre-compiled. Driver-Only and Full (JIT) modes have identical matmul performance. JIT is only needed for custom kernels.

### PyGPUkit Performance by Matrix Size

| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|-------------|------|------|------|------|
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |

> **Note:** FP16/BF16 matmul uses simple kernels with FP32 accumulation. TensorCore optimization planned for future releases (see [Issue #60](https://github.com/m96-chan/PyGPUkit/issues/60)).

---

@@ -218,13 +264,14 @@ PyGPUkit/
| **v0.2.2** | Ampere SGEMM (cp.async, float4), 18 TFLOPS FP32 |
| **v0.2.3** | TF32 TensorCore (PTX mma.sync), 28 TFLOPS |
| **v0.2.4** | **Single-binary distribution**, dynamic NVRTC, driver-only mode |
| **v0.2.5** | **FP16/BF16 support**, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |

### Planned

| Version | Goals |
|---------|-------|
| **v0.2.6** | FP16/BF16 TensorCore optimization, Multi-GPU detection |
| **v0.2.7** | Full API review, documentation, backward compatibility |
| **v0.3** | Triton backend, advanced ops (softmax, layernorm), MPS/MIG |

---