
v0.2.5: FP16/BF16 support, TF32 optimization, reduction ops#61

Merged
m96-chan merged 24 commits into main from feature/v0.2.5-jit-stabilization
Dec 15, 2025

Conversation

@m96-chan
Owner

Summary

Release candidate for v0.2.5 with the following features:

New Features

  • FP16/BF16 Data Types: Half-precision and brain floating point support

    • gpk.float16, gpk.bfloat16 dtypes
    • FP32 accumulation for numerical stability
    • astype() method for dtype conversion
  • Reduction Operations:

    • gpk.sum(a) - Sum of all elements
    • gpk.mean(a) - Mean of all elements
    • gpk.max(a) - Maximum element
  • Operator Overloads:

    • a + b, a - b, a * b, a / b - Elementwise ops
    • a @ b - Matrix multiplication
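A quick usage sketch of this surface (a minimal example; the `pygpukit` import name and `from_numpy` constructor are assumptions, while the dtypes, `astype()`, operators, and reductions are as listed above):

```python
import numpy as np
import pygpukit as gpk  # import name assumed from the gpk.* prefix above

# from_numpy is a hypothetical constructor, used here only for illustration.
a = gpk.from_numpy(np.random.rand(512, 512).astype(np.float32))
b = gpk.from_numpy(np.random.rand(512, 512).astype(np.float32))

h = a.astype(gpk.float16)  # dtype conversion (gpk.bfloat16 works the same way)
c = a + b                  # elementwise ops: +, -, *, /
d = a @ b                  # matrix multiplication

total = gpk.sum(a)  # sum of all elements
avg = gpk.mean(a)   # mean of all elements
peak = gpk.max(a)   # maximum element
```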

Performance Improvements

  • TF32 v2 Kernel: PTX mma.sync implementation achieving ~30 TFLOPS (8192×8192)
  • JIT compiler stabilization with persistent kernel cache

Benchmark Results (RTX 3090 Ti)

| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|-------------|------|------|------|------|
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |

Note: FP16/BF16 use simple kernels. TensorCore optimization planned for v0.2.6 (Issue #60).

Documentation

  • Updated README.md with v0.2.5 features and benchmarks
  • Added benchmark.py usage instructions to CLAUDE.md
  • Created examples/demo_v025.py demo script

Test Results

  • 166 tests passed, 3 skipped

🤖 Generated with Claude Code

m96-chan and others added 24 commits December 14, 2025 23:23
## NVRTC Error Handling
- Add NvrtcErrorCode enum (C++ and Python)
- NvrtcError now includes error code and compilation log
- Expose compilation_log property in Python exceptions

## PTX ISA Version Detection & Fallback
- Add get_recommended_arch() for automatic architecture selection
- Add get_fallback_archs() for fallback architecture list
- Auto-retry PTX loading with lower architectures on ISA mismatch

## Retry Logic for Transient Failures
- Retry OutOfMemory, InternalError up to 3 times
- Exponential backoff (100ms, 200ms, 400ms)
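The backoff schedule is easy to picture; a minimal sketch of the same policy (illustrative only, not the library's internal code — the exception type checked here is an assumption):

```python
import time

TRANSIENT = ("OutOfMemory", "InternalError")  # the kinds retried, per the bullets above

def launch_with_retry(launch, max_retries=3):
    """Retry transient failures up to 3 times with exponential backoff
    (100ms, 200ms, 400ms), re-raising anything non-transient immediately."""
    delay = 0.1
    for attempt in range(max_retries + 1):
        try:
            return launch()
        except RuntimeError as exc:  # exception type is an assumption for this sketch
            if attempt == max_retries or not any(k in str(exc) for k in TRANSIENT):
                raise
            time.sleep(delay)
            delay *= 2.0
```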

## JIT Warmup System
- Add warmup() function for pre-initializing NVRTC
- Support background warmup with callback
- Add is_warmup_done(), get_warmup_error() for status
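A hedged usage sketch of these helpers (the function names come from the bullets above; the module path and keyword names are assumptions):

```python
import pygpukit as gpk  # module path for the warmup helpers is an assumption

# Synchronous warmup: pre-initialize NVRTC so the first real kernel
# launch doesn't pay the JIT startup cost.
gpk.warmup()

# Background warmup with a completion callback (keyword names assumed),
# then poll status before relying on the JIT path.
gpk.warmup(background=True, callback=lambda: print("JIT ready"))
if gpk.is_warmup_done() and gpk.get_warmup_error() is not None:
    print("warmup failed:", gpk.get_warmup_error())
```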

## Driver Version Documentation
- Add get_driver_requirements() returning min requirements
- Add check_driver_compatibility() for compatibility check
- Minimum: CUDA 11.0+, SM 8.0 (Ampere)+

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add disk-based PTX cache with architecture fingerprinting:

Rust implementation (pygpukit-core):
- PersistentCache: disk-based PTX storage with JSON index
- ArchFingerprint: GPU characteristics for cache key (SM version, memory, driver)
- LRU eviction by entry count and total size
- TTL-based expiration with auto cleanup
- Serde serialization for persistence

Python bindings (pygpukit-python):
- ArchFingerprint: GPU architecture fingerprint
- PersistentCacheConfig: cache directory, size limits, TTL
- PersistentCache: insert, get, remove, cleanup, stats
- PersistentEntry: cached PTX with metadata
- PersistentCacheStats: hit rate, evictions, errors

Features:
- Architecture-aware cache keys (SM version + driver version)
- Automatic directory creation and index management
- Size-based and count-based eviction policies
- Cross-session cache persistence

Tests: 5 new tests for persistent_cache (119 total)
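From Python, the bindings listed above compose roughly as follows (a sketch only: the import path, constructor arguments, and method signatures are all assumptions; `nvrtc_compile` is a placeholder):

```python
from pygpukit import ArchFingerprint, PersistentCache, PersistentCacheConfig

# Cache keys fold in the GPU fingerprint (SM version, memory, driver),
# so PTX compiled for one architecture is never served to another.
fp = ArchFingerprint()

config = PersistentCacheConfig(
    cache_dir="~/.cache/pygpukit",     # hypothetical default location
    max_entries=256,                   # count-based eviction bound
    max_size_bytes=512 * 1024 * 1024,  # size-based eviction bound
    ttl_seconds=7 * 24 * 3600,         # TTL-based expiration
)
cache = PersistentCache(config)

key = "matmul_tf32_v2"                 # illustrative kernel identity
entry = cache.get(key)
if entry is None:
    ptx = nvrtc_compile(source)        # placeholder for the NVRTC compile step
    cache.insert(key, ptx)
print(cache.stats())                   # hit rate, evictions, errors
```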

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements missing elementwise operations:
- Binary ops: sub (a - b), div (a / b) - all dtypes
- Unary ops: exp, log, relu - float32/float64 only

Each operation includes:
- CUDA kernel implementations (f32, f64, i32, i64 for binary)
- pybind11 bindings with in-place variants
- Python wrappers with CPU fallback
- dtype validation for unary ops
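A short illustration of the new ops (the `pygpukit` import and `from_numpy` constructor are assumed names, as in the sketch in the PR description; the op names and dtype rules are from this commit):

```python
import numpy as np
import pygpukit as gpk  # import name assumed; from_numpy is hypothetical

x = gpk.from_numpy(np.linspace(-2.0, 2.0, 8).astype(np.float32))
y = gpk.from_numpy(np.ones(8, dtype=np.float32))

d = (x - y) / y           # new binary ops: sub and div, all dtypes
e = gpk.exp(x)            # unary ops are float32/float64 only;
r = gpk.relu(gpk.log(e))  # other dtypes fail dtype validation
```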

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Research findings for CUTLASS-level TF32 kernel optimization:
- Swizzled shared memory layout (XOR-based bank conflict elimination)
- ldmatrix instruction usage and TF32 limitations
- Multi-stage pipeline considerations (4-stage vs current 2-stage)
- Recommended implementation order with expected gains
- Reference materials from NVIDIA CUTLASS and academic papers

Current: 27.38 TFLOPS → Target: 35+ TFLOPS
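The XOR swizzle referenced above can be sketched with index arithmetic alone; a toy model (the widths below are illustrative, not the kernel's actual layout):

```python
# With a plain layout, threads reading down a column hit the same shared
# memory bank on every row (an N-way conflict). XOR-ing the column index
# with a few bits of the row index spreads those accesses across banks.
BANKS = 32

def plain_bank(row: int, col: int) -> int:
    return col % BANKS

def swizzled_bank(row: int, col: int) -> int:
    return (col ^ (row % 8)) % BANKS  # 8-way XOR swizzle, a common choice

# Column access (col=0) across 8 rows:
print([plain_bank(r, 0) for r in range(8)])     # [0, 0, 0, 0, 0, 0, 0, 0] -> 8-way conflict
print([swizzled_bank(r, 0) for r in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7] -> conflict-free
```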

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

TF32 v2 kernel baseline (same structure as v1):
- BM=128, BN=128, BK=16
- 2-stage pipeline
- Padding for bank conflicts

Benchmark results (RTX 3090 Ti):
- 2048x2048: 11.12 TFLOPS
- 4096x4096: 20.46 TFLOPS
- 8192x8192: 29.12 TFLOPS

cuBLAS reference:
- 8192x8192: 41.79 TFLOPS

Current efficiency: 70% of cuBLAS
Target: 90% (37.6 TFLOPS)

Correctness: PASS (p99 < 2%)
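For reference, the TFLOPS figures throughout these benchmarks follow the standard GEMM flop count of 2·N³ for an N×N by N×N multiply:

```python
def matmul_tflops(n: int, seconds: float) -> float:
    """TFLOPS for an n x n by n x n matmul: 2*n^3 flops (multiply + add)."""
    return 2 * n**3 / seconds / 1e12

# e.g. an 8192^3 GEMM finishing in ~37.8 ms is about 29.1 TFLOPS,
# matching the 8192x8192 row above.
print(matmul_tflops(8192, 0.0378))  # ~29.09
```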

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.83 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 19.09 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 25.48 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: WMMA API is slower than PTX mma.sync (was ~29 TFLOPS).
Need to return to PTX with better optimizations.

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.36 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 17.86 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 24.26 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: BK=32 reduces occupancy. Need to optimize.

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.94 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 18.85 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 24.93 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.83 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 18.75 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 25.22 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: 3-stage uses too much smem, reduces occupancy.
v1 (2-stage) achieves 30.10 TFLOPS. Need different approach.

Benchmark results (RTX 3090 Ti):
- 2048x2048: 11.47 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 21.30 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 29.94 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Matches v1 baseline (30.10 TFLOPS).
Need 25% improvement to reach 90% cuBLAS target.

Summary of attempts:
- 256x128 tile: 26.85 TFLOPS (occupancy issue)
- 64x256 tile: 21.33 TFLOPS (too much B loading)
- 3-stage BK=8: 27.44 TFLOPS (too many K iterations)
- v1 baseline: 30.10 TFLOPS (best so far)

cuBLAS reference: 41.79 TFLOPS @ 8192
Target: 37.6 TFLOPS (90%)

Next: try Split-K parallelization
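Split-K partitions the K dimension across thread blocks so that more blocks stay busy on K-heavy shapes, at the cost of a final reduction over the partial products; a NumPy sketch of the decomposition:

```python
import numpy as np

def matmul_split_k(a: np.ndarray, b: np.ndarray, splits: int) -> np.ndarray:
    """Split-K GEMM: each slice computes a partial C over a chunk of K,
    then the partials are summed. On a GPU the slices run in separate
    blocks and the reduction is a cheap second pass."""
    k = a.shape[1]
    bounds = np.linspace(0, k, splits + 1, dtype=int)
    partials = [a[:, s:e] @ b[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)

a = np.random.rand(64, 128).astype(np.float32)
b = np.random.rand(128, 32).astype(np.float32)
assert np.allclose(matmul_split_k(a, b, splits=4), a @ b, rtol=1e-4, atol=1e-5)
```
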
Benchmark results (RTX 3090 Ti):
- 2048x2048: 12.16 TFLOPS
- 4096x4096: 21.64 TFLOPS
- 8192x8192: 30.45 TFLOPS (73% of cuBLAS)

Correctness: PASS (p99 rel error < 2%)
Target: 37.6 TFLOPS (90% of cuBLAS) - 24% gap remaining

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Multiple optimization approaches tried:
- Double nested pipeline (GMEM→SMEM + SMEM→RMEM)
- Preloaded A fragments
- L1 caching (cp.async.ca vs .cg)
- BK=32 (too much smem)
- 3-stage pipeline
- Register double buffering

Benchmark results (RTX 3090 Ti):
- 2048x2048: ~12 TFLOPS
- 4096x4096: ~21 TFLOPS
- 8192x8192: ~29 TFLOPS (69% of cuBLAS 41.79)

Correctness: PASS (p99 rel error < 2%)
Target: 37.6 TFLOPS (90% of cuBLAS)

Additional optimizations needed to reach 90%:
- ldmatrix for efficient fragment loading
- Swizzled shared memory
- Different tile configurations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Types:
- Add float16, bfloat16 to Python dtypes
- Add Float16, BFloat16 to C++ DataType enum

Elementwise ops (FP16/BF16):
- add, mul, sub, div kernels with FP32 intermediate

Matmul:
- Add matmul_f16_bf16.cuh with simple kernels (FP32 accumulation)

Reduction ops:
- Add sum, mean, max (Python API + C++ placeholders)

Status: WIP - needs build verification and testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add arithmetic operators: +, -, *, /, @
- Add astype() for dtype conversion
- Handle BF16 <-> FP32 conversion correctly
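Correct BF16 handling comes down to bit manipulation, since BF16 is the top 16 bits of an FP32 value; a NumPy sketch of the round-trip:

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 to bfloat16 (kept as uint16 bit patterns).
    BF16 keeps FP32's exponent range but only a 7-bit mantissa; real
    implementations usually round-to-nearest-even rather than truncate."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(bits: np.ndarray) -> np.ndarray:
    """Widen bfloat16 bit patterns back to float32 by shifting left 16 bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 3.14159, -2.5e7], dtype=np.float32)
print(bf16_bits_to_f32(f32_to_bf16_bits(x)))  # values agree to BF16's ~3 digits
```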

Test results (RTX 3090 Ti):
- FP16 elementwise: PASS
- BF16 elementwise: PASS
- FP16 matmul: PASS (rel error < 0.05)
- BF16 matmul: PASS (rel error < 0.05)
- Reduction ops (sum, mean, max): PASS

Benchmark (simple kernels, no TensorCore):
- FP16 4096x4096: 2.18 TFLOPS
- BF16 4096x4096: 2.16 TFLOPS
- FP32 4096x4096: 6.29 TFLOPS (reference)

Note: FP16/BF16 matmul uses naive kernels. TensorCore optimization planned.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add examples/demo_v025.py with full feature demonstration
- Update README.md with v0.2.5 features (FP16/BF16, reductions, operators)
- Add FP16/BF16 benchmark results to performance table
- Update roadmap: v0.2.5 released, v0.2.6+ planned

Demo output (RTX 3090 Ti):
- FP16/BF16 elementwise: PASS
- FP16/BF16 matmul: PASS
- Reduction ops: PASS

Benchmark (8192x8192):
- FP32: 12.7 TFLOPS
- TF32: 13.0 TFLOPS
- FP16: 2.3 TFLOPS (simple kernel)
- BF16: 2.2 TFLOPS (simple kernel)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmarks all dtypes (FP32, TF32, FP16, BF16) with:
- Correctness verification
- Performance measurement (median TFLOPS)
- README-compatible markdown output
- Mode detection (Driver-Only vs Full JIT)

Usage:
  python benchmark_all.py [--sizes 2048,4096,8192] [--quick]

Results (RTX 3090 Ti, Full mode):
| Size | FP32 | TF32 | FP16 | BF16 |
|------|------|------|------|------|
| 2048 | 13.2 | 13.3 | 2.4  | 2.4  |
| 4096 | 22.6 | 23.6 | 2.4  | 2.4  |
| 8192 | 30.3 | 30.2 | 2.4  | 2.3  |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Rewrote benchmark_all.py to use env vars for TF32 kernel selection
- Added --tf32-version v1|v2 option
- Clarified Driver-Only vs JIT modes (same matmul performance)
- Updated README.md with accurate kernel-only timing results

Benchmark results (RTX 3090 Ti):
- FP32: 9.6 / 14.7 / 16.7 TFLOPS (2k/4k/8k)
- TF32 v2: 13.2 / 22.8 / 29.7 TFLOPS (2k/4k/8k)
- FP16: ~2.4 TFLOPS
- BF16: ~2.3 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Renamed benchmark_all.py → benchmark.py
- Deleted redundant files: benchmark_tf32.py, benchmark_ampere.py, bench_tf32_v2.py
- Kept: benchmark_rust.py (scheduler), benchmark_pytorch.py (cuBLAS comparison)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- CLAUDE.md: add benchmark.py usage instructions
- README.md: update v0.2.5 highlights with TF32 v2 (~30 TFLOPS)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused sys import
- Fix bare except clause

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix import sorting (I001) in multiple files
- Remove unused imports (F401)
- Remove unused f-string prefixes (F541)
- Add per-file-ignores for examples/ and compiler.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix mypy error: add type annotation for 'converted' variable
- Format all tracked Python files with ruff
- Add comprehensive PR checklist (lint, mypy, tests, benchmark)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@m96-chan m96-chan deleted the feature/v0.2.5-jit-stabilization branch December 26, 2025 09:38