
v0.2.5: FP16/BF16 support, TF32 optimization, reduction ops#61

Merged
m96-chan merged 24 commits into main from feature/v0.2.5-jit-stabilization
Dec 15, 2025

Conversation

@m96-chan
Owner

Summary

Release candidate for v0.2.5 with the following features:

New Features

  • FP16/BF16 Data Types: Half-precision and brain floating point support

    • gpk.float16, gpk.bfloat16 dtypes
    • FP32 accumulation for numerical stability
    • astype() method for dtype conversion
  • Reduction Operations:

    • gpk.sum(a) - Sum of all elements
    • gpk.mean(a) - Mean of all elements
    • gpk.max(a) - Maximum element
  • Operator Overloads:

    • a + b, a - b, a * b, a / b - Elementwise ops
    • a @ b - Matrix multiplication
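A quick usage sketch of this surface (a minimal example; the `pygpukit` import name and `from_numpy` constructor are assumptions, while the dtypes, `astype()`, operators, and reductions are as listed above):

```python
import numpy as np
import pygpukit as gpk  # import name assumed from the gpk.* prefix above

# from_numpy is a hypothetical constructor, used here only for illustration.
a = gpk.from_numpy(np.random.rand(512, 512).astype(np.float32))
b = gpk.from_numpy(np.random.rand(512, 512).astype(np.float32))

h = a.astype(gpk.float16)  # dtype conversion (gpk.bfloat16 works the same way)
c = a + b                  # elementwise ops: +, -, *, /
d = a @ b                  # matrix multiplication

total = gpk.sum(a)  # sum of all elements
avg = gpk.mean(a)   # mean of all elements
peak = gpk.max(a)   # maximum element
```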

Performance Improvements

  • TF32 v2 Kernel: PTX mma.sync implementation achieving ~30 TFLOPS (8192×8192)
  • JIT compiler stabilization with persistent kernel cache

Benchmark Results (RTX 3090 Ti)

| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|-------------|------|------|------|------|
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |

Note: FP16/BF16 use simple kernels. TensorCore optimization planned for v0.2.6 (Issue #60).

Documentation

  • Updated README.md with v0.2.5 features and benchmarks
  • Added benchmark.py usage instructions to CLAUDE.md
  • Created examples/demo_v025.py demo script

Test Results

  • 166 tests passed, 3 skipped

🤖 Generated with Claude Code

m96-chan and others added 24 commits December 14, 2025 23:23
## NVRTC Error Handling
- Add NvrtcErrorCode enum (C++ and Python)
- NvrtcError now includes error code and compilation log
- Expose compilation_log property in Python exceptions

## PTX ISA Version Detection & Fallback
- Add get_recommended_arch() for automatic architecture selection
- Add get_fallback_archs() for fallback architecture list
- Auto-retry PTX loading with lower architectures on ISA mismatch

## Retry Logic for Transient Failures
- Retry OutOfMemory, InternalError up to 3 times
- Exponential backoff (100ms, 200ms, 400ms)
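The backoff schedule is easy to picture; a minimal sketch of the same policy (illustrative only, not the library's internal code — the exception type checked here is an assumption):

```python
import time

TRANSIENT = ("OutOfMemory", "InternalError")  # the kinds retried, per the bullets above

def launch_with_retry(launch, max_retries=3):
    """Retry transient failures up to 3 times with exponential backoff
    (100ms, 200ms, 400ms), re-raising anything non-transient immediately."""
    delay = 0.1
    for attempt in range(max_retries + 1):
        try:
            return launch()
        except RuntimeError as exc:  # exception type is an assumption for this sketch
            if attempt == max_retries or not any(k in str(exc) for k in TRANSIENT):
                raise
            time.sleep(delay)
            delay *= 2.0
```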

## JIT Warmup System
- Add warmup() function for pre-initializing NVRTC
- Support background warmup with callback
- Add is_warmup_done(), get_warmup_error() for status
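A hedged usage sketch of these helpers (the function names come from the bullets above; the module path and keyword names are assumptions):

```python
import pygpukit as gpk  # module path for the warmup helpers is an assumption

# Synchronous warmup: pre-initialize NVRTC so the first real kernel
# launch doesn't pay the JIT startup cost.
gpk.warmup()

# Background warmup with a completion callback (keyword names assumed),
# then poll status before relying on the JIT path.
gpk.warmup(background=True, callback=lambda: print("JIT ready"))
if gpk.is_warmup_done() and gpk.get_warmup_error() is not None:
    print("warmup failed:", gpk.get_warmup_error())
```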

## Driver Version Documentation
- Add get_driver_requirements() returning min requirements
- Add check_driver_compatibility() for compatibility check
- Minimum: CUDA 11.0+, SM 8.0 (Ampere)+

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add disk-based PTX cache with architecture fingerprinting:

Rust implementation (pygpukit-core):
- PersistentCache: disk-based PTX storage with JSON index
- ArchFingerprint: GPU characteristics for cache key (SM version, memory, driver)
- LRU eviction by entry count and total size
- TTL-based expiration with auto cleanup
- Serde serialization for persistence

Python bindings (pygpukit-python):
- ArchFingerprint: GPU architecture fingerprint
- PersistentCacheConfig: cache directory, size limits, TTL
- PersistentCache: insert, get, remove, cleanup, stats
- PersistentEntry: cached PTX with metadata
- PersistentCacheStats: hit rate, evictions, errors

Features:
- Architecture-aware cache keys (SM version + driver version)
- Automatic directory creation and index management
- Size-based and count-based eviction policies
- Cross-session cache persistence

Tests: 5 new tests for persistent_cache (119 total)
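From Python, the bindings listed above compose roughly as follows (a sketch only: the import path, constructor arguments, and method signatures are all assumptions; `nvrtc_compile` is a placeholder):

```python
from pygpukit import ArchFingerprint, PersistentCache, PersistentCacheConfig

# Cache keys fold in the GPU fingerprint (SM version, memory, driver),
# so PTX compiled for one architecture is never served to another.
fp = ArchFingerprint()

config = PersistentCacheConfig(
    cache_dir="~/.cache/pygpukit",     # hypothetical default location
    max_entries=256,                   # count-based eviction bound
    max_size_bytes=512 * 1024 * 1024,  # size-based eviction bound
    ttl_seconds=7 * 24 * 3600,         # TTL-based expiration
)
cache = PersistentCache(config)

key = "matmul_tf32_v2"                 # illustrative kernel identity
entry = cache.get(key)
if entry is None:
    ptx = nvrtc_compile(source)        # placeholder for the NVRTC compile step
    cache.insert(key, ptx)
print(cache.stats())                   # hit rate, evictions, errors
```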

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements missing elementwise operations:
- Binary ops: sub (a - b), div (a / b) - all dtypes
- Unary ops: exp, log, relu - float32/float64 only

Each operation includes:
- CUDA kernel implementations (f32, f64, i32, i64 for binary)
- pybind11 bindings with in-place variants
- Python wrappers with CPU fallback
- dtype validation for unary ops
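A short illustration of the new ops (the `pygpukit` import and `from_numpy` constructor are assumed names, as in the sketch in the PR description; the op names and dtype rules are from this commit):

```python
import numpy as np
import pygpukit as gpk  # import name assumed; from_numpy is hypothetical

x = gpk.from_numpy(np.linspace(-2.0, 2.0, 8).astype(np.float32))
y = gpk.from_numpy(np.ones(8, dtype=np.float32))

d = (x - y) / y           # new binary ops: sub and div, all dtypes
e = gpk.exp(x)            # unary ops are float32/float64 only;
r = gpk.relu(gpk.log(e))  # other dtypes fail dtype validation
```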

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Research findings for CUTLASS-level TF32 kernel optimization:
- Swizzled shared memory layout (XOR-based bank conflict elimination)
- ldmatrix instruction usage and TF32 limitations
- Multi-stage pipeline considerations (4-stage vs current 2-stage)
- Recommended implementation order with expected gains
- Reference materials from NVIDIA CUTLASS and academic papers

Current: 27.38 TFLOPS → Target: 35+ TFLOPS
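The XOR swizzle referenced above can be sketched with index arithmetic alone; a toy model (the widths below are illustrative, not the kernel's actual layout):

```python
# With a plain layout, threads reading down a column hit the same shared
# memory bank on every row (an N-way conflict). XOR-ing the column index
# with a few bits of the row index spreads those accesses across banks.
BANKS = 32

def plain_bank(row: int, col: int) -> int:
    return col % BANKS

def swizzled_bank(row: int, col: int) -> int:
    return (col ^ (row % 8)) % BANKS  # 8-way XOR swizzle, a common choice

# Column access (col=0) across 8 rows:
print([plain_bank(r, 0) for r in range(8)])     # [0, 0, 0, 0, 0, 0, 0, 0] -> 8-way conflict
print([swizzled_bank(r, 0) for r in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7] -> conflict-free
```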

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

TF32 v2 kernel baseline (same structure as v1):
- BM=128, BN=128, BK=16
- 2-stage pipeline
- Padding for bank conflicts

Benchmark results (RTX 3090 Ti):
- 2048x2048: 11.12 TFLOPS
- 4096x4096: 20.46 TFLOPS
- 8192x8192: 29.12 TFLOPS

cuBLAS reference:
- 8192x8192: 41.79 TFLOPS

Current efficiency: 70% of cuBLAS
Target: 90% (37.6 TFLOPS)

Correctness: PASS (p99 < 2%)
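For reference, the TFLOPS figures throughout these benchmarks follow the standard GEMM flop count of 2·N³ for an N×N by N×N multiply:

```python
def matmul_tflops(n: int, seconds: float) -> float:
    """TFLOPS for an n x n by n x n matmul: 2*n^3 flops (multiply + add)."""
    return 2 * n**3 / seconds / 1e12

# e.g. an 8192^3 GEMM finishing in ~37.8 ms is about 29.1 TFLOPS,
# matching the 8192x8192 row above.
print(matmul_tflops(8192, 0.0378))  # ~29.09
```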

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.83 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 19.09 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 25.48 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: WMMA API is slower than PTX mma.sync (was ~29 TFLOPS).
Need to return to PTX with better optimizations.

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.36 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 17.86 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 24.26 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: BK=32 reduces occupancy. Need to optimize.

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.94 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 18.85 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 24.93 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Benchmark results (RTX 3090 Ti):
- 2048x2048: 10.83 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 18.75 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 25.22 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Note: 3-stage uses too much smem, reduces occupancy.
v1 (2-stage) achieves 30.10 TFLOPS. Need different approach.

Benchmark results (RTX 3090 Ti):
- 2048x2048: 11.47 TFLOPS (cuBLAS: 30.60)
- 4096x4096: 21.30 TFLOPS (cuBLAS: 35.49)
- 8192x8192: 29.94 TFLOPS (cuBLAS: 41.79)

Correctness: PASS

Matches v1 baseline (30.10 TFLOPS).
Need 25% improvement to reach 90% cuBLAS target.

Summary of attempts:
- 256x128 tile: 26.85 TFLOPS (occupancy issue)
- 64x256 tile: 21.33 TFLOPS (too much B loading)
- 3-stage BK=8: 27.44 TFLOPS (too many K iterations)
- v1 baseline: 30.10 TFLOPS (best so far)

cuBLAS reference: 41.79 TFLOPS @ 8192
Target: 37.6 TFLOPS (90%)

Next: try Split-K parallelization
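Split-K partitions the K dimension across thread blocks so that more blocks stay busy on K-heavy shapes, at the cost of a final reduction over the partial products; a NumPy sketch of the decomposition:

```python
import numpy as np

def matmul_split_k(a: np.ndarray, b: np.ndarray, splits: int) -> np.ndarray:
    """Split-K GEMM: each slice computes a partial C over a chunk of K,
    then the partials are summed. On a GPU the slices run in separate
    blocks and the reduction is a cheap second pass."""
    k = a.shape[1]
    bounds = np.linspace(0, k, splits + 1, dtype=int)
    partials = [a[:, s:e] @ b[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)

a = np.random.rand(64, 128).astype(np.float32)
b = np.random.rand(128, 32).astype(np.float32)
assert np.allclose(matmul_split_k(a, b, splits=4), a @ b, rtol=1e-4, atol=1e-5)
```
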
Benchmark results (RTX 3090 Ti):
- 2048x2048: 12.16 TFLOPS
- 4096x4096: 21.64 TFLOPS
- 8192x8192: 30.45 TFLOPS (73% of cuBLAS)

Correctness: PASS (p99 rel error < 2%)
Target: 37.6 TFLOPS (90% of cuBLAS) - 24% gap remaining

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Multiple optimization approaches tried:
- Double nested pipeline (GMEM→SMEM + SMEM→RMEM)
- Preloaded A fragments
- L1 caching (cp.async.ca vs .cg)
- BK=32 (too much smem)
- 3-stage pipeline
- Register double buffering

Benchmark results (RTX 3090 Ti):
- 2048x2048: ~12 TFLOPS
- 4096x4096: ~21 TFLOPS
- 8192x8192: ~29 TFLOPS (69% of cuBLAS 41.79)

Correctness: PASS (p99 rel error < 2%)
Target: 37.6 TFLOPS (90% of cuBLAS)

Additional optimizations needed to reach 90%:
- ldmatrix for efficient fragment loading
- Swizzled shared memory
- Different tile configurations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Types:
- Add float16, bfloat16 to Python dtypes
- Add Float16, BFloat16 to C++ DataType enum

Elementwise ops (FP16/BF16):
- add, mul, sub, div kernels with FP32 intermediate

Matmul:
- Add matmul_f16_bf16.cuh with simple kernels (FP32 accumulation)

Reduction ops:
- Add sum, mean, max (Python API + C++ placeholders)

Status: WIP - needs build verification and testing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add arithmetic operators: +, -, *, /, @
- Add astype() for dtype conversion
- Handle BF16 <-> FP32 conversion correctly
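Correct BF16 handling comes down to bit manipulation, since BF16 is the top 16 bits of an FP32 value; a NumPy sketch of the round-trip:

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 to bfloat16 (kept as uint16 bit patterns).
    BF16 keeps FP32's exponent range but only a 7-bit mantissa; real
    implementations usually round-to-nearest-even rather than truncate."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(bits: np.ndarray) -> np.ndarray:
    """Widen bfloat16 bit patterns back to float32 by shifting left 16 bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 3.14159, -2.5e7], dtype=np.float32)
print(bf16_bits_to_f32(f32_to_bf16_bits(x)))  # values agree to BF16's ~3 digits
```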

Test results (RTX 3090 Ti):
- FP16 elementwise: PASS
- BF16 elementwise: PASS
- FP16 matmul: PASS (rel error < 0.05)
- BF16 matmul: PASS (rel error < 0.05)
- Reduction ops (sum, mean, max): PASS

Benchmark (simple kernels, no TensorCore):
- FP16 4096x4096: 2.18 TFLOPS
- BF16 4096x4096: 2.16 TFLOPS
- FP32 4096x4096: 6.29 TFLOPS (reference)

Note: FP16/BF16 matmul uses naive kernels. TensorCore optimization planned.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add examples/demo_v025.py with full feature demonstration
- Update README.md with v0.2.5 features (FP16/BF16, reductions, operators)
- Add FP16/BF16 benchmark results to performance table
- Update roadmap: v0.2.5 released, v0.2.6+ planned

Demo output (RTX 3090 Ti):
- FP16/BF16 elementwise: PASS
- FP16/BF16 matmul: PASS
- Reduction ops: PASS

Benchmark (8192x8192):
- FP32: 12.7 TFLOPS
- TF32: 13.0 TFLOPS
- FP16: 2.3 TFLOPS (simple kernel)
- BF16: 2.2 TFLOPS (simple kernel)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmarks all dtypes (FP32, TF32, FP16, BF16) with:
- Correctness verification
- Performance measurement (median TFLOPS)
- README-compatible markdown output
- Mode detection (Driver-Only vs Full JIT)

Usage:
  python benchmark_all.py [--sizes 2048,4096,8192] [--quick]

Results (RTX 3090 Ti, Full mode):
| Size | FP32 | TF32 | FP16 | BF16 |
|------|------|------|------|------|
| 2048 | 13.2 | 13.3 | 2.4  | 2.4  |
| 4096 | 22.6 | 23.6 | 2.4  | 2.4  |
| 8192 | 30.3 | 30.2 | 2.4  | 2.3  |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Rewrote benchmark_all.py to use env vars for TF32 kernel selection
- Added --tf32-version v1|v2 option
- Clarified Driver-Only vs JIT modes (same matmul performance)
- Updated README.md with accurate kernel-only timing results

Benchmark results (RTX 3090 Ti):
- FP32: 9.6 / 14.7 / 16.7 TFLOPS (2k/4k/8k)
- TF32 v2: 13.2 / 22.8 / 29.7 TFLOPS (2k/4k/8k)
- FP16: ~2.4 TFLOPS
- BF16: ~2.3 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Renamed benchmark_all.py → benchmark.py
- Deleted redundant files: benchmark_tf32.py, benchmark_ampere.py, bench_tf32_v2.py
- Kept: benchmark_rust.py (scheduler), benchmark_pytorch.py (cuBLAS comparison)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- CLAUDE.md: add benchmark.py usage instructions
- README.md: update v0.2.5 highlights with TF32 v2 (~30 TFLOPS)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused sys import
- Fix bare except clause

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix import sorting (I001) in multiple files
- Remove unused imports (F401)
- Remove unused f-string prefixes (F541)
- Add per-file-ignores for examples/ and compiler.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix mypy error: add type annotation for 'converted' variable
- Format all tracked Python files with ruff
- Add comprehensive PR checklist (lint, mypy, tests, benchmark)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@m96-chan m96-chan deleted the feature/v0.2.5-jit-stabilization branch December 26, 2025 09:38