
v0.2.6: CUTLASS backend, Multi-LLM concurrent execution #70

Merged
m96-chan merged 10 commits into main from feature/v0.2.6-fp16-tensorcore on Dec 15, 2025

Conversation

@m96-chan (Owner)

Summary

  • CUTLASS backend as default GEMM (31 TFLOPS TF32, 63 TFLOPS FP16/BF16)
  • Multi-LLM concurrent execution with stream isolation and VRAM budgeting
  • FP16/BF16 TensorCore support via CUTLASS
  • LLM module with GPT2-style transformer blocks

Changes

CUTLASS Integration

  • CUTLASS v4.3.0 as git submodule (third_party/cutlass)
  • Automatic TF32 for FP32 inputs (disable with PYGPUKIT_NO_TF32=1)
  • Native FP16/BF16 TensorCore kernels

Multi-LLM Scheduler

  • create_context(), context_session() API (usage sketch after this list)
  • Per-context CUDA streams
  • VRAM budgeting per model
  • Concurrent execution (not a speedup) for compute-bound workloads
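
A minimal usage sketch, assuming `create_context()` takes a per-model VRAM budget and `context_session()` scopes submitted work to that context's stream; the module path and the `vram_budget` keyword are illustrative, not the confirmed API:

```python
# Sketch only: create_context() / context_session() are the API names
# from this PR; the module path and vram_budget keyword are assumptions.
from pygpukit.scheduler import create_context, context_session

# One context per model: each gets its own CUDA stream and VRAM budget.
ctx_gpt2 = create_context(vram_budget=4 * 1024**3)    # assumed: bytes
ctx_distil = create_context(vram_budget=2 * 1024**3)

with context_session(ctx_gpt2):
    # Work submitted here runs on ctx_gpt2's stream, isolated from other
    # contexts; allocations count against this context's VRAM budget.
    ...
```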

Performance (RTX 3090 Ti, 8192×8192)

| dtype | TFLOPS |
|-------|--------|
| TF32  | 31+    |
| FP16  | 63     |
| BF16  | 63     |

Test plan

  • All 184 tests pass
  • Benchmark verified on RTX 3090 Ti
  • Multi-LLM demo runs correctly

🤖 Generated with Claude Code

m96-chan and others added 10 commits December 15, 2025 15:51
Implements FP16 and BF16 TensorCore matmul kernels achieving 50+ TFLOPS.

Implementation details:
- Uses mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 for FP16
- Uses mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 for BF16
- FP32 accumulation for numerical stability
- cp.async double-buffering pipeline
- BM=128, BN=128, BK=32 tile configuration
- 8 warps (256 threads) per block

Benchmark results (RTX 3090 Ti):
- FP16 2048x2048: 17.89 TFLOPS
- FP16 4096x4096: 37.81 TFLOPS
- FP16 8192x8192: 53.05 TFLOPS (71% of cuBLAS)
- BF16 2048x2048: 19.04 TFLOPS
- BF16 4096x4096: 31.19 TFLOPS
- BF16 8192x8192: 52.18 TFLOPS (63% of cuBLAS)

Correctness: PASS (normalized error < 2%)
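
For reference, the TFLOPS figures follow from the standard 2·M·N·K FLOP count for GEMM, and the correctness check compares against an FP32 reference. A NumPy sketch of both calculations (an illustration, not the repository's benchmark script):

```python
# Illustration of how TFLOPS and "normalized error" are computed;
# not the repository's actual benchmark harness.
import time
import numpy as np

M = N = K = 1024  # use 8192 to match the table (slow on CPU)
a = np.random.randn(M, K).astype(np.float16)
b = np.random.randn(K, N).astype(np.float16)

t0 = time.perf_counter()
c = a @ b                              # stand-in for the TensorCore kernel
dt = time.perf_counter() - t0
print(f"{2.0 * M * N * K / dt / 1e12:.3f} TFLOPS")  # 2 FLOPs per MAC

# Normalized error against an FP32 reference; < 2% counts as PASS.
ref = a.astype(np.float32) @ b.astype(np.float32)
err = np.linalg.norm(c.astype(np.float32) - ref) / np.linalg.norm(ref)
print(f"normalized error: {err:.4f}")
```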

Enabled via PYGPUKIT_ALLOW_FP16_TC=1 environment variable.
Requires matrix sizes divisible by (128, 128, 32).

Closes #60

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add "What's New in v0.2.6" section with FP16/BF16 TensorCore details
- Update performance comparison table (FP16: 53 TFLOPS, BF16: 52 TFLOPS)
- Update performance by matrix size table with TensorCore results
- Update roadmap to mark v0.2.6 TensorCore as completed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements a generic TensorCore kernel supporting arbitrary sizes:
- M, N >= 16
- K % 8 == 0

Uses mma.sync.aligned.m16n8k8 with boundary handling.

3-tier dispatch (sketched in Python after the list):
1. TC_FAST: M,N % 128 == 0, K % 32 == 0 → 53 TFLOPS
2. TC_GENERIC: M,N >= 16, K % 8 == 0 → ~19 TFLOPS
3. FALLBACK: any size → 2.4 TFLOPS
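
The tier selection reduces to simple divisibility checks; a Python model of the dispatch logic above (the real dispatcher lives in the CUDA backend):

```python
# Python model of the 3-tier dispatch; tier names and conditions are
# taken from this commit message, the real code is in the CUDA backend.
def select_matmul_tier(M: int, N: int, K: int) -> str:
    if M % 128 == 0 and N % 128 == 0 and K % 32 == 0:
        return "TC_FAST"      # fully tiled TensorCore path, ~53 TFLOPS
    if M >= 16 and N >= 16 and K % 8 == 0:
        return "TC_GENERIC"   # boundary-handling TensorCore path, ~19 TFLOPS
    return "FALLBACK"         # non-TensorCore kernel, ~2.4 TFLOPS

assert select_matmul_tier(8192, 8192, 8192) == "TC_FAST"
assert select_matmul_tier(4000, 4000, 4000) == "TC_GENERIC"  # 4000 % 128 != 0
assert select_matmul_tier(17, 17, 10) == "FALLBACK"          # K % 8 != 0
```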

Benchmark results (RTX 3090 Ti, FP16):
- 4000x4000 TC_GENERIC: 19.08 TFLOPS (8x faster than fallback)
- 3000x3000 TC_GENERIC: 18.52 TFLOPS
- 2000x2000 TC_GENERIC: 13.13 TFLOPS
- 1000x1000 TC_GENERIC:  4.36 TFLOPS

Correctness: PASS for all tested sizes (16x16 to 8192x8192)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split the monolithic basic.cu (2289 lines) into a modular structure:

- common/: shared types, device utils, error handling
- elementwise/: add, mul, sub, div kernels and dispatch
- unary/: exp, log, relu kernels and dispatch
- reduction/: sum, mean, max with warp-shuffle
- matmul/: dispatch and FP32 kernels (L2-opt, tiled)

Added ops.cuh as public API header. All 166 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## Multi-LLM Async Execution
- Multi-LLM scheduler for parallel AI model execution
- Per-context CUDA streams for stream isolation
- VRAM budgeting per model
- asyncio integration with `async with context_session()` (sketch after this list)
- 3.37x speedup (GPT-2 + DistilGPT-2 parallel benchmark)
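
A sketch of the asyncio integration, assuming `context_session()` works as an async context manager as described above; the module path and the inference coroutine are hypothetical placeholders:

```python
# Sketch of concurrent Multi-LLM execution; `async with context_session()`
# is from this commit, while run_inference() is a hypothetical placeholder
# for a model's forward pass, and the module path is assumed.
import asyncio
from pygpukit.scheduler import create_context, context_session

async def run_model(ctx, name: str) -> str:
    async with context_session(ctx):
        # Kernels launched in this block go to ctx's dedicated CUDA stream,
        # so the two models below never serialize on a shared default stream.
        return await run_inference(name)  # hypothetical inference coroutine

async def main() -> None:
    ctx_a = create_context()  # e.g. GPT-2
    ctx_b = create_context()  # e.g. DistilGPT-2
    results = await asyncio.gather(
        run_model(ctx_a, "gpt2"),
        run_model(ctx_b, "distilgpt2"),
    )
    print(results)

asyncio.run(main())
```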

## FP16/BF16 TensorCore Optimization
- FP16 TensorCore: 53 TFLOPS (22x faster than v0.2.5)
- BF16 TensorCore: 52 TFLOPS (22x faster than v0.2.5)
- PTX-level mma.sync.m16n8k16 instructions
- FP32 accumulation for numerical stability

## Neural Network Operations
- GELU activation (GPU-accelerated)
- LayerNorm (GPU-accelerated)
- Supports FP16/BF16/FP32/FP64 (NumPy reference sketch below)
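
For comparison with the GPU kernels, NumPy reference implementations of both ops; whether the kernels use the exact erf-based GELU or the tanh approximation below is an assumption:

```python
# NumPy reference for the GPU-accelerated ops above. The tanh-approximate
# GELU (used by GPT-2-style models) is an assumption about the kernel;
# an exact erf-based variant is also common.
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layernorm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
              eps: float = 1e-5) -> np.ndarray:
    # Normalize over the last axis, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```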

## LLM Support
- SafeTensors loader (Rust implementation)
- Tokenizer support
- Model weight loading utilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CUTLASS v4.3.0 is now the default matrix multiplication backend:
- TF32 TensorCore for FP32 inputs (31+ TFLOPS)
- FP16/BF16 native TensorCore (60+ TFLOPS)

Environment variables (usage sketched after this list):
- PYGPUKIT_NO_TF32=1: Use native FP32 kernel (full precision)
- PYGPUKIT_NO_CUTLASS=1: Disable CUTLASS entirely
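
A usage sketch for these switches; that they must be set before pygpukit initializes its backend (e.g. before import) is an assumption about load order:

```python
# Selecting the matmul backend via the environment variables listed
# above. Setting them before importing pygpukit is an assumption about
# when the backend reads them.
import os

os.environ["PYGPUKIT_NO_TF32"] = "1"       # native FP32 kernel, full precision
# os.environ["PYGPUKIT_NO_CUTLASS"] = "1"  # disable CUTLASS entirely

import pygpukit  # backend configured at import time (assumed)
```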

Build optimizations:
- Removed -maxrregcount=128 (was causing 15x slowdown)
- Disabled CUDA_SEPARABLE_COMPILATION

Performance (RTX 3090 Ti, 8192x8192):
- TF32: 31.6 TFLOPS (up from 27 TFLOPS)
- FP16: 63 TFLOPS
- BF16: 63 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All workflow files now check out with `submodules: recursive`
to include the CUTLASS dependency in builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add CUTLASS Backend section in v0.2.6 features
- Update performance numbers: TF32 31 TFLOPS, FP16/BF16 63 TFLOPS
- Update per-size performance table with CUTLASS results
- Simplify FP16/BF16 docs (no env var needed with CUTLASS)
- Update version history

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update README: Multi-LLM enables concurrent execution, not faster
  execution, for compute-bound workloads on a single GPU
- Rewrite demo_v026_multi_llm.py to use PyGPUkit native LLM module:
  - TransformerBlock, MLP, LayerNorm from pygpukit.llm
  - CUTLASS TF32 matmul + native gelu/layernorm
  - Realistic GPT2-style model structure (~100M params; sketched below)
- Remove misleading 3.37x speedup claim (was PyTorch-based)
- Clarify scheduler benefits: execution control, stream isolation,
  VRAM budgeting, concurrent safety
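
A sketch of what that GPT2-style structure could look like with the pygpukit.llm classes named above; the layer counts match GPT-2 small, while the constructor and call signatures are assumptions:

```python
# Illustrative sketch: TransformerBlock and LayerNorm are the pygpukit.llm
# names cited in this commit; hyperparameters, constructor signatures, and
# the call convention are assumptions (GPT-2 small: 12 layers, 768 hidden,
# 12 heads, ~100M params).
from pygpukit.llm import TransformerBlock, LayerNorm

blocks = [TransformerBlock(hidden=768, heads=12) for _ in range(12)]
final_ln = LayerNorm(768)

def forward(x):
    for block in blocks:
        x = block(x)   # attention + MLP; CUTLASS TF32 matmul inside
    return final_ln(x)
```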

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed demo_v026_real_models.py which used PyTorch/transformers
instead of PyGPUkit native operations. The native demo
(demo_v026_multi_llm.py) now properly demonstrates PyGPUkit's
LLM module with CUTLASS matmul.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
m96-chan merged commit 0555428 into main on Dec 15, 2025. 15 checks passed.
m96-chan deleted the feature/v0.2.6-fp16-tensorcore branch on December 26, 2025 at 09:38.