v0.2.6: CUTLASS backend, Multi-LLM concurrent execution #70
Merged
Implements FP16 and BF16 TensorCore matmul kernels achieving 50+ TFLOPS.

Implementation details:
- Uses `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32` for FP16
- Uses `mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32` for BF16
- FP32 accumulation for numerical stability
- `cp.async` double-buffering pipeline
- BM=128, BN=128, BK=32 tile configuration
- 8 warps (256 threads) per block

Benchmark results (RTX 3090 Ti):
- FP16 2048x2048: 17.89 TFLOPS
- FP16 4096x4096: 37.81 TFLOPS
- FP16 8192x8192: 53.05 TFLOPS (71% of cuBLAS)
- BF16 2048x2048: 19.04 TFLOPS
- BF16 4096x4096: 31.19 TFLOPS
- BF16 8192x8192: 52.18 TFLOPS (63% of cuBLAS)

Correctness: PASS (normalized error < 2%)

Enabled via the `PYGPUKIT_ALLOW_FP16_TC=1` environment variable. Requires matrix dimensions divisible by (128, 128, 32).

Closes #60

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
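The opt-in gate and shape requirement described in this commit can be sketched as a small dispatch check. This is an illustrative sketch, not the actual PyGPUkit code; the function name `fp16_tensorcore_eligible` is hypothetical, while the environment variable and tile sizes come from the commit message.

```python
import os

# BM/BN/BK tile configuration from the commit message; the fast TensorCore
# kernel requires every GEMM dimension to be divisible by its tile size.
BM, BN, BK = 128, 128, 32

def fp16_tensorcore_eligible(m: int, n: int, k: int) -> bool:
    """Return True if an (m, n, k) GEMM may take the FP16/BF16 TC path."""
    # The kernel is opt-in via PYGPUKIT_ALLOW_FP16_TC=1.
    if os.environ.get("PYGPUKIT_ALLOW_FP16_TC") != "1":
        return False
    # Matrix sizes must be divisible by (128, 128, 32).
    return m % BM == 0 and n % BN == 0 and k % BK == 0
```

With the variable unset, all shapes fall back to the default path; with it set, only tile-aligned shapes such as 8192x8192 qualify.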
- Add "What's New in v0.2.6" section with FP16/BF16 TensorCore details
- Update performance comparison table (FP16: 53 TFLOPS, BF16: 52 TFLOPS)
- Update performance-by-matrix-size table with TensorCore results
- Update roadmap to mark v0.2.6 TensorCore work as completed
Implements a generic TensorCore kernel supporting arbitrary sizes:
- M, N >= 16
- K % 8 == 0

Uses `mma.sync.aligned.m16n8k8` with boundary handling.

3-tier dispatch:
1. TC_FAST: M, N % 128 == 0, K % 32 == 0 → 53 TFLOPS
2. TC_GENERIC: M, N >= 16, K % 8 == 0 → ~19 TFLOPS
3. FALLBACK: any size → 2.4 TFLOPS

Benchmark results (RTX 3090 Ti, FP16):
- 4000x4000 TC_GENERIC: 19.08 TFLOPS (8x faster than fallback)
- 3000x3000 TC_GENERIC: 18.52 TFLOPS
- 2000x2000 TC_GENERIC: 13.13 TFLOPS
- 1000x1000 TC_GENERIC: 4.36 TFLOPS

Correctness: PASS for all tested sizes (16x16 to 8192x8192)
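The 3-tier dispatch above amounts to checking shape constraints from fastest to slowest tier. A minimal sketch (the function name `select_matmul_kernel` is hypothetical; the tier names and conditions are from the commit message):

```python
def select_matmul_kernel(m: int, n: int, k: int) -> str:
    """Pick a matmul tier for an (m, n, k) GEMM, fastest first."""
    # Tier 1: fully tile-aligned TensorCore path (~53 TFLOPS).
    if m % 128 == 0 and n % 128 == 0 and k % 32 == 0:
        return "TC_FAST"
    # Tier 2: generic m16n8k8 TensorCore path with boundary handling (~19 TFLOPS).
    if m >= 16 and n >= 16 and k % 8 == 0:
        return "TC_GENERIC"
    # Tier 3: CUDA-core fallback for any remaining shape (~2.4 TFLOPS).
    return "FALLBACK"
```

For example, 8192x8192 hits TC_FAST, 4000x4000 (not divisible by 128) hits TC_GENERIC, and a shape with K not divisible by 8 falls through to FALLBACK.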
Split the monolithic basic.cu (2289 lines) into a modular structure:
- common/: shared types, device utils, error handling
- elementwise/: add, mul, sub, div kernels and dispatch
- unary/: exp, log, relu kernels and dispatch
- reduction/: sum, mean, max with warp-shuffle
- matmul/: dispatch and FP32 kernels (L2-opt, tiled)

Added ops.cuh as the public API header. All 166 tests passing.
## Multi-LLM Async Execution
- Multi-LLM scheduler for parallel AI model execution
- Per-context CUDA streams for stream isolation
- VRAM budgeting per model
- asyncio integration with `async with context_session()`
- 3.37x speedup (GPT-2 + DistilGPT-2 parallel benchmark)

## FP16/BF16 TensorCore Optimization
- FP16 TensorCore: 53 TFLOPS (22x faster than v0.2.5)
- BF16 TensorCore: 52 TFLOPS (22x faster than v0.2.5)
- PTX-level mma.sync.m16n8k16 instructions
- FP32 accumulation for numerical stability

## Neural Network Operations
- GELU activation (GPU-accelerated)
- LayerNorm (GPU-accelerated)
- Supports FP16/BF16/FP32/FP64

## LLM Support
- SafeTensors loader (Rust implementation)
- Tokenizer support
- Model weight loading utilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
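The scheduler pattern above, per-context isolation driven by asyncio, can be illustrated with a toy stand-in. Only the `context_session` name mirrors the API mentioned in the changelog; the mock session, `generate`, and the model names here are illustrative, and a real session would bind a per-context CUDA stream and VRAM budget rather than a plain string.

```python
import asyncio
import contextlib

@contextlib.asynccontextmanager
async def context_session(name: str):
    # Mock: a real session would acquire an isolated CUDA stream
    # and reserve a VRAM budget for this model.
    yield name

async def generate(model: str, steps: int) -> str:
    async with context_session(model) as ctx:
        for _ in range(steps):
            await asyncio.sleep(0)  # stand-in for an async GPU kernel launch
        return f"{ctx}: done"

async def main() -> list:
    # Both models are scheduled concurrently rather than back-to-back.
    return await asyncio.gather(generate("gpt2", 3), generate("distilgpt2", 3))

results = asyncio.run(main())
```

`asyncio.gather` interleaves the two generation loops, which is what allows two models to make progress concurrently on one GPU.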
CUTLASS v4.3.0 is now the default matrix multiplication backend:
- TF32 TensorCore for FP32 inputs (31+ TFLOPS)
- FP16/BF16 native TensorCore (60+ TFLOPS)

Environment variables:
- PYGPUKIT_NO_TF32=1: use the native FP32 kernel (full precision)
- PYGPUKIT_NO_CUTLASS=1: disable CUTLASS entirely

Build optimizations:
- Removed -maxrregcount=128 (was causing a 15x slowdown)
- Disabled CUDA_SEPARABLE_COMPILATION

Performance (RTX 3090 Ti, 8192x8192):
- TF32: 31.6 TFLOPS (up from 27 TFLOPS)
- FP16: 63 TFLOPS
- BF16: 63 TFLOPS
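The two environment variables imply a simple backend-selection order: the CUTLASS kill-switch wins, then the TF32 opt-out applies to FP32 inputs only. A hypothetical sketch (the function and the returned backend names are illustrative; only the environment variables are from the commit message):

```python
import os

def select_backend(dtype: str) -> str:
    """Choose a matmul backend for a given input dtype."""
    # PYGPUKIT_NO_CUTLASS=1 disables CUTLASS entirely.
    if os.environ.get("PYGPUKIT_NO_CUTLASS") == "1":
        return "native"
    # PYGPUKIT_NO_TF32=1 keeps FP32 on the full-precision native kernel.
    if dtype == "float32" and os.environ.get("PYGPUKIT_NO_TF32") == "1":
        return "native_fp32"
    # Defaults: TF32 TensorCore for FP32, native TensorCore for FP16/BF16.
    if dtype == "float32":
        return "cutlass_tf32"
    return "cutlass_tensorcore"
```

Note that the TF32 opt-out trades the ~31 TFLOPS TensorCore path for bit-exact FP32 results, while the CUTLASS opt-out affects every dtype.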
All workflow files now check out with `submodules: recursive` to include the CUTLASS dependency for builds.
- Add CUTLASS Backend section to the v0.2.6 features
- Update performance numbers: TF32 31 TFLOPS, FP16/BF16 63 TFLOPS
- Update the per-size performance table with CUTLASS results
- Simplify FP16/BF16 docs (no env var needed with CUTLASS)
- Update version history
- Update README: Multi-LLM enables concurrent execution, not faster execution, for compute-bound workloads on a single GPU
- Rewrite demo_v026_multi_llm.py to use the PyGPUkit native LLM module:
  - TransformerBlock, MLP, LayerNorm from pygpukit.llm
  - CUTLASS TF32 matmul + native gelu/layernorm
  - Realistic GPT-2-style model structure (~100M params)
- Remove misleading 3.37x speedup claim (it was PyTorch-based)
- Clarify scheduler benefits: execution control, stream isolation, VRAM budgeting, concurrent safety
Removed demo_v026_real_models.py, which used PyTorch/transformers instead of PyGPUkit native operations. The native demo (demo_v026_multi_llm.py) now properly demonstrates PyGPUkit's LLM module with CUTLASS matmul.
Summary
Changes
CUTLASS Integration
- CUTLASS vendored as a git submodule (`third_party/cutlass`)
- TF32 used for FP32 inputs by default (opt out with `PYGPUKIT_NO_TF32=1`)

Multi-LLM Scheduler
- `create_context()` / `context_session()` API

Performance (RTX 3090 Ti, 8192×8192)
- TF32: 31.6 TFLOPS
- FP16/BF16: 63 TFLOPS
Test plan