
v0.2.6: CUTLASS backend, Multi-LLM concurrent execution #70

Merged
m96-chan merged 10 commits into main from feature/v0.2.6-fp16-tensorcore on Dec 15, 2025

Conversation

@m96-chan (Owner)

Summary

  • CUTLASS backend as default GEMM (31 TFLOPS TF32, 63 TFLOPS FP16/BF16)
  • Multi-LLM concurrent execution with stream isolation and VRAM budgeting
  • FP16/BF16 TensorCore support via CUTLASS
  • LLM module with GPT2-style transformer blocks

Changes

CUTLASS Integration

  • CUTLASS v4.3.0 as git submodule (third_party/cutlass)
  • Automatic TF32 for FP32 inputs (disable with PYGPUKIT_NO_TF32=1)
  • Native FP16/BF16 TensorCore kernels

Multi-LLM Scheduler

  • create_context(), context_session() API (usage sketch after this list)
  • Per-context CUDA streams
  • VRAM budgeting per model
  • Concurrent execution (not a speedup) for compute-bound workloads
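
A minimal usage sketch, assuming `create_context()` takes a per-model VRAM budget and `context_session()` scopes submitted work to that context's stream; the module path and the `vram_budget` keyword are illustrative, not the confirmed API:

```python
# Sketch only: create_context() / context_session() are the API names
# from this PR; the module path and vram_budget keyword are assumptions.
from pygpukit.scheduler import create_context, context_session

# One context per model: each gets its own CUDA stream and VRAM budget.
ctx_gpt2 = create_context(vram_budget=4 * 1024**3)    # assumed: bytes
ctx_distil = create_context(vram_budget=2 * 1024**3)

with context_session(ctx_gpt2):
    # Work submitted here runs on ctx_gpt2's stream, isolated from other
    # contexts; allocations count against this context's VRAM budget.
    ...
```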

Performance (RTX 3090 Ti, 8192×8192)

| dtype | TFLOPS |
|-------|--------|
| TF32  | 31+    |
| FP16  | 63     |
| BF16  | 63     |

Test plan

  • All 184 tests pass
  • Benchmark verified on RTX 3090 Ti
  • Multi-LLM demo runs correctly

🤖 Generated with Claude Code

m96-chan and others added 10 commits December 15, 2025 15:51
Implements FP16 and BF16 TensorCore matmul kernels achieving 50+ TFLOPS.

Implementation details:
- Uses mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 for FP16
- Uses mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 for BF16
- FP32 accumulation for numerical stability
- cp.async double-buffering pipeline
- BM=128, BN=128, BK=32 tile configuration
- 8 warps (256 threads) per block

Benchmark results (RTX 3090 Ti):
- FP16 2048x2048: 17.89 TFLOPS
- FP16 4096x4096: 37.81 TFLOPS
- FP16 8192x8192: 53.05 TFLOPS (71% of cuBLAS)
- BF16 2048x2048: 19.04 TFLOPS
- BF16 4096x4096: 31.19 TFLOPS
- BF16 8192x8192: 52.18 TFLOPS (63% of cuBLAS)

Correctness: PASS (normalized error < 2%)
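
For reference, the TFLOPS figures follow from the standard 2·M·N·K FLOP count for GEMM, and the correctness check compares against an FP32 reference. A NumPy sketch of both calculations (an illustration, not the repository's benchmark script):

```python
# Illustration of how TFLOPS and "normalized error" are computed;
# not the repository's actual benchmark harness.
import time
import numpy as np

M = N = K = 1024  # use 8192 to match the table (slow on CPU)
a = np.random.randn(M, K).astype(np.float16)
b = np.random.randn(K, N).astype(np.float16)

t0 = time.perf_counter()
c = a @ b                              # stand-in for the TensorCore kernel
dt = time.perf_counter() - t0
print(f"{2.0 * M * N * K / dt / 1e12:.3f} TFLOPS")  # 2 FLOPs per MAC

# Normalized error against an FP32 reference; < 2% counts as PASS.
ref = a.astype(np.float32) @ b.astype(np.float32)
err = np.linalg.norm(c.astype(np.float32) - ref) / np.linalg.norm(ref)
print(f"normalized error: {err:.4f}")
```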

Enabled via PYGPUKIT_ALLOW_FP16_TC=1 environment variable.
Requires matrix sizes divisible by (128, 128, 32).

Closes #60

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add "What's New in v0.2.6" section with FP16/BF16 TensorCore details
- Update performance comparison table (FP16: 53 TFLOPS, BF16: 52 TFLOPS)
- Update performance by matrix size table with TensorCore results
- Update roadmap to mark v0.2.6 TensorCore as completed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements a generic TensorCore kernel supporting arbitrary sizes:
- M, N >= 16
- K % 8 == 0

Uses mma.sync.aligned.m16n8k8 with boundary handling.

3-tier dispatch (sketched in Python after the list):
1. TC_FAST: M,N % 128 == 0, K % 32 == 0 → 53 TFLOPS
2. TC_GENERIC: M,N >= 16, K % 8 == 0 → ~19 TFLOPS
3. FALLBACK: any size → 2.4 TFLOPS
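
The tier selection reduces to simple divisibility checks; a Python model of the dispatch logic above (the real dispatcher lives in the CUDA backend):

```python
# Python model of the 3-tier dispatch; tier names and conditions are
# taken from this commit message, the real code is in the CUDA backend.
def select_matmul_tier(M: int, N: int, K: int) -> str:
    if M % 128 == 0 and N % 128 == 0 and K % 32 == 0:
        return "TC_FAST"      # fully tiled TensorCore path, ~53 TFLOPS
    if M >= 16 and N >= 16 and K % 8 == 0:
        return "TC_GENERIC"   # boundary-handling TensorCore path, ~19 TFLOPS
    return "FALLBACK"         # non-TensorCore kernel, ~2.4 TFLOPS

assert select_matmul_tier(8192, 8192, 8192) == "TC_FAST"
assert select_matmul_tier(4000, 4000, 4000) == "TC_GENERIC"  # 4000 % 128 != 0
assert select_matmul_tier(17, 17, 10) == "FALLBACK"          # K % 8 != 0
```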

Benchmark results (RTX 3090 Ti, FP16):
- 4000x4000 TC_GENERIC: 19.08 TFLOPS (8x faster than fallback)
- 3000x3000 TC_GENERIC: 18.52 TFLOPS
- 2000x2000 TC_GENERIC: 13.13 TFLOPS
- 1000x1000 TC_GENERIC:  4.36 TFLOPS

Correctness: PASS for all tested sizes (16x16 to 8192x8192)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split the monolithic basic.cu (2289 lines) into a modular structure:

- common/: shared types, device utils, error handling
- elementwise/: add, mul, sub, div kernels and dispatch
- unary/: exp, log, relu kernels and dispatch
- reduction/: sum, mean, max with warp-shuffle
- matmul/: dispatch and FP32 kernels (L2-opt, tiled)

Added ops.cuh as public API header. All 166 tests passing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## Multi-LLM Async Execution
- Multi-LLM scheduler for parallel AI model execution
- Per-context CUDA streams for stream isolation
- VRAM budgeting per model
- asyncio integration with `async with context_session()` (sketch after this list)
- 3.37x speedup (GPT-2 + DistilGPT-2 parallel benchmark)
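
A sketch of the asyncio integration, assuming `context_session()` works as an async context manager as described above; the module path and the inference coroutine are hypothetical placeholders:

```python
# Sketch of concurrent Multi-LLM execution; `async with context_session()`
# is from this commit, while run_inference() is a hypothetical placeholder
# for a model's forward pass, and the module path is assumed.
import asyncio
from pygpukit.scheduler import create_context, context_session

async def run_model(ctx, name: str) -> str:
    async with context_session(ctx):
        # Kernels launched in this block go to ctx's dedicated CUDA stream,
        # so the two models below never serialize on a shared default stream.
        return await run_inference(name)  # hypothetical inference coroutine

async def main() -> None:
    ctx_a = create_context()  # e.g. GPT-2
    ctx_b = create_context()  # e.g. DistilGPT-2
    results = await asyncio.gather(
        run_model(ctx_a, "gpt2"),
        run_model(ctx_b, "distilgpt2"),
    )
    print(results)

asyncio.run(main())
```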

## FP16/BF16 TensorCore Optimization
- FP16 TensorCore: 53 TFLOPS (22x faster than v0.2.5)
- BF16 TensorCore: 52 TFLOPS (22x faster than v0.2.5)
- PTX-level mma.sync.m16n8k16 instructions
- FP32 accumulation for numerical stability

## Neural Network Operations
- GELU activation (GPU-accelerated)
- LayerNorm (GPU-accelerated)
- Supports FP16/BF16/FP32/FP64 (NumPy reference sketch below)
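
For comparison with the GPU kernels, NumPy reference implementations of both ops; whether the kernels use the exact erf-based GELU or the tanh approximation below is an assumption:

```python
# NumPy reference for the GPU-accelerated ops above. The tanh-approximate
# GELU (used by GPT-2-style models) is an assumption about the kernel;
# an exact erf-based variant is also common.
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layernorm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
              eps: float = 1e-5) -> np.ndarray:
    # Normalize over the last axis, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```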

## LLM Support
- SafeTensors loader (Rust implementation)
- Tokenizer support
- Model weight loading utilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CUTLASS v4.3.0 is now the default matrix multiplication backend:
- TF32 TensorCore for FP32 inputs (31+ TFLOPS)
- FP16/BF16 native TensorCore (60+ TFLOPS)

Environment variables (usage sketched after this list):
- PYGPUKIT_NO_TF32=1: Use native FP32 kernel (full precision)
- PYGPUKIT_NO_CUTLASS=1: Disable CUTLASS entirely
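
A usage sketch for these switches; that they must be set before pygpukit initializes its backend (e.g. before import) is an assumption about load order:

```python
# Selecting the matmul backend via the environment variables listed
# above. Setting them before importing pygpukit is an assumption about
# when the backend reads them.
import os

os.environ["PYGPUKIT_NO_TF32"] = "1"       # native FP32 kernel, full precision
# os.environ["PYGPUKIT_NO_CUTLASS"] = "1"  # disable CUTLASS entirely

import pygpukit  # backend configured at import time (assumed)
```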

Build optimizations:
- Removed -maxrregcount=128 (was causing 15x slowdown)
- Disabled CUDA_SEPARABLE_COMPILATION

Performance (RTX 3090 Ti, 8192x8192):
- TF32: 31.6 TFLOPS (up from 27 TFLOPS)
- FP16: 63 TFLOPS
- BF16: 63 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All workflow files now check out with `submodules: recursive`
to include the CUTLASS dependency in builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add CUTLASS Backend section in v0.2.6 features
- Update performance numbers: TF32 31 TFLOPS, FP16/BF16 63 TFLOPS
- Update per-size performance table with CUTLASS results
- Simplify FP16/BF16 docs (no env var needed with CUTLASS)
- Update version history

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update README: Multi-LLM enables concurrent execution, not faster
  execution, for compute-bound workloads on a single GPU
- Rewrite demo_v026_multi_llm.py to use PyGPUkit native LLM module:
  - TransformerBlock, MLP, LayerNorm from pygpukit.llm
  - CUTLASS TF32 matmul + native gelu/layernorm
  - Realistic GPT2-style model structure (~100M params; sketched below)
- Remove misleading 3.37x speedup claim (was PyTorch-based)
- Clarify scheduler benefits: execution control, stream isolation,
  VRAM budgeting, concurrent safety
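
A sketch of what that GPT2-style structure could look like with the pygpukit.llm classes named above; the layer counts match GPT-2 small, while the constructor and call signatures are assumptions:

```python
# Illustrative sketch: TransformerBlock and LayerNorm are the pygpukit.llm
# names cited in this commit; hyperparameters, constructor signatures, and
# the call convention are assumptions (GPT-2 small: 12 layers, 768 hidden,
# 12 heads, ~100M params).
from pygpukit.llm import TransformerBlock, LayerNorm

blocks = [TransformerBlock(hidden=768, heads=12) for _ in range(12)]
final_ln = LayerNorm(768)

def forward(x):
    for block in blocks:
        x = block(x)   # attention + MLP; CUTLASS TF32 matmul inside
    return final_ln(x)
```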

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed demo_v026_real_models.py which used PyTorch/transformers
instead of PyGPUkit native operations. The native demo
(demo_v026_multi_llm.py) now properly demonstrates PyGPUkit's
LLM module with CUTLASS matmul.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
m96-chan merged commit 0555428 into main on Dec 15, 2025. 15 checks passed.
m96-chan deleted the feature/v0.2.6-fp16-tensorcore branch on December 26, 2025 at 09:38.