
v0.2.7: Epilogue Fusion, Multi-SM Kernels, Documentation#76

Merged

m96-chan merged 11 commits into main from feature/v0.2.7 on Dec 15, 2025

Conversation

@m96-chan (Owner)

Summary

PyGPUkit v0.2.7 brings CUTLASS epilogue fusion, Multi-SM kernel optimization, and comprehensive documentation.

Features

Documentation (#72, #73)

New comprehensive documentation in docs/:

| Guide | Description |
| --- | --- |
| getting-started.md | Installation, quick start, basic usage |
| api.md | Complete API reference with examples |
| llm.md | SafeTensors, Tokenizer, GPT-2 model guide |
| performance.md | TF32, FP16, CUTLASS optimization |
| scheduler.md | Multi-LLM concurrent execution |

API Stability

  • Documented stable public API for v0.2.x
  • Deprecation policy defined

Test plan

  • All 184 tests pass
  • Lint (ruff) passes
  • Type check (mypy) passes
  • Benchmark verified on RTX 3090 Ti

🤖 Generated with Claude Code

m96-chan and others added 11 commits December 15, 2025 22:23
- Add optimized transpose kernel using shared memory (32x32 tiles; see the sketch after this list)
- Add bias_add_inplace for broadcast bias addition
- Update Linear layer to use GPU ops (no CPU transfers)
- Remove dead code: basic.cu, basic.cuh (replaced by modular ops/)
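
For orientation, a minimal sketch of the 32x32 shared-memory tiling technique the first bullet refers to; the kernel name and element type here are illustrative, not PyGPUkit's actual implementation:

```cuda
// Illustrative tiled transpose; not PyGPUkit's exact kernel.
constexpr int TILE = 32;

__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int rows, int cols) {
    // +1 column of padding avoids shared-memory bank conflicts on the
    // strided reads in the second half of the kernel.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    // Swap the block indices so both the global load above and the global
    // store below are coalesced; only the shared-memory access is strided.
    x = blockIdx.y * TILE + threadIdx.x;      // column in the output
    y = blockIdx.x * TILE + threadIdx.y;      // row in the output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```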

Performance improvement:
- Multi-LLM demo: 856ms -> 41ms (~20x faster)
- Linear layer no longer transfers to CPU for transpose/bias

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add fused linear + bias + GELU operation using CUTLASS epilogue fusion.
This eliminates intermediate memory writes for MLP layers.

New features:
- linear_bias_gelu(input, weight, bias): Computes gelu(input @ weight^T + bias)
- Support for FP32 (TF32 TensorCore), FP16, and BF16 dtypes
- CUTLASS LinearCombinationGELU epilogue for fused computation

Performance:
- Fused kernel ~5x faster than separate matmul + bias + gelu operations
- (0.114ms vs 0.569ms for 128x768x3072 on RTX 3090 Ti)

Note: Dimensions must be multiples of 16 for TensorCore compatibility.
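
As a sketch of how CUTLASS 2.x expresses this fusion: the GELU is applied by an epilogue functor as accumulators are written out, so no intermediate tensor is materialized. The tile shapes and dtypes below are illustrative choices for SM80 FP16, not PyGPUkit's exact configuration:

```cuda
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination_gelu.h"

// Epilogue: computes gelu(alpha * accumulator + beta * source) per element.
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationGELU<
    cutlass::half_t,                                      // output element type
    128 / cutlass::sizeof_bits<cutlass::half_t>::value,   // epilogue vector width
    float,                                                // accumulator type
    float>;                                               // alpha/beta compute type

using GemmBiasGelu = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A = input
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B = weight accessed as weight^T
    cutlass::half_t, cutlass::layout::RowMajor,     // C/D = bias source / output
    float,                                          // accumulate in FP32
    cutlass::arch::OpClassTensorOp,                 // TensorCore MMA
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 32>,         // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,           // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>,            // MMA instruction shape
    EpilogueOp>;                                    // fused bias + GELU epilogue
```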

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When CUTLASS is unavailable or dimensions are not multiples of 16,
linear_bias_gelu now falls back to separate matmul + bias_add + gelu
operations instead of throwing an error.

This ensures the API works for any input dimensions while still
using the optimized fused kernel when possible.
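
A short sketch of the gate this describes; the helper name and parameters are hypothetical stand-ins, not PyGPUkit's actual symbols:

```cuda
// Hypothetical shape gate for the fused path.
bool use_fused_path(bool cutlass_available, int M, int N, int K) {
    // The TensorCore epilogue path requires every GEMM dimension to be a
    // multiple of 16; anything else takes the unfused fallback.
    return cutlass_available && M % 16 == 0 && N % 16 == 0 && K % 16 == 0;
}
// Fallback: matmul, then bias_add_inplace, then gelu as separate kernels --
// correct for any shape, at the cost of extra launches and intermediate writes.
```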

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…68)

- Add SM version detection with caching in matmul_cutlass.cuh (sketched after this list)
- Define SM80 (A100) and SM86+ (RTX 30xx/40xx/H100+) kernel variants
- SM80: 4-stage pipeline (optimized for data center)
- SM86+: 5-stage pipeline (more shared memory available)
- Runtime dispatch selects optimal kernel based on device SM version
- Update all GEMM types: TF32, FP16, BF16, and BiasGELU variants
- Add MIN_SM_VERSION=80 constant for SM requirement checks
- Build targets: SM80, SM86, SM89, SM90, SM100, SM120
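
An illustrative sketch of the cached detection and runtime dispatch described above; the actual helpers in matmul_cutlass.cuh may differ in detail:

```cuda
#include <cuda_runtime.h>

constexpr int MIN_SM_VERSION = 80;  // Ampere or newer required (per above)

// The lambda runs once; later calls reuse the cached value, so the device
// query is not repeated on every GEMM launch.
static int get_sm_version() {
    static const int sm = [] {
        int device = 0;
        cudaGetDevice(&device);
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, device);
        return prop.major * 10 + prop.minor;  // e.g. 86 for an RTX 3090 Ti
    }();
    return sm;
}

void matmul_dispatch(/* GEMM arguments elided */) {
    if (get_sm_version() >= 86) {
        // 5-stage pipeline variant: consumer Ampere exposes more shared memory
    } else {
        // 4-stage pipeline variant tuned for A100 (SM80)
    }
}
```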

Benchmark (RTX 3090 Ti SM86):
- TF32 8192x8192: 31.8 TFLOPS
- FP16 4096x4096: 46.2 TFLOPS
- BF16 4096x4096: 46.8 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Export `ops` module for advanced usage
- Export `transpose` function
- Export `bias_add_inplace` for in-place bias addition
- Export `linear_bias_gelu` for fused linear+bias+gelu

All public functions are now properly exported in `__all__`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document CUTLASS epilogue fusion (linear+bias+gelu)
- Document Multi-SM CUTLASS kernels (SM80 vs SM86+)
- Document new operations (transpose, bias_add_inplace, linear_bias_gelu)
- Document API improvements
- Update roadmap to show v0.2.7 as released

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document version policy (v0.2.x backward compatible)
- List stable public API functions
- Define deprecation policy for future breaking changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document SafeTensors loading (memory-mapped, zero-copy)
- Document Tokenizer API (encode/decode, special tokens)
- Document GPT2Model (MLP-only MVP) with usage example
- Document model components (Linear, LayerNorm, MLP)
- Add LLM classes to API stability table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New documentation in docs/:
- getting-started.md: Installation, quick start, basic usage
- api.md: Complete API reference with examples
- llm.md: SafeTensors, Tokenizer, GPT-2 model guide
- performance.md: TF32, FP16, CUTLASS optimization guide
- scheduler.md: Multi-LLM concurrent execution guide

README updates:
- Add documentation table with links
- Simplify LLM section (detailed docs in docs/llm.md)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implementation:
- SM80 (A100): 4-stage pipeline, 48KB shared memory
- SM86 (RTX 30xx): 5-stage pipeline, 100KB shared memory
- SM89 (RTX 40xx): 6-stage pipeline, 128KB shared memory
- SM90+ (H100/Blackwell): CUTLASS 3.x API with TMA/WGMMA

Features:
- Runtime SM dispatch via get_sm_tier() function (sketched after this list)
- TF32, FP16, BF16 GEMM with optimized tile sizes
- BiasGELU fused epilogue variants
- matmul_cutlass_sm90.cuh for Hopper/Blackwell architectures
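
An illustrative tier mapping mirroring the implementation list above; the real get_sm_tier() may differ in detail:

```cuda
// Hypothetical sketch of the SM-tier selection described above.
enum class SmTier { Sm80, Sm86, Sm89, Sm90 };

inline SmTier get_sm_tier(int sm_version) {
    if (sm_version >= 90) return SmTier::Sm90;  // Hopper/Blackwell: CUTLASS 3.x, TMA/WGMMA
    if (sm_version >= 89) return SmTier::Sm89;  // Ada: 6-stage pipeline, 128KB smem
    if (sm_version >= 86) return SmTier::Sm86;  // consumer Ampere: 5 stages, 100KB smem
    return SmTier::Sm80;                        // A100: 4 stages, 48KB smem
}
```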

Build system:
- CUDA 13.1 compatibility for SM100/120 (Blackwell)
- Fixed cuCtxCreate_v4 API change in driver_api.hpp
- Updated release.yml: Linux SM80-90, Windows SM80-120
- Ninja generator support with build_cuda12/13.bat

Benchmark results (RTX 3090 Ti):
- FP16: 47.5 TFLOPS (4096x4096)
- TF32: 27.4 TFLOPS (8192x8192)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove Windows-specific CUDA paths from pyproject.toml
- Default to SM80-90 (CUDA 12.x compatible)
- Add vcvars64.bat setup for Windows builds in release.yml
- SM100/120 requires CUDA 13.x with env override

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>