v0.2.7: Epilogue Fusion, Multi-SM Kernels, Documentation #76
Merged
Conversation
- Add optimized transpose kernel using shared memory (32x32 tiles)
- Add bias_add_inplace for broadcast bias addition
- Update Linear layer to use GPU ops (no CPU transfers)
- Remove dead code: basic.cu, basic.cuh (replaced by modular ops/)

Performance improvement:
- Multi-LLM demo: 856ms -> 41ms (~20x faster)
- Linear layer no longer transfers to CPU for transpose/bias
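A minimal usage sketch of the two new ops. The op names are from this PR, but the module import name and the `from_numpy` constructor are assumptions, not confirmed API:

```python
import numpy as np
import pygpukit as gpk  # module name assumed; `from_numpy` is hypothetical

x = gpk.from_numpy(np.random.rand(128, 768).astype(np.float32))
b = gpk.from_numpy(np.random.rand(768).astype(np.float32))

xt = gpk.transpose(x)       # tiled shared-memory transpose, stays on the GPU
gpk.bias_add_inplace(x, b)  # broadcast bias add over rows, in place
```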
Add fused linear + bias + GELU operation using CUTLASS epilogue fusion. This eliminates intermediate memory writes for MLP layers.

New features:
- linear_bias_gelu(input, weight, bias): computes gelu(input @ weight^T + bias)
- Support for FP32 (TF32 TensorCore), FP16, and BF16 dtypes
- CUTLASS LinearCombinationGELU epilogue for fused computation

Performance:
- Fused kernel ~5x faster than separate matmul + bias + gelu operations (0.114ms vs 0.569ms for 128x768x3072 on RTX 3090 Ti)

Note: dimensions must be multiples of 16 for TensorCore compatibility.
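A sketch of the fused op at the benchmark shape. The signature and semantics (`gelu(input @ weight^T + bias)`) are from the commit message above; the array-construction helper is the same hypothetical one as before:

```python
import numpy as np
import pygpukit as gpk  # module name assumed

# 128x768x3072: all dims are multiples of 16, so the fused path applies.
x = gpk.from_numpy(np.random.rand(128, 768).astype(np.float32))   # [M, K]
w = gpk.from_numpy(np.random.rand(3072, 768).astype(np.float32))  # [N, K]
b = gpk.from_numpy(np.random.rand(3072).astype(np.float32))       # [N]

y = gpk.linear_bias_gelu(x, w, b)  # one kernel: gelu(x @ w.T + b) -> [M, N]
```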
When CUTLASS is unavailable or dimensions are not multiples of 16, linear_bias_gelu now falls back to separate matmul + bias_add + gelu operations instead of throwing an error. This ensures the API works for any input dimensions while still using the optimized fused kernel when possible.
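An illustrative NumPy model of that fallback policy, not the library's actual code; the eligibility check and helper names here are assumptions based on the description above:

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def can_use_fused(m: int, n: int, k: int, cutlass_available: bool) -> bool:
    # Fused path needs CUTLASS plus TensorCore-friendly dims (multiples of 16).
    return cutlass_available and all(d % 16 == 0 for d in (m, n, k))

def linear_bias_gelu(x, w, b, cutlass_available=False):
    m, k = x.shape
    n, _ = w.shape
    if can_use_fused(m, n, k, cutlass_available):
        raise NotImplementedError("single fused CUTLASS kernel goes here")
    # Fallback: separate matmul + bias_add + gelu, as described above.
    y = x @ w.T
    y += b
    return gelu(y)
```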
…68)
- Add SM version detection with caching in matmul_cutlass.cuh
- Define SM80 (A100) and SM86+ (RTX 30xx/40xx/H100+) kernel variants
  - SM80: 4-stage pipeline (optimized for data center)
  - SM86+: 5-stage pipeline (more shared memory available)
- Runtime dispatch selects optimal kernel based on device SM version
- Update all GEMM types: TF32, FP16, BF16, and BiasGELU variants
- Add MIN_SM_VERSION=80 constant for SM requirement checks
- Build targets: SM80, SM86, SM89, SM90, SM100, SM120

Benchmark (RTX 3090 Ti, SM86):
- TF32 8192x8192: 31.8 TFLOPS
- FP16 4096x4096: 46.2 TFLOPS
- BF16 4096x4096: 46.8 TFLOPS
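A conceptual Python rendering of the cached detection and dispatch described above; the real logic lives in matmul_cutlass.cuh on the C++/CUDA side, and the hardcoded device query here is a stand-in:

```python
from functools import lru_cache

MIN_SM_VERSION = 80  # CUTLASS path requires Ampere (SM80) or newer

@lru_cache(maxsize=None)  # detection result is computed once and cached
def get_sm_version() -> int:
    # Stand-in for a cudaGetDeviceProperties query.
    return 86  # e.g. an RTX 30xx-class device

def select_pipeline_stages() -> int:
    sm = get_sm_version()
    if sm < MIN_SM_VERSION:
        raise RuntimeError(f"CUTLASS kernels require SM{MIN_SM_VERSION}+")
    return 4 if sm == 80 else 5  # SM80: 4-stage; SM86+: 5-stage pipeline
```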
- Export `ops` module for advanced usage
- Export `transpose` function
- Export `bias_add_inplace` for in-place bias addition
- Export `linear_bias_gelu` for fused linear+bias+gelu

All public functions are now properly exported in `__all__`.
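With these exports, the public surface can be imported directly (top-level module name assumed to be `pygpukit`):

```python
from pygpukit import ops, transpose, bias_add_inplace, linear_bias_gelu
```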
- Document CUTLASS epilogue fusion (linear+bias+gelu)
- Document Multi-SM CUTLASS kernels (SM80 vs SM86+)
- Document new operations (transpose, bias_add_inplace, linear_bias_gelu)
- Document API improvements
- Update roadmap to show v0.2.7 as released
- Document version policy (v0.2.x backward compatible)
- List stable public API functions
- Define deprecation policy for future breaking changes
- Document SafeTensors loading (memory-mapped, zero-copy)
- Document Tokenizer API (encode/decode, special tokens)
- Document GPT2Model (MLP-only MVP) with usage example
- Document model components (Linear, LayerNorm, MLP)
- Add LLM classes to API stability table
New documentation in docs/:
- getting-started.md: installation, quick start, basic usage
- api.md: complete API reference with examples
- llm.md: SafeTensors, Tokenizer, GPT-2 model guide
- performance.md: TF32, FP16, CUTLASS optimization guide
- scheduler.md: multi-LLM concurrent execution guide

README updates:
- Add documentation table with links
- Simplify LLM section (detailed docs in docs/llm.md)
Implementation:
- SM80 (A100): 4-stage pipeline, 48KB shared memory
- SM86 (RTX 30xx): 5-stage pipeline, 100KB shared memory
- SM89 (RTX 40xx): 6-stage pipeline, 128KB shared memory
- SM90+ (H100/Blackwell): CUTLASS 3.x API with TMA/WGMMA

Features:
- Runtime SM dispatch via get_sm_tier() function
- TF32, FP16, BF16 GEMM with optimized tile sizes
- BiasGELU fused epilogue variants
- matmul_cutlass_sm90.cuh for Hopper/Blackwell architectures

Build system:
- CUDA 13.1 compatibility for SM100/120 (Blackwell)
- Fixed cuCtxCreate_v4 API change in driver_api.hpp
- Updated release.yml: Linux SM80-90, Windows SM80-120
- Ninja generator support with build_cuda12/13.bat

Benchmark results (RTX 3090 Ti):
- FP16: 47.5 TFLOPS (4096x4096)
- TF32: 27.4 TFLOPS (8192x8192)
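A sketch of the tier table implied by the bullets above. The real get_sm_tier() is C++ in matmul_cutlass.cuh; the exact selection rule shown here (highest defined tier not exceeding the device SM) is an assumption:

```python
SM_TIERS = {
    80: {"stages": 4, "smem_kb": 48},       # A100
    86: {"stages": 5, "smem_kb": 100},      # RTX 30xx
    89: {"stages": 6, "smem_kb": 128},      # RTX 40xx
    90: {"stages": None, "smem_kb": None},  # H100+: CUTLASS 3.x TMA/WGMMA path
}

def get_sm_tier(sm_version: int) -> int:
    # Assumed rule: pick the highest tier the device supports.
    eligible = [t for t in SM_TIERS if t <= sm_version]
    if not eligible:
        raise RuntimeError("CUTLASS kernels require SM80+")
    return max(eligible)

assert get_sm_tier(89) == 89  # RTX 40xx uses its own 6-stage tier
assert get_sm_tier(100) == 90  # Blackwell falls into the CUTLASS 3.x path
```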
- Remove Windows-specific CUDA paths from pyproject.toml
- Default to SM80-90 (CUDA 12.x compatible)
- Add vcvars64.bat setup for Windows builds in release.yml
- SM100/120 requires CUDA 13.x with env override
Summary
PyGPUkit v0.2.7 brings CUTLASS epilogue fusion, Multi-SM kernel optimization, and comprehensive documentation.
Features
- `linear_bias_gelu` operation (single kernel for linear + bias + gelu)
- `transpose` and `bias_add_inplace` operations
- New exports (`ops`, `transpose`, `bias_add_inplace`, `linear_bias_gelu`)

Documentation (#72, #73)
New comprehensive documentation in docs/:
- getting-started.md
- api.md
- llm.md
- performance.md
- scheduler.md

API Stability
Test plan
🤖 Generated with Claude Code