
v0.2.7: Epilogue Fusion, Multi-SM Kernels, Documentation#76

Merged

m96-chan merged 11 commits into main from feature/v0.2.7 on Dec 15, 2025

Conversation

@m96-chan (Owner)

Summary

PyGPUkit v0.2.7 brings CUTLASS epilogue fusion, Multi-SM kernel optimization, and comprehensive documentation.

Features

Documentation (#72, #73)

New comprehensive documentation in docs/:

| Guide | Description |
| --- | --- |
| getting-started.md | Installation, quick start, basic usage |
| api.md | Complete API reference with examples |
| llm.md | SafeTensors, Tokenizer, GPT-2 model guide |
| performance.md | TF32, FP16, CUTLASS optimization |
| scheduler.md | Multi-LLM concurrent execution |

API Stability

  • Documented stable public API for v0.2.x
  • Deprecation policy defined

Test plan

  • All 184 tests pass
  • Lint (ruff) passes
  • Type check (mypy) passes
  • Benchmark verified on RTX 3090 Ti

🤖 Generated with Claude Code

m96-chan and others added 11 commits December 15, 2025 22:23
- Add optimized transpose kernel using shared memory (32x32 tiles; see the sketch after this list)
- Add bias_add_inplace for broadcast bias addition
- Update Linear layer to use GPU ops (no CPU transfers)
- Remove dead code: basic.cu, basic.cuh (replaced by modular ops/)
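
For orientation, a minimal sketch of the 32x32 shared-memory tiling technique the first bullet refers to; the kernel name and element type here are illustrative, not PyGPUkit's actual implementation:

```cuda
// Illustrative tiled transpose; not PyGPUkit's exact kernel.
constexpr int TILE = 32;

__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int rows, int cols) {
    // +1 column of padding avoids shared-memory bank conflicts on the
    // strided reads in the second half of the kernel.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    // Swap the block indices so both the global load above and the global
    // store below are coalesced; only the shared-memory access is strided.
    x = blockIdx.y * TILE + threadIdx.x;      // column in the output
    y = blockIdx.x * TILE + threadIdx.y;      // row in the output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```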

Performance improvement:
- Multi-LLM demo: 856ms -> 41ms (~20x faster)
- Linear layer no longer transfers to CPU for transpose/bias

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add fused linear + bias + GELU operation using CUTLASS epilogue fusion.
This eliminates intermediate memory writes for MLP layers.

New features:
- linear_bias_gelu(input, weight, bias): Computes gelu(input @ weight^T + bias)
- Support for FP32 (TF32 TensorCore), FP16, and BF16 dtypes
- CUTLASS LinearCombinationGELU epilogue for fused computation

Performance:
- Fused kernel ~5x faster than separate matmul + bias + gelu operations
- (0.114ms vs 0.569ms for 128x768x3072 on RTX 3090 Ti)

Note: Dimensions must be multiples of 16 for TensorCore compatibility.
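
As a sketch of how CUTLASS 2.x expresses this fusion: the GELU is applied by an epilogue functor as accumulators are written out, so no intermediate tensor is materialized. The tile shapes and dtypes below are illustrative choices for SM80 FP16, not PyGPUkit's exact configuration:

```cuda
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination_gelu.h"

// Epilogue: computes gelu(alpha * accumulator + beta * source) per element.
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationGELU<
    cutlass::half_t,                                      // output element type
    128 / cutlass::sizeof_bits<cutlass::half_t>::value,   // epilogue vector width
    float,                                                // accumulator type
    float>;                                               // alpha/beta compute type

using GemmBiasGelu = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A = input
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B = weight accessed as weight^T
    cutlass::half_t, cutlass::layout::RowMajor,     // C/D = bias source / output
    float,                                          // accumulate in FP32
    cutlass::arch::OpClassTensorOp,                 // TensorCore MMA
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 32>,         // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,           // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>,            // MMA instruction shape
    EpilogueOp>;                                    // fused bias + GELU epilogue
```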

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When CUTLASS is unavailable or dimensions are not multiples of 16,
linear_bias_gelu now falls back to separate matmul + bias_add + gelu
operations instead of throwing an error.

This ensures the API works for any input dimensions while still
using the optimized fused kernel when possible.
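
A short sketch of the gate this describes; the helper name and parameters are hypothetical stand-ins, not PyGPUkit's actual symbols:

```cuda
// Hypothetical shape gate for the fused path.
bool use_fused_path(bool cutlass_available, int M, int N, int K) {
    // The TensorCore epilogue path requires every GEMM dimension to be a
    // multiple of 16; anything else takes the unfused fallback.
    return cutlass_available && M % 16 == 0 && N % 16 == 0 && K % 16 == 0;
}
// Fallback: matmul, then bias_add_inplace, then gelu as separate kernels --
// correct for any shape, at the cost of extra launches and intermediate writes.
```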

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…68)

- Add SM version detection with caching in matmul_cutlass.cuh (sketched after this list)
- Define SM80 (A100) and SM86+ (RTX 30xx/40xx/H100+) kernel variants
- SM80: 4-stage pipeline (optimized for data center)
- SM86+: 5-stage pipeline (more shared memory available)
- Runtime dispatch selects optimal kernel based on device SM version
- Update all GEMM types: TF32, FP16, BF16, and BiasGELU variants
- Add MIN_SM_VERSION=80 constant for SM requirement checks
- Build targets: SM80, SM86, SM89, SM90, SM100, SM120
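
An illustrative sketch of the cached detection and runtime dispatch described above; the actual helpers in matmul_cutlass.cuh may differ in detail:

```cuda
#include <cuda_runtime.h>

constexpr int MIN_SM_VERSION = 80;  // Ampere or newer required (per above)

// The lambda runs once; later calls reuse the cached value, so the device
// query is not repeated on every GEMM launch.
static int get_sm_version() {
    static const int sm = [] {
        int device = 0;
        cudaGetDevice(&device);
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, device);
        return prop.major * 10 + prop.minor;  // e.g. 86 for an RTX 3090 Ti
    }();
    return sm;
}

void matmul_dispatch(/* GEMM arguments elided */) {
    if (get_sm_version() >= 86) {
        // 5-stage pipeline variant: consumer Ampere exposes more shared memory
    } else {
        // 4-stage pipeline variant tuned for A100 (SM80)
    }
}
```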

Benchmark (RTX 3090 Ti SM86):
- TF32 8192x8192: 31.8 TFLOPS
- FP16 4096x4096: 46.2 TFLOPS
- BF16 4096x4096: 46.8 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Export `ops` module for advanced usage
- Export `transpose` function
- Export `bias_add_inplace` for in-place bias addition
- Export `linear_bias_gelu` for fused linear+bias+gelu

All public functions are now properly exported in `__all__`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document CUTLASS epilogue fusion (linear+bias+gelu)
- Document Multi-SM CUTLASS kernels (SM80 vs SM86+)
- Document new operations (transpose, bias_add_inplace, linear_bias_gelu)
- Document API improvements
- Update roadmap to show v0.2.7 as released

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document version policy (v0.2.x backward compatible)
- List stable public API functions
- Define deprecation policy for future breaking changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Document SafeTensors loading (memory-mapped, zero-copy)
- Document Tokenizer API (encode/decode, special tokens)
- Document GPT2Model (MLP-only MVP) with usage example
- Document model components (Linear, LayerNorm, MLP)
- Add LLM classes to API stability table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New documentation in docs/:
- getting-started.md: Installation, quick start, basic usage
- api.md: Complete API reference with examples
- llm.md: SafeTensors, Tokenizer, GPT-2 model guide
- performance.md: TF32, FP16, CUTLASS optimization guide
- scheduler.md: Multi-LLM concurrent execution guide

README updates:
- Add documentation table with links
- Simplify LLM section (detailed docs in docs/llm.md)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implementation:
- SM80 (A100): 4-stage pipeline, 48KB shared memory
- SM86 (RTX 30xx): 5-stage pipeline, 100KB shared memory
- SM89 (RTX 40xx): 6-stage pipeline, 128KB shared memory
- SM90+ (H100/Blackwell): CUTLASS 3.x API with TMA/WGMMA

Features:
- Runtime SM dispatch via get_sm_tier() function (sketched after this list)
- TF32, FP16, BF16 GEMM with optimized tile sizes
- BiasGELU fused epilogue variants
- matmul_cutlass_sm90.cuh for Hopper/Blackwell architectures
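
An illustrative tier mapping mirroring the implementation list above; the real get_sm_tier() may differ in detail:

```cuda
// Hypothetical sketch of the SM-tier selection described above.
enum class SmTier { Sm80, Sm86, Sm89, Sm90 };

inline SmTier get_sm_tier(int sm_version) {
    if (sm_version >= 90) return SmTier::Sm90;  // Hopper/Blackwell: CUTLASS 3.x, TMA/WGMMA
    if (sm_version >= 89) return SmTier::Sm89;  // Ada: 6-stage pipeline, 128KB smem
    if (sm_version >= 86) return SmTier::Sm86;  // consumer Ampere: 5 stages, 100KB smem
    return SmTier::Sm80;                        // A100: 4 stages, 48KB smem
}
```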

Build system:
- CUDA 13.1 compatibility for SM100/120 (Blackwell)
- Fixed cuCtxCreate_v4 API change in driver_api.hpp
- Updated release.yml: Linux SM80-90, Windows SM80-120
- Ninja generator support with build_cuda12/13.bat

Benchmark results (RTX 3090 Ti):
- FP16: 47.5 TFLOPS (4096x4096)
- TF32: 27.4 TFLOPS (8192x8192)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove Windows-specific CUDA paths from pyproject.toml
- Default to SM80-90 (CUDA 12.x compatible)
- Add vcvars64.bat setup for Windows builds in release.yml
- SM100/120 requires CUDA 13.x with env override

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>