Merged
12 changes: 9 additions & 3 deletions .github/workflows/release.yml
@@ -89,7 +89,9 @@ jobs:
run: |
python -m build --wheel
env:
-        CMAKE_CUDA_ARCHITECTURES: "70;75;80;86;89;90"
+        # PyGPUkit requires SM >= 80 (Ampere and newer)
+        # SM100/120 (Blackwell) requires CUDA 13.x - not available on GitHub runners yet
+        CMAKE_CUDA_ARCHITECTURES: "80;86;89;90"

- name: Show wheel info before repair
run: |
@@ -187,11 +189,15 @@ jobs:
Get-ChildItem ../../src/pygpukit/*.pyd

- name: Build wheel (C++ + Rust)
-      shell: pwsh
+      shell: cmd
       run: |
+        @REM Set up VS environment for cl.exe
+        call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
         python -m build --wheel
       env:
-        CMAKE_CUDA_ARCHITECTURES: "86"
+        # PyGPUkit requires SM >= 80 (Ampere and newer)
+        # Self-hosted runner should have CUDA 13.1 for SM100/120 (Blackwell) support
+        CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120"

- name: Verify wheel contents
shell: pwsh
106 changes: 104 additions & 2 deletions README.md
@@ -7,6 +7,18 @@

---

## Documentation

| Guide | Description |
|-------|-------------|
| [Getting Started](docs/getting-started.md) | Installation, quick start, basic usage |
| [API Reference](docs/api.md) | Complete API documentation with examples |
| [LLM Guide](docs/llm.md) | SafeTensors, Tokenizer, GPT-2 model loading |
| [Performance Tuning](docs/performance.md) | TF32, FP16, CUTLASS optimization |
| [Scheduler Guide](docs/scheduler.md) | Multi-LLM concurrent execution |

---

## Overview
**PyGPUkit** is a lightweight GPU runtime for Python that provides:
- **Single-binary distribution** — works with just GPU drivers, no CUDA Toolkit needed
@@ -21,6 +33,71 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea

---

## What's New in v0.2.7

### CUTLASS Epilogue Fusion
Linear + bias + GELU is now fused into a single CUTLASS kernel via epilogue fusion, eliminating intermediate memory traffic and improving performance in transformer workloads.

```python
import pygpukit as gpk
import numpy as np

# Create tensors
batch, in_feat, out_feat = 512, 768, 3072
input = gpk.from_numpy(np.random.randn(batch, in_feat).astype(np.float32))
weight = gpk.from_numpy(np.random.randn(out_feat, in_feat).astype(np.float32))
bias = gpk.from_numpy(np.random.randn(out_feat).astype(np.float32))

# Fused linear + bias + GELU (single kernel, no intermediate memory)
output = gpk.linear_bias_gelu(input, weight, bias)
```

### Multi-SM CUTLASS Kernels
Runtime SM detection with optimized kernel variants:
- **SM80 (A100)**: 4-stage pipeline optimized for 48KB shared memory
- **SM86+ (RTX 30xx/40xx, H100)**: 5-stage pipeline for 100KB+ shared memory
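
The variant selection described above can be sketched roughly as follows. This is a hypothetical helper, not PyGPUkit's actual internals; in practice the compute capability would come from the CUDA driver API.

```python
# Hypothetical dispatch sketch: choose a CUTLASS kernel variant from the
# device's compute capability (major, minor).
def pick_gemm_variant(major: int, minor: int) -> str:
    sm = major * 10 + minor
    if sm < 80:
        raise RuntimeError("PyGPUkit requires SM >= 80 (Ampere or newer)")
    # SM80 (A100): 4-stage pipeline; SM86 and newer: 5-stage pipeline
    return "gemm_sm80_4stage" if sm == 80 else "gemm_sm86_5stage"
```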

### New Operations
| Operation | Description |
|-----------|-------------|
| `gpk.transpose(a)` | GPU-native matrix transpose |
| `gpk.bias_add_inplace(out, bias)` | In-place bias addition |
| `gpk.linear_bias_gelu(x, w, b)` | Fused linear + bias + GELU |
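
As a rough sketch of the intended semantics, here are NumPy reference equivalents of these operations (behavioral sketches only, not the GPU kernels; whether PyGPUkit uses the tanh GELU approximation is an assumption):

```python
import numpy as np

def transpose(a):
    # gpk.transpose: matrix transpose
    return np.ascontiguousarray(a.T)

def bias_add_inplace(out, bias):
    # gpk.bias_add_inplace: add bias to each row of out, in place
    out += bias
    return out

def gelu(x):
    # tanh-approximation GELU (assumed variant)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_bias_gelu(x, w, b):
    # gpk.linear_bias_gelu: gelu(x @ w.T + b), with w shaped (out_feat, in_feat)
    return gelu(x @ w.T + b)
```

Note the weight layout matches the fusion example above: `w` is `(out_feat, in_feat)`, so the contraction is against `w.T`.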

### API Improvements
- Complete public API exports (all operations accessible via `gpk.*`)
- Consistent snake_case naming convention
- Full docstrings for all public functions

---

## LLM Support

PyGPUkit includes built-in support for loading and running LLM models.
See the [LLM Guide](docs/llm.md) for detailed documentation.

```python
from pygpukit.llm import SafeTensorsFile, Tokenizer

# Load safetensors (memory-mapped, zero-copy)
st = SafeTensorsFile("model.safetensors")
print(f"Tensors: {st.num_tensors}, Size: {st.file_size / 1e9:.2f} GB")

# Tokenizer (HuggingFace format)
tok = Tokenizer("tokenizer.json")
ids = tok.encode("Hello, world!")
text = tok.decode(ids)
```

| Component | Description |
|-----------|-------------|
| `SafeTensorsFile` | Memory-mapped .safetensors loading |
| `Tokenizer` | BPE tokenizer (HuggingFace format) |
| `GPT2Model` | GPT-2 model (MLP-only MVP) |
| `Linear`, `LayerNorm`, `MLP` | Model building blocks |
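
The zero-copy loading relies on the .safetensors layout: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. A minimal header parser illustrating the format (a standalone sketch, independent of PyGPUkit's implementation):

```python
import json
import struct

def read_safetensors_header(path):
    # .safetensors layout: u64 LE header size, JSON header, then tensor data.
    # The header maps tensor name -> {"dtype", "shape", "data_offsets"}.
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))
```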

---

## What's New in v0.2.6

### CUTLASS Backend (Default)
@@ -351,13 +428,38 @@ PyGPUkit/
| **v0.2.4** | **Single-binary distribution**, dynamic NVRTC, driver-only mode |
| **v0.2.5** | **FP16/BF16 support**, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |
| **v0.2.6** | **CUTLASS backend** (31 TFLOPS TF32, 63 TFLOPS FP16/BF16), Multi-LLM concurrent execution |
| **v0.2.7** | **Epilogue fusion** (linear+bias+gelu), Multi-SM kernels, API review |

### Planned

| Version | Goals |
|---------|-------|
-| **v0.2.7** | Full API review, documentation, backward compatibility |
-| **v0.3** | Triton backend, advanced ops (softmax, layernorm), MPS/MIG |
+| **v0.3** | Triton backend, advanced ops (softmax), MPS/MIG |

---

## API Stability & Backward Compatibility

### Version Policy
- **v0.2.x**: Backward compatible within minor versions. New features may be added, but existing APIs remain stable.
- **v0.3+**: May introduce breaking changes, preceded by deprecation warnings in the prior minor version.

### Stable Public API (v0.2.x)
All functions exported via `pygpukit.*` are part of the stable public API:

| Category | Functions |
|----------|-----------|
| **Factory** | `zeros`, `ones`, `empty`, `from_numpy` |
| **Elementwise** | `add`, `sub`, `mul`, `div` |
| **Math** | `exp`, `log`, `relu`, `gelu` |
| **Matrix** | `matmul`, `transpose` |
| **Reductions** | `sum`, `mean`, `max` |
| **Neural** | `layernorm`, `bias_add_inplace`, `linear_bias_gelu` |
| **Types** | `GPUArray`, `DataType`, `float32`, `float64`, `float16`, `bfloat16` |
| **LLM** | `llm.SafeTensorsFile`, `llm.Tokenizer`, `llm.GPT2Model`, `llm.Linear` |

### Deprecation Policy
APIs to be removed will emit `DeprecationWarning` for at least one minor version before removal.
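
One common way to implement such a policy is a small decorator; this is an illustrative sketch, not PyGPUkit's actual mechanism:

```python
import functools
import warnings

def deprecated(replacement: str):
    # Decorator sketch: wrap a function so each call emits a DeprecationWarning
    # naming its replacement, while still delegating to the original.
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            warnings.warn(
                f"{fn.__name__} is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return fn(*args, **kwargs)
        return inner
    return wrap
```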

---

16 changes: 16 additions & 0 deletions build_cuda12.bat
@@ -0,0 +1,16 @@
@echo off
REM Build PyGPUkit with CUDA 12.4 using Ninja generator
REM This script sets up VS environment for cl.exe and uses CUDA 12.4

call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4
set CudaToolkitDir=%CUDA_PATH%
set PATH=%CUDA_PATH%\bin;%PATH%

echo.
echo Building PyGPUkit with CUDA 12.4 (Ninja generator)...
echo CUDA_PATH=%CUDA_PATH%
echo.

pip install -e . --no-build-isolation -v
16 changes: 16 additions & 0 deletions build_cuda13.bat
@@ -0,0 +1,16 @@
@echo off
REM Build PyGPUkit with CUDA 13.1 using Ninja generator
REM This script sets up VS environment for cl.exe and uses CUDA 13.1

call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1
set CUDA_PATH_V13_1=%CUDA_PATH%
set PATH=%CUDA_PATH%\bin;%PATH%

echo.
echo Building PyGPUkit with CUDA 13.1 (Ninja generator)...
echo CUDA_PATH=%CUDA_PATH%
echo.

pip install -e . --no-build-isolation -v