Merged
12 changes: 9 additions & 3 deletions .github/workflows/release.yml
@@ -89,7 +89,9 @@ jobs:
run: |
python -m build --wheel
env:
-        CMAKE_CUDA_ARCHITECTURES: "70;75;80;86;89;90"
+        # PyGPUkit requires SM >= 80 (Ampere and newer)
+        # SM100/120 (Blackwell) requires CUDA 13.x - not available on GitHub runners yet
+        CMAKE_CUDA_ARCHITECTURES: "80;86;89;90"

- name: Show wheel info before repair
run: |
@@ -187,11 +189,15 @@ jobs:
Get-ChildItem ../../src/pygpukit/*.pyd

- name: Build wheel (C++ + Rust)
-      shell: pwsh
+      shell: cmd
       run: |
+        @REM Set up VS environment for cl.exe
+        call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
         python -m build --wheel
       env:
-        CMAKE_CUDA_ARCHITECTURES: "86"
+        # PyGPUkit requires SM >= 80 (Ampere and newer)
+        # Self-hosted runner should have CUDA 13.1 for SM100/120 (Blackwell) support
+        CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120"

- name: Verify wheel contents
shell: pwsh
106 changes: 104 additions & 2 deletions README.md
@@ -7,6 +7,18 @@

---

## Documentation

| Guide | Description |
|-------|-------------|
| [Getting Started](docs/getting-started.md) | Installation, quick start, basic usage |
| [API Reference](docs/api.md) | Complete API documentation with examples |
| [LLM Guide](docs/llm.md) | SafeTensors, Tokenizer, GPT-2 model loading |
| [Performance Tuning](docs/performance.md) | TF32, FP16, CUTLASS optimization |
| [Scheduler Guide](docs/scheduler.md) | Multi-LLM concurrent execution |

---

## Overview
**PyGPUkit** is a lightweight GPU runtime for Python that provides:
- **Single-binary distribution** — works with just GPU drivers, no CUDA Toolkit needed
@@ -21,6 +33,71 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea

---

## What's New in v0.2.7

### CUTLASS Epilogue Fusion
Linear + bias + GELU is now fused into a single CUTLASS kernel via epilogue fusion, eliminating intermediate memory traffic and improving performance in transformer workloads.

```python
import pygpukit as gpk
import numpy as np

# Create tensors
batch, in_feat, out_feat = 512, 768, 3072
input = gpk.from_numpy(np.random.randn(batch, in_feat).astype(np.float32))
weight = gpk.from_numpy(np.random.randn(out_feat, in_feat).astype(np.float32))
bias = gpk.from_numpy(np.random.randn(out_feat).astype(np.float32))

# Fused linear + bias + GELU (single kernel, no intermediate memory)
output = gpk.linear_bias_gelu(input, weight, bias)
```

### Multi-SM CUTLASS Kernels
Runtime SM detection with optimized kernel variants:
- **SM80 (A100)**: 4-stage pipeline optimized for 48KB shared memory
- **SM86+ (RTX 30xx/40xx, H100)**: 5-stage pipeline for 100KB+ shared memory
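
The variant selection described above can be sketched roughly as follows. This is a hypothetical helper, not PyGPUkit's actual internals; in practice the compute capability would come from the CUDA driver API.

```python
# Hypothetical dispatch sketch: choose a CUTLASS kernel variant from the
# device's compute capability (major, minor).
def pick_gemm_variant(major: int, minor: int) -> str:
    sm = major * 10 + minor
    if sm < 80:
        raise RuntimeError("PyGPUkit requires SM >= 80 (Ampere or newer)")
    # SM80 (A100): 4-stage pipeline; SM86 and newer: 5-stage pipeline
    return "gemm_sm80_4stage" if sm == 80 else "gemm_sm86_5stage"
```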

### New Operations
| Operation | Description |
|-----------|-------------|
| `gpk.transpose(a)` | GPU-native matrix transpose |
| `gpk.bias_add_inplace(out, bias)` | In-place bias addition |
| `gpk.linear_bias_gelu(x, w, b)` | Fused linear + bias + GELU |
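
As a rough sketch of the intended semantics, here are NumPy reference equivalents of these operations (behavioral sketches only, not the GPU kernels; whether PyGPUkit uses the tanh GELU approximation is an assumption):

```python
import numpy as np

def transpose(a):
    # gpk.transpose: matrix transpose
    return np.ascontiguousarray(a.T)

def bias_add_inplace(out, bias):
    # gpk.bias_add_inplace: add bias to each row of out, in place
    out += bias
    return out

def gelu(x):
    # tanh-approximation GELU (assumed variant)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_bias_gelu(x, w, b):
    # gpk.linear_bias_gelu: gelu(x @ w.T + b), with w shaped (out_feat, in_feat)
    return gelu(x @ w.T + b)
```

Note the weight layout matches the fusion example above: `w` is `(out_feat, in_feat)`, so the contraction is against `w.T`.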

### API Improvements
- Complete public API exports (all operations accessible via `gpk.*`)
- Consistent snake_case naming convention
- Full docstrings for all public functions

---

## LLM Support

PyGPUkit includes built-in support for loading and running LLM models.
See the [LLM Guide](docs/llm.md) for detailed documentation.

```python
from pygpukit.llm import SafeTensorsFile, Tokenizer

# Load safetensors (memory-mapped, zero-copy)
st = SafeTensorsFile("model.safetensors")
print(f"Tensors: {st.num_tensors}, Size: {st.file_size / 1e9:.2f} GB")

# Tokenizer (HuggingFace format)
tok = Tokenizer("tokenizer.json")
ids = tok.encode("Hello, world!")
text = tok.decode(ids)
```

| Component | Description |
|-----------|-------------|
| `SafeTensorsFile` | Memory-mapped .safetensors loading |
| `Tokenizer` | BPE tokenizer (HuggingFace format) |
| `GPT2Model` | GPT-2 model (MLP-only MVP) |
| `Linear`, `LayerNorm`, `MLP` | Model building blocks |
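
The zero-copy loading relies on the .safetensors layout: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. A minimal header parser illustrating the format (a standalone sketch, independent of PyGPUkit's implementation):

```python
import json
import struct

def read_safetensors_header(path):
    # .safetensors layout: u64 LE header size, JSON header, then tensor data.
    # The header maps tensor name -> {"dtype", "shape", "data_offsets"}.
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))
```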

---

## What's New in v0.2.6

### CUTLASS Backend (Default)
@@ -351,13 +428,38 @@ PyGPUkit/
| **v0.2.4** | **Single-binary distribution**, dynamic NVRTC, driver-only mode |
| **v0.2.5** | **FP16/BF16 support**, reduction ops, operator overloads, TF32 v2 (~30 TFLOPS) |
| **v0.2.6** | **CUTLASS backend** (31 TFLOPS TF32, 63 TFLOPS FP16/BF16), Multi-LLM concurrent execution |
| **v0.2.7** | **Epilogue fusion** (linear+bias+gelu), Multi-SM kernels, API review |

### Planned

| Version | Goals |
|---------|-------|
-| **v0.2.7** | Full API review, documentation, backward compatibility |
-| **v0.3** | Triton backend, advanced ops (softmax, layernorm), MPS/MIG |
+| **v0.3** | Triton backend, advanced ops (softmax), MPS/MIG |

---

## API Stability & Backward Compatibility

### Version Policy
- **v0.2.x**: Backward compatible within minor versions. New features may be added, but existing APIs remain stable.
- **v0.3+**: May introduce breaking changes, preceded by deprecation warnings in the prior minor version.

### Stable Public API (v0.2.x)
All functions exported via `pygpukit.*` are part of the stable public API:

| Category | Functions |
|----------|-----------|
| **Factory** | `zeros`, `ones`, `empty`, `from_numpy` |
| **Elementwise** | `add`, `sub`, `mul`, `div` |
| **Math** | `exp`, `log`, `relu`, `gelu` |
| **Matrix** | `matmul`, `transpose` |
| **Reductions** | `sum`, `mean`, `max` |
| **Neural** | `layernorm`, `bias_add_inplace`, `linear_bias_gelu` |
| **Types** | `GPUArray`, `DataType`, `float32`, `float64`, `float16`, `bfloat16` |
| **LLM** | `llm.SafeTensorsFile`, `llm.Tokenizer`, `llm.GPT2Model`, `llm.Linear` |

### Deprecation Policy
APIs to be removed will emit `DeprecationWarning` for at least one minor version before removal.
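
One common way to implement such a policy is a small decorator; this is an illustrative sketch, not PyGPUkit's actual mechanism:

```python
import functools
import warnings

def deprecated(replacement: str):
    # Decorator sketch: wrap a function so each call emits a DeprecationWarning
    # naming its replacement, while still delegating to the original.
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            warnings.warn(
                f"{fn.__name__} is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return fn(*args, **kwargs)
        return inner
    return wrap
```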

---

16 changes: 16 additions & 0 deletions build_cuda12.bat
@@ -0,0 +1,16 @@
@echo off
REM Build PyGPUkit with CUDA 12.4 using Ninja generator
REM This script sets up VS environment for cl.exe and uses CUDA 12.4

call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4
set CudaToolkitDir=%CUDA_PATH%
set PATH=%CUDA_PATH%\bin;%PATH%

echo.
echo Building PyGPUkit with CUDA 12.4 (Ninja generator)...
echo CUDA_PATH=%CUDA_PATH%
echo.

pip install -e . --no-build-isolation -v
16 changes: 16 additions & 0 deletions build_cuda13.bat
@@ -0,0 +1,16 @@
@echo off
REM Build PyGPUkit with CUDA 13.1 using Ninja generator
REM This script sets up VS environment for cl.exe and uses CUDA 13.1

call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1
set CUDA_PATH_V13_1=%CUDA_PATH%
set PATH=%CUDA_PATH%\bin;%PATH%

echo.
echo Building PyGPUkit with CUDA 13.1 (Ninja generator)...
echo CUDA_PATH=%CUDA_PATH%
echo.

pip install -e . --no-build-isolation -v