Merged
52 commits
6e173b9
chore: bump version to 0.2.15
m96-chan Dec 23, 2025
d03df85
feat(asr): add Whisper audio preprocessing (#103)
m96-chan Dec 23, 2025
e6f7bb0
feat(asr): add Whisper model loader (#100)
m96-chan Dec 23, 2025
b47de57
feat(asr): add Whisper encoder (#101)
m96-chan Dec 23, 2025
a51ad3f
feat(asr): add Whisper decoder (#102)
m96-chan Dec 23, 2025
d3f6d40
feat(asr): add WhisperModel with streaming inference (#104)
m96-chan Dec 23, 2025
18f694b
docs: update project structure with ASR module
m96-chan Dec 23, 2025
ed01c6d
feat(examples): add real-time STT demo with Whisper
m96-chan Dec 23, 2025
1ee832b
fix(asr): handle bfloat16 tensors without PyTorch
m96-chan Dec 23, 2025
7431270
fix(asr): handle optional bias weights in encoder/decoder
m96-chan Dec 23, 2025
afaee7f
feat(examples): add microphone device selection options
m96-chan Dec 23, 2025
186fdf9
fix(asr): use to_numpy() instead of numpy() for GPUArray
m96-chan Dec 23, 2025
ca21f87
fix(asr): convert GPUArray to numpy before mel spectrogram computation
m96-chan Dec 23, 2025
c6f729f
feat(core): add scalar arithmetic support to GPUArray
m96-chan Dec 23, 2025
9531a85
feat(core): add transpose and reshape methods to GPUArray
m96-chan Dec 23, 2025
f9a736c
feat(core): add __getitem__ for array indexing and slicing
m96-chan Dec 23, 2025
eeee4fa
fix(asr): fix positional embedding shape mismatch in encoder/decoder
m96-chan Dec 23, 2025
0acbd8d
fix(asr): complete Whisper inference pipeline
m96-chan Dec 23, 2025
afec9b1
feat(ops): add GPU kernels for 4D tensor operations
m96-chan Dec 23, 2025
9ae317a
fix(ops): SM 120 (Blackwell) compatibility for CUTLASS/cuBLASLt
m96-chan Dec 23, 2025
a92dc8f
feat(build): default to CUDA 13.1, add FP8 SM120 infrastructure
m96-chan Dec 23, 2025
0bea5de
fix(ci): use SM 120a for full accelerated features
m96-chan Dec 23, 2025
5277bbb
feat(fp8): add SM90 (Hopper) FP8 GEMM fallback for SM120
m96-chan Dec 23, 2025
c081607
feat(fp8): add SM100 FP8 GEMM (Blackwell datacenter)
m96-chan Dec 23, 2025
40369a2
fix(cutlass): SM120 fallback to CUTLASS 2.x TensorCore kernels
m96-chan Dec 23, 2025
e1d22d4
feat(gemv): add CUTLASS-based GEMV kernel for M=1 decode path
m96-chan Dec 23, 2025
dc8225a
perf(gemv): add vectorized BF16x2 loads for 25-40% speedup
m96-chan Dec 24, 2025
def852a
feat(gemv): add per-size tuning with if constexpr template dispatch
m96-chan Dec 24, 2025
51c1dfc
feat(transpose): add native GPU transpose kernels for issue #106
m96-chan Dec 24, 2025
a48f664
feat(fp8): SM120 FP8 GEMM with CUTLASS alignment workarounds
m96-chan Dec 24, 2025
1e101f8
wip(fp8): add BF16 I/O FP8 GEMM for SM120 (not working yet)
m96-chan Dec 24, 2025
f851862
chore(deps): switch CUTLASS to fork with SM120 alignment fixes
m96-chan Dec 25, 2025
a311e4b
feat(nvf4): add NVF4 BF16 GEMM kernel for SM120
m96-chan Dec 25, 2025
5b77c57
refactor(fp8): remove redundant FP8 BF16 SM120 variant
m96-chan Dec 25, 2025
abe6ace
chore: add missing SM120 alignment header and FP8 test
m96-chan Dec 25, 2025
580d76d
feat(gemv): add NVF4 GEMV kernel for SM120 with pre-scaled LUT optimi…
m96-chan Dec 25, 2025
dbc5635
perf(gemv): add UE4M3 scale LUT for NVF4 GEMV
m96-chan Dec 25, 2025
5a15f1d
docs: add GEMV benchmark comparison to README
m96-chan Dec 25, 2025
3904f54
perf(linear): use GEMV for M=1 decode with zero-copy views
m96-chan Dec 25, 2025
cce16b6
fix(view): keep source reference to prevent use-after-free
m96-chan Dec 25, 2025
65e2c33
feat(cublaslt): add PYGPUKIT_CUBLASLT_SM120 env var for testing
m96-chan Dec 25, 2025
8021aa8
feat(nvf4): GPU-side quantization for 170x speedup on SM120
m96-chan Dec 25, 2025
f2e7bd0
feat(nvf4): add pure NVF4 GEMM benchmark kernel for SM120
m96-chan Dec 25, 2025
7273197
perf(nvf4): optimize BF16->NVF4 quantization with branchless + vector…
m96-chan Dec 25, 2025
39d5349
perf(nvf4): eliminate D2D copy by writing to user buffer directly
m96-chan Dec 25, 2025
51356b5
perf(nvf4): use 3-stage pipeline for Pure NVF4 (446 TFLOPS)
m96-chan Dec 25, 2025
1f708a7
perf(nvf4): vectorize quantize_B + stream overlap (+5% BF16 I/O)
m96-chan Dec 25, 2025
9ac91a0
feat(ops): add missing GPU kernels for inference completeness (#109)
m96-chan Dec 25, 2025
42b64c1
feat(ops): add Medium Priority kernels (#109)
m96-chan Dec 25, 2025
4d64b49
feat(ops): add remaining Medium and Low Priority kernels (#109)
m96-chan Dec 25, 2025
2c35ba4
feat(ops): add Python bindings for Issue #109 kernels
m96-chan Dec 25, 2025
982a8e5
feat(v0.2.15): FP8 I/O GEMM, Pure NVF4, new math ops
m96-chan Dec 26, 2025
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -102,7 +102,7 @@ jobs:
mkdir -p build && cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120" \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120a" \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache \
-Dpybind11_DIR=$(python -c "import pybind11; print(pybind11.get_cmake_dir())")
8 changes: 4 additions & 4 deletions .github/workflows/release.yml
@@ -127,7 +127,7 @@ jobs:
-DCMAKE_BUILD_TYPE=Release \
-DPYBIND11_FINDPYTHON=ON \
-Dpybind11_DIR=$(python -c "import pybind11; print(pybind11.get_cmake_dir())") \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120" \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120a" \
-DMODULE_SUFFIX="_cu131"
cmake --build . --config Release -j$(nproc)

@@ -216,7 +216,7 @@ jobs:
env:
# Skip native build since we have prebuilt modules
PYGPUKIT_SKIP_NATIVE_BUILD: "1"
CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120"
CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120a"

- name: Inject prebuilt native modules into wheel
run: |
@@ -419,7 +419,7 @@ jobs:
-DCMAKE_BUILD_TYPE=Release ^
-DPYBIND11_FINDPYTHON=ON ^
-Dpybind11_DIR="%PYBIND11_DIR%" ^
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120" ^
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120a" ^
-DMODULE_SUFFIX="_cu131"
cmake --build . --config Release

@@ -537,7 +537,7 @@ jobs:
set "PYGPUKIT_SKIP_NATIVE_BUILD=1"
python -m build --wheel
env:
-    CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120"
+    CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120a"

- name: Inject prebuilt native modules into wheel
shell: pwsh
3 changes: 2 additions & 1 deletion .gitmodules
@@ -1,3 +1,4 @@
[submodule "third_party/cutlass"]
path = third_party/cutlass
-	url = https://github.com/NVIDIA/cutlass.git
+	url = https://github.com/m96-chan/cutlass.git
+	branch = fix/sm120-alignment
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,24 @@

All notable changes to PyGPUkit will be documented in this file.

## [0.2.15] - 2025-12-26

### Added
- **FP8 I/O GEMM (SM120)**: Pure FP8 E4M3 input/output GEMM for FP8 model inference
- `matmul_fp8_fp8_sm120`: FP8 GEMM with unity scaling
- `matmul_fp8_fp8_blockwise_sm120`: FP8 GEMM with per-block scale factors
- `fp8_fp8_get_scale_sizes`: Get required scale factor sizes for (M, N, K)
- `fp8_fp8_sm120_available`: Check SM120 FP8 I/O availability
- **Pure NVF4 GEMM**: GPU-side BF16->NVF4 quantization with 3-stage pipeline (446 TFLOPS)
- **New math operations**: sin, cos, sqrt, rsqrt, abs, neg
- **New comparison operations**: clamp, where
- **New activation functions**: sigmoid, tanh
- **New reduction operations**: argmax, min, sum_axis
- **uint8/int8 NumPy support**: `from_numpy` now supports uint8 and int8 arrays

### Changed
- Renamed `matmul_fp8_sm120.cu` to `matmul_fp8_fp32_sm120.cu` for clarity (FP8 compute, FP32 output)

## [0.2.14] - 2025-12-23

### Fixed
35 changes: 33 additions & 2 deletions CLAUDE.md
@@ -35,6 +35,19 @@ The core scheduling, memory management, GPU coordination, and performance-critic
```
PyGPUkit/
├── src/pygpukit/ # Python API (NumPy-compatible)
│ ├── core/ # GPUArray, backend abstraction
│ ├── ops/ # GPU operations (matmul, nn, audio, etc.)
│ ├── llm/ # LLM inference (Qwen, LLaMA)
│ │ ├── models/ # Model implementations
│ │ └── sampling/ # Token sampling strategies
│ └── asr/ # Speech recognition (Whisper)
│ ├── preprocessing.py # Audio preprocessing (mel, normalize)
│ └── whisper/ # Whisper model implementation
│ ├── config.py # WhisperConfig
│ ├── loader.py # SafeTensors loader
│ ├── encoder.py # Whisper encoder
│ ├── decoder.py # Whisper decoder
│ └── model.py # WhisperModel high-level API
├── native/
│ ├── core/ # C++ (CUDA Runtime/Driver API)
│ ├── jit/ # C++ (NVRTC)
@@ -48,9 +61,20 @@ PyGPUkit/
│ │ └── device.rs # DeviceCapabilities, KernelType
│ └── pygpukit-python/ # PyO3 bindings
├── examples/
├── benchmarks/ # Performance benchmarks
└── tests/
```

### Module Separation Policy

| Module | Purpose | Input | Output |
|--------|---------|-------|--------|
| `llm/` | Text generation | Text tokens | Text tokens |
| `asr/` | Speech recognition | Audio waveform | Text |
| `ops/` | Low-level GPU ops | GPUArray | GPUArray |

**Rationale**: Modules are separated by **modality** (audio vs text), not by architecture (transformer). This follows industry conventions (HuggingFace, OpenAI API) and enables clean future expansion (TTS, vision, etc.).
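
A sketch of how the split looks from user code (the module path follows the tree above; the constructor and method names here are assumptions, not the documented API):

```python
import numpy as np
from pygpukit.asr.whisper.model import WhisperModel  # module path per the tree above

# Hypothetical loader/method names -- illustrative only
model = WhisperModel("whisper-base.safetensors")  # SafeTensors loader (see loader.py)
audio = np.zeros(16000, dtype=np.float32)         # 1 s of 16 kHz waveform
text = model.transcribe(audio)                    # asr/: waveform in, text out
```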

### Language Responsibilities

| Component | Language | Reason |
@@ -530,7 +554,7 @@ Edit → Build → Validate → Benchmark → Commit
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86 # SM 86 only (RTX 3090 Ti)
./build.sh 120 # SM 120 only (RTX 5090)
-./build.sh # default: SM 86
+./build.sh # default: SM 120a
```

**Building from Windows cmd.exe (alternative):**
@@ -939,11 +963,18 @@ accepted_tokens = model.jacobi_decode_step(draft_tokens, position)
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86 # SM 86 only (RTX 3090 Ti)
./build.sh 120 # SM 120 only (RTX 5090)
-./build.sh # default: SM 86
+./build.sh # default: SM 120a
```

**Supported SMs:** 80, 86, 89, 90, 100, 120

### Local Development Hardware

| Machine | GPU | SM | CUDA Toolkit | Notes |
|---------|-----|-----|--------------|-------|
| Primary | RTX 5090 | 120 | 13.1 | Blackwell GeForce, FP8 testing |
| Secondary | RTX 3090 Ti | 86 | 12.x | Ampere, TF32 benchmarks |

### Tokenizer

**Do not use a PyGPUkit built-in tokenizer. Use the HuggingFace `tokenizers` library.**
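
For example, loading a model's tokenizer (a minimal sketch using the public `tokenizers` API):

```python
from tokenizers import Tokenizer

# Load the tokenizer.json shipped alongside the model weights
tok = Tokenizer.from_file("tokenizer.json")

ids = tok.encode("Hello, world!").ids  # text -> token ids
text = tok.decode(ids)                 # token ids -> text
```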
120 changes: 118 additions & 2 deletions README.md
@@ -33,6 +33,90 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea

---

## What's New in v0.2.15

### FP8 I/O GEMM (SM120)
Pure FP8 input/output GEMM for FP8 model inference (Llama 3.1 FP8, Qwen FP8, etc.):

| Function | Description |
|----------|-------------|
| `matmul_fp8_fp8_sm120` | FP8 E4M3 input -> FP8 E4M3 output (unity scaling) |
| `matmul_fp8_fp8_blockwise_sm120` | FP8 with block-wise scale_A / scale_B |
| `fp8_fp8_get_scale_sizes` | Get required scale factor sizes for (M, N, K) |
| `fp8_fp8_sm120_available` | Check SM120 FP8 I/O availability |

```python
import pygpukit as gpk
import numpy as np

M, N, K = 4096, 4096, 4096  # problem size (illustrative)

# Check availability
if gpk.fp8_fp8_sm120_available():
    # Get scale sizes for blockwise scaling
    sfa_size, sfb_size = gpk.fp8_fp8_get_scale_sizes(M, N, K)

    # A_fp8, B_fp8: FP8 E4M3 operands already on the GPU (e.g. uploaded as uint8)
    # Blockwise scaled FP8 GEMM (for real FP8 models)
    scale_a = gpk.from_numpy(np.ones(sfa_size, dtype=np.float32))
    scale_b = gpk.from_numpy(np.ones(sfb_size, dtype=np.float32))
    C = gpk.matmul_fp8_fp8_blockwise_sm120(A_fp8, B_fp8, scale_a, scale_b)
```

### Pure NVF4 GEMM (446 TFLOPS)
GPU-side BF16->NVF4 quantization with 3-stage pipeline for maximum throughput:

| Matrix Size | TFLOPS | Notes |
|-------------|--------|-------|
| 8192x8192 | 320 | Branchless vectorized loads |
| 12288x12288 | 400 | 3-stage async pipeline |
| 16384x16384 | **446** | Direct write to user buffer |
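
A usage sketch — the binding name `matmul_nvf4_sm120` below is hypothetical (the pipeline is exposed natively; check the package for the actual Python symbol):

```python
import numpy as np
import pygpukit as gpk

# Inputs start on the GPU; quantization to NVF4 happens on-device inside the
# kernel, so there are no host-side quantization passes or extra D2D copies.
a = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))
b = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))

c = gpk.matmul_nvf4_sm120(a, b)  # hypothetical name for the pure NVF4 GEMM
```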

### New Math Operations
Extended math operations for GPU computing:

| Category | Operations |
|----------|------------|
| **Trigonometric** | `sin`, `cos` |
| **Power/Root** | `sqrt`, `rsqrt` |
| **Sign** | `abs`, `neg` |
| **Comparison** | `clamp`, `where` |
| **Activation** | `sigmoid`, `tanh` |
| **Reduction** | `argmax`, `min`, `sum_axis` |

```python
import pygpukit as gpk

# Trigonometric
y = gpk.sin(x)
y = gpk.cos(x)

# Power operations
y = gpk.sqrt(x)
y = gpk.rsqrt(x) # 1/sqrt(x)

# Element-wise comparison
y = gpk.clamp(x, min_val=-1.0, max_val=1.0)
y = gpk.where(cond, x, y) # cond ? x : y

# New activations
y = gpk.sigmoid(x)
y = gpk.tanh(x)

# New reductions
idx = gpk.argmax(x) # Index of maximum
val = gpk.min(x) # Minimum value
y = gpk.sum_axis(x, 1) # Sum along axis
```

### uint8/int8 NumPy Support
`from_numpy` now supports uint8 and int8 arrays for FP8 data handling:

```python
# FP8 data stored as uint8
fp8_data = np.array([...], dtype=np.uint8)
gpu_fp8 = gpk.from_numpy(fp8_data)
```
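
Signed int8 round-trips the same way (a minimal sketch, using `to_numpy` for readback):

```python
import numpy as np
import pygpukit as gpk

# int8 data uploads and reads back unchanged
q = np.array([-128, 0, 127], dtype=np.int8)
gq = gpk.from_numpy(q)
assert (gq.to_numpy() == q).all()
```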

---

## What's New in v0.2.14

### Packaging Fixes
@@ -43,10 +43,10 @@ v0.2.13 and v0.2.14 fix wheel RECORD file issues that caused PyPI deprecation wa
| v0.2.14 | Windows wheel missing `licenses/LICENSE` in RECORD | Added `-Recurse` to scan dist-info subdirectories |
| v0.2.13 | Hardcoded version in release workflow | Dynamic dist-info folder detection |

-**Recommended:** Use v0.2.14 or later.
+**Recommended:** Use v0.2.15 or later.

```bash
-pip install pygpukit>=0.2.14
+pip install pygpukit>=0.2.15
```

---
@@ -530,6 +614,37 @@ print(f"NVRTC Path: {gp.get_nvrtc_path()}") # Path to NVRTC DLL (if available)

> **Note:** CUTLASS is automatic for compatible sizes (16-aligned). Use `PYGPUKIT_NO_TF32=1` for full FP32 precision.

### GEMV Performance (RTX 5090, SM120a)

For LLM decode (M=1), custom GEMV kernels significantly outperform cuBLASLt:

| Model Layer | K | N | cuBLASLt | BF16 GEMV | NVF4 GEMV | Memory |
|-------------|------|-------|----------|-----------|-----------|--------|
| Qwen-7B hidden | 4096 | 4096 | 413us | **97us** | 152us | 73% less |
| Qwen-7B MLP | 4096 | 11008 | 418us | **96us** | 153us | 73% less |
| Qwen-72B hidden | 8192 | 8192 | 799us | 266us | **265us** | 73% less |
| Qwen-72B MLP | 8192 | 29568 | 1603us | **375us** | 454us | 73% less |

| Kernel | Description | Use Case |
|--------|-------------|----------|
| **BF16 GEMV** | Custom BF16 kernel optimized for M=1 | Speed priority |
| **NVF4 GEMV** | 4-bit NVF4 weights with block scaling | Memory priority (73% reduction) |

> **Note:** For large K (8192+), NVF4 matches BF16 speed while using 73% less memory. Ideal for memory-constrained LLM inference.
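
For example, a single decode step is just an M=1 matmul; a minimal sketch, assuming the generic `gpk.matmul` entry point dispatches to the GEMV path for M=1 (as the zero-copy-view commit suggests):

```python
import numpy as np
import pygpukit as gpk

# One decode step: the activation has a single row (M=1)
x = gpk.from_numpy(np.random.randn(1, 4096).astype(np.float32))     # (M=1, K)
w = gpk.from_numpy(np.random.randn(4096, 4096).astype(np.float32))  # (K, N)

y = gpk.matmul(x, w)  # M=1 takes the custom GEMV kernel, not cuBLASLt
```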

### NVF4-BF16 GEMM Performance (RTX 5090, SM120a)

4-bit NVF4 GEMM with BF16 I/O using CUTLASS block-scaled tensor operations:

| Matrix Size | TFLOPS (median) | TFLOPS (max) | Time (ms) |
|-------------|-----------------|--------------|-----------|
| 4096×4096 | 53 | 55 | 2.6 |
| 8192×8192 | 141 | 143 | 7.8 |
| 12288×12288 | 201 | 216 | 18.5 |
| 16384×16384 | **246** | **252** | 35.8 |

> **Note:** GPU-side BF16→NVF4 quantization with unit scaling. No host-device copies. Ideal for memory-bound LLM inference with 4x bandwidth reduction vs BF16.

---

## Installation
@@ -695,6 +810,7 @@ PyGPUkit/
| **v0.2.10** | **Dynamic cuBLASLt loading**, CUDA Graph optimizations, descriptor caching |
| **v0.2.11** | **Batch decode** (6.8x speedup), Decode Strategy framework, Driver API async, Dual CUDA builds, RTX 5090 (SM120) |
| **v0.2.12** | **Advanced audio processing** (ISTFT, Griffin-Lim, HPSS, CQT, pitch detection, time stretch) |
| **v0.2.15** | **FP8 I/O GEMM** (blockwise scaling), Pure NVF4 (446 TFLOPS), New math ops (sin, cos, sqrt, rsqrt, abs, neg, clamp, where, sigmoid, tanh, argmax, min, sum_axis) |

### Planned
