Merged
52 commits
6e173b9
chore: bump version to 0.2.15
m96-chan Dec 23, 2025
d03df85
feat(asr): add Whisper audio preprocessing (#103)
m96-chan Dec 23, 2025
e6f7bb0
feat(asr): add Whisper model loader (#100)
m96-chan Dec 23, 2025
b47de57
feat(asr): add Whisper encoder (#101)
m96-chan Dec 23, 2025
a51ad3f
feat(asr): add Whisper decoder (#102)
m96-chan Dec 23, 2025
d3f6d40
feat(asr): add WhisperModel with streaming inference (#104)
m96-chan Dec 23, 2025
18f694b
docs: update project structure with ASR module
m96-chan Dec 23, 2025
ed01c6d
feat(examples): add real-time STT demo with Whisper
m96-chan Dec 23, 2025
1ee832b
fix(asr): handle bfloat16 tensors without PyTorch
m96-chan Dec 23, 2025
7431270
fix(asr): handle optional bias weights in encoder/decoder
m96-chan Dec 23, 2025
afaee7f
feat(examples): add microphone device selection options
m96-chan Dec 23, 2025
186fdf9
fix(asr): use to_numpy() instead of numpy() for GPUArray
m96-chan Dec 23, 2025
ca21f87
fix(asr): convert GPUArray to numpy before mel spectrogram computation
m96-chan Dec 23, 2025
c6f729f
feat(core): add scalar arithmetic support to GPUArray
m96-chan Dec 23, 2025
9531a85
feat(core): add transpose and reshape methods to GPUArray
m96-chan Dec 23, 2025
f9a736c
feat(core): add __getitem__ for array indexing and slicing
m96-chan Dec 23, 2025
eeee4fa
fix(asr): fix positional embedding shape mismatch in encoder/decoder
m96-chan Dec 23, 2025
0acbd8d
fix(asr): complete Whisper inference pipeline
m96-chan Dec 23, 2025
afec9b1
feat(ops): add GPU kernels for 4D tensor operations
m96-chan Dec 23, 2025
9ae317a
fix(ops): SM 120 (Blackwell) compatibility for CUTLASS/cuBLASLt
m96-chan Dec 23, 2025
a92dc8f
feat(build): default to CUDA 13.1, add FP8 SM120 infrastructure
m96-chan Dec 23, 2025
0bea5de
fix(ci): use SM 120a for full accelerated features
m96-chan Dec 23, 2025
5277bbb
feat(fp8): add SM90 (Hopper) FP8 GEMM fallback for SM120
m96-chan Dec 23, 2025
c081607
feat(fp8): add SM100 FP8 GEMM (Blackwell datacenter)
m96-chan Dec 23, 2025
40369a2
fix(cutlass): SM120 fallback to CUTLASS 2.x TensorCore kernels
m96-chan Dec 23, 2025
e1d22d4
feat(gemv): add CUTLASS-based GEMV kernel for M=1 decode path
m96-chan Dec 23, 2025
dc8225a
perf(gemv): add vectorized BF16x2 loads for 25-40% speedup
m96-chan Dec 24, 2025
def852a
feat(gemv): add per-size tuning with if constexpr template dispatch
m96-chan Dec 24, 2025
51c1dfc
feat(transpose): add native GPU transpose kernels for issue #106
m96-chan Dec 24, 2025
a48f664
feat(fp8): SM120 FP8 GEMM with CUTLASS alignment workarounds
m96-chan Dec 24, 2025
1e101f8
wip(fp8): add BF16 I/O FP8 GEMM for SM120 (not working yet)
m96-chan Dec 24, 2025
f851862
chore(deps): switch CUTLASS to fork with SM120 alignment fixes
m96-chan Dec 25, 2025
a311e4b
feat(nvf4): add NVF4 BF16 GEMM kernel for SM120
m96-chan Dec 25, 2025
5b77c57
refactor(fp8): remove redundant FP8 BF16 SM120 variant
m96-chan Dec 25, 2025
abe6ace
chore: add missing SM120 alignment header and FP8 test
m96-chan Dec 25, 2025
580d76d
feat(gemv): add NVF4 GEMV kernel for SM120 with pre-scaled LUT optimi…
m96-chan Dec 25, 2025
dbc5635
perf(gemv): add UE4M3 scale LUT for NVF4 GEMV
m96-chan Dec 25, 2025
5a15f1d
docs: add GEMV benchmark comparison to README
m96-chan Dec 25, 2025
3904f54
perf(linear): use GEMV for M=1 decode with zero-copy views
m96-chan Dec 25, 2025
cce16b6
fix(view): keep source reference to prevent use-after-free
m96-chan Dec 25, 2025
65e2c33
feat(cublaslt): add PYGPUKIT_CUBLASLT_SM120 env var for testing
m96-chan Dec 25, 2025
8021aa8
feat(nvf4): GPU-side quantization for 170x speedup on SM120
m96-chan Dec 25, 2025
f2e7bd0
feat(nvf4): add pure NVF4 GEMM benchmark kernel for SM120
m96-chan Dec 25, 2025
7273197
perf(nvf4): optimize BF16->NVF4 quantization with branchless + vector…
m96-chan Dec 25, 2025
39d5349
perf(nvf4): eliminate D2D copy by writing to user buffer directly
m96-chan Dec 25, 2025
51356b5
perf(nvf4): use 3-stage pipeline for Pure NVF4 (446 TFLOPS)
m96-chan Dec 25, 2025
1f708a7
perf(nvf4): vectorize quantize_B + stream overlap (+5% BF16 I/O)
m96-chan Dec 25, 2025
9ac91a0
feat(ops): add missing GPU kernels for inference completeness (#109)
m96-chan Dec 25, 2025
42b64c1
feat(ops): add Medium Priority kernels (#109)
m96-chan Dec 25, 2025
4d64b49
feat(ops): add remaining Medium and Low Priority kernels (#109)
m96-chan Dec 25, 2025
2c35ba4
feat(ops): add Python bindings for Issue #109 kernels
m96-chan Dec 25, 2025
982a8e5
feat(v0.2.15): FP8 I/O GEMM, Pure NVF4, new math ops
m96-chan Dec 26, 2025
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -102,7 +102,7 @@ jobs:
mkdir -p build && cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120" \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120a" \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache \
-Dpybind11_DIR=$(python -c "import pybind11; print(pybind11.get_cmake_dir())")
8 changes: 4 additions & 4 deletions .github/workflows/release.yml
@@ -127,7 +127,7 @@ jobs:
-DCMAKE_BUILD_TYPE=Release \
-DPYBIND11_FINDPYTHON=ON \
-Dpybind11_DIR=$(python -c "import pybind11; print(pybind11.get_cmake_dir())") \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120" \
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120a" \
-DMODULE_SUFFIX="_cu131"
cmake --build . --config Release -j$(nproc)

@@ -216,7 +216,7 @@ jobs:
env:
# Skip native build since we have prebuilt modules
PYGPUKIT_SKIP_NATIVE_BUILD: "1"
CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120"
CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120a"

- name: Inject prebuilt native modules into wheel
run: |
@@ -419,7 +419,7 @@ jobs:
-DCMAKE_BUILD_TYPE=Release ^
-DPYBIND11_FINDPYTHON=ON ^
-Dpybind11_DIR="%PYBIND11_DIR%" ^
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120" ^
-DCMAKE_CUDA_ARCHITECTURES="80;86;89;90;100;120a" ^
-DMODULE_SUFFIX="_cu131"
cmake --build . --config Release

@@ -537,7 +537,7 @@ jobs:
set "PYGPUKIT_SKIP_NATIVE_BUILD=1"
python -m build --wheel
env:
-    CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120"
+    CMAKE_CUDA_ARCHITECTURES: "80;86;89;90;100;120a"

- name: Inject prebuilt native modules into wheel
shell: pwsh
3 changes: 2 additions & 1 deletion .gitmodules
@@ -1,3 +1,4 @@
[submodule "third_party/cutlass"]
path = third_party/cutlass
-	url = https://github.com/NVIDIA/cutlass.git
+	url = https://github.com/m96-chan/cutlass.git
+	branch = fix/sm120-alignment
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,24 @@

All notable changes to PyGPUkit will be documented in this file.

## [0.2.15] - 2025-12-26

### Added
- **FP8 I/O GEMM (SM120)**: Pure FP8 E4M3 input/output GEMM for FP8 model inference
- `matmul_fp8_fp8_sm120`: FP8 GEMM with unity scaling
- `matmul_fp8_fp8_blockwise_sm120`: FP8 GEMM with per-block scale factors
- `fp8_fp8_get_scale_sizes`: Get required scale factor sizes for (M, N, K)
- `fp8_fp8_sm120_available`: Check SM120 FP8 I/O availability
- **Pure NVF4 GEMM**: GPU-side BF16->NVF4 quantization with 3-stage pipeline (446 TFLOPS)
- **New math operations**: sin, cos, sqrt, rsqrt, abs, neg
- **New comparison operations**: clamp, where
- **New activation functions**: sigmoid, tanh
- **New reduction operations**: argmax, min, sum_axis
- **uint8/int8 NumPy support**: `from_numpy` now supports uint8 and int8 arrays

### Changed
- Renamed `matmul_fp8_sm120.cu` to `matmul_fp8_fp32_sm120.cu` for clarity (FP8 compute, FP32 output)

## [0.2.14] - 2025-12-23

### Fixed
35 changes: 33 additions & 2 deletions CLAUDE.md
@@ -35,6 +35,19 @@ The core scheduling, memory management, GPU coordination, and performance-critic
```
PyGPUkit/
├── src/pygpukit/ # Python API (NumPy-compatible)
│ ├── core/ # GPUArray, backend abstraction
│ ├── ops/ # GPU operations (matmul, nn, audio, etc.)
│ ├── llm/ # LLM inference (Qwen, LLaMA)
│ │ ├── models/ # Model implementations
│ │ └── sampling/ # Token sampling strategies
│ └── asr/ # Speech recognition (Whisper)
│ ├── preprocessing.py # Audio preprocessing (mel, normalize)
│ └── whisper/ # Whisper model implementation
│ ├── config.py # WhisperConfig
│ ├── loader.py # SafeTensors loader
│ ├── encoder.py # Whisper encoder
│ ├── decoder.py # Whisper decoder
│ └── model.py # WhisperModel high-level API
├── native/
│ ├── core/ # C++ (CUDA Runtime/Driver API)
│ ├── jit/ # C++ (NVRTC)
@@ -48,9 +61,20 @@ PyGPUkit/
│ │ └── device.rs # DeviceCapabilities, KernelType
│ └── pygpukit-python/ # PyO3 bindings
├── examples/
├── benchmarks/ # Performance benchmarks
└── tests/
```

### Module Separation Policy

| Module | Purpose | Input | Output |
|--------|---------|-------|--------|
| `llm/` | Text generation | Text tokens | Text tokens |
| `asr/` | Speech recognition | Audio waveform | Text |
| `ops/` | Low-level GPU ops | GPUArray | GPUArray |

**Rationale**: Modules are separated by **modality** (audio vs text), not by architecture (transformer). This follows industry conventions (HuggingFace, OpenAI API) and enables clean future expansion (TTS, vision, etc.).
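
A sketch of how the split looks from user code (the module path follows the tree above; the constructor and method names here are assumptions, not the documented API):

```python
import numpy as np
from pygpukit.asr.whisper.model import WhisperModel  # module path per the tree above

# Hypothetical loader/method names -- illustrative only
model = WhisperModel("whisper-base.safetensors")  # SafeTensors loader (see loader.py)
audio = np.zeros(16000, dtype=np.float32)         # 1 s of 16 kHz waveform
text = model.transcribe(audio)                    # asr/: waveform in, text out
```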

### Language Responsibilities

| Component | Language | Reason |
@@ -530,7 +554,7 @@ Edit → Build → Validate → Benchmark → Commit
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86 # SM 86 only (RTX 3090 Ti)
./build.sh 120 # SM 120 only (RTX 5090)
-./build.sh # default: SM 86
+./build.sh # default: SM 120a
```

**Building from Windows cmd.exe (alternative):**
@@ -939,11 +963,18 @@ accepted_tokens = model.jacobi_decode_step(draft_tokens, position)
cd /d/Projects/m96-chan/PyGPUkit
./build.sh 86 # SM 86 only (RTX 3090 Ti)
./build.sh 120 # SM 120 only (RTX 5090)
-./build.sh # default: SM 86
+./build.sh # default: SM 120a
```

**Supported SMs:** 80, 86, 89, 90, 100, 120

### Local Development Hardware

| Machine | GPU | SM | CUDA Toolkit | Notes |
|---------|-----|-----|--------------|-------|
| Primary | RTX 5090 | 120 | 13.1 | Blackwell GeForce, FP8 testing |
| Secondary | RTX 3090 Ti | 86 | 12.x | Ampere, TF32 benchmarks |

### Tokenizer

**Do not use a PyGPUkit built-in tokenizer. Use the HuggingFace `tokenizers` library.**
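
For example, loading a model's tokenizer (a minimal sketch using the public `tokenizers` API):

```python
from tokenizers import Tokenizer

# Load the tokenizer.json shipped alongside the model weights
tok = Tokenizer.from_file("tokenizer.json")

ids = tok.encode("Hello, world!").ids  # text -> token ids
text = tok.decode(ids)                 # token ids -> text
```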
120 changes: 118 additions & 2 deletions README.md
@@ -33,6 +33,90 @@ PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and idea

---

## What's New in v0.2.15

### FP8 I/O GEMM (SM120)
Pure FP8 input/output GEMM for FP8 model inference (Llama 3.1 FP8, Qwen FP8, etc.):

| Function | Description |
|----------|-------------|
| `matmul_fp8_fp8_sm120` | FP8 E4M3 input -> FP8 E4M3 output (unity scaling) |
| `matmul_fp8_fp8_blockwise_sm120` | FP8 with block-wise scale_A / scale_B |
| `fp8_fp8_get_scale_sizes` | Get required scale factor sizes for (M, N, K) |
| `fp8_fp8_sm120_available` | Check SM120 FP8 I/O availability |

```python
import pygpukit as gpk
import numpy as np

M, N, K = 4096, 4096, 4096  # problem size (illustrative)

# Check availability
if gpk.fp8_fp8_sm120_available():
    # Get scale sizes for blockwise scaling
    sfa_size, sfb_size = gpk.fp8_fp8_get_scale_sizes(M, N, K)

    # A_fp8, B_fp8: FP8 E4M3 operands already on the GPU (e.g. uploaded as uint8)
    # Blockwise scaled FP8 GEMM (for real FP8 models)
    scale_a = gpk.from_numpy(np.ones(sfa_size, dtype=np.float32))
    scale_b = gpk.from_numpy(np.ones(sfb_size, dtype=np.float32))
    C = gpk.matmul_fp8_fp8_blockwise_sm120(A_fp8, B_fp8, scale_a, scale_b)
```

### Pure NVF4 GEMM (446 TFLOPS)
GPU-side BF16->NVF4 quantization with 3-stage pipeline for maximum throughput:

| Matrix Size | TFLOPS | Notes |
|-------------|--------|-------|
| 8192x8192 | 320 | Branchless vectorized loads |
| 12288x12288 | 400 | 3-stage async pipeline |
| 16384x16384 | **446** | Direct write to user buffer |
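
A usage sketch — the binding name `matmul_nvf4_sm120` below is hypothetical (the pipeline is exposed natively; check the package for the actual Python symbol):

```python
import numpy as np
import pygpukit as gpk

# Inputs start on the GPU; quantization to NVF4 happens on-device inside the
# kernel, so there are no host-side quantization passes or extra D2D copies.
a = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))
b = gpk.from_numpy(np.random.randn(8192, 8192).astype(np.float32))

c = gpk.matmul_nvf4_sm120(a, b)  # hypothetical name for the pure NVF4 GEMM
```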

### New Math Operations
Extended math operations for GPU computing:

| Category | Operations |
|----------|------------|
| **Trigonometric** | `sin`, `cos` |
| **Power/Root** | `sqrt`, `rsqrt` |
| **Sign** | `abs`, `neg` |
| **Comparison** | `clamp`, `where` |
| **Activation** | `sigmoid`, `tanh` |
| **Reduction** | `argmax`, `min`, `sum_axis` |

```python
import pygpukit as gpk

# Trigonometric
y = gpk.sin(x)
y = gpk.cos(x)

# Power operations
y = gpk.sqrt(x)
y = gpk.rsqrt(x) # 1/sqrt(x)

# Element-wise comparison
y = gpk.clamp(x, min_val=-1.0, max_val=1.0)
y = gpk.where(cond, x, y) # cond ? x : y

# New activations
y = gpk.sigmoid(x)
y = gpk.tanh(x)

# New reductions
idx = gpk.argmax(x) # Index of maximum
val = gpk.min(x) # Minimum value
y = gpk.sum_axis(x, 1) # Sum along axis
```

### uint8/int8 NumPy Support
`from_numpy` now supports uint8 and int8 arrays for FP8 data handling:

```python
# FP8 data stored as uint8
fp8_data = np.array([...], dtype=np.uint8)
gpu_fp8 = gpk.from_numpy(fp8_data)
```
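
Signed int8 round-trips the same way (a minimal sketch, using `to_numpy` for readback):

```python
import numpy as np
import pygpukit as gpk

# int8 data uploads and reads back unchanged
q = np.array([-128, 0, 127], dtype=np.int8)
gq = gpk.from_numpy(q)
assert (gq.to_numpy() == q).all()
```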

---

## What's New in v0.2.14

### Packaging Fixes
@@ -43,10 +43,10 @@ v0.2.13 and v0.2.14 fix wheel RECORD file issues that caused PyPI deprecation wa
| v0.2.14 | Windows wheel missing `licenses/LICENSE` in RECORD | Added `-Recurse` to scan dist-info subdirectories |
| v0.2.13 | Hardcoded version in release workflow | Dynamic dist-info folder detection |

-**Recommended:** Use v0.2.14 or later.
+**Recommended:** Use v0.2.15 or later.

```bash
-pip install pygpukit>=0.2.14
+pip install pygpukit>=0.2.15
```

---
@@ -530,6 +614,37 @@ print(f"NVRTC Path: {gp.get_nvrtc_path()}") # Path to NVRTC DLL (if available)

> **Note:** CUTLASS is automatic for compatible sizes (16-aligned). Use `PYGPUKIT_NO_TF32=1` for full FP32 precision.

### GEMV Performance (RTX 5090, SM120a)

For LLM decode (M=1), custom GEMV kernels significantly outperform cuBLASLt:

| Model Layer | K | N | cuBLASLt | BF16 GEMV | NVF4 GEMV | Memory |
|-------------|------|-------|----------|-----------|-----------|--------|
| Qwen-7B hidden | 4096 | 4096 | 413us | **97us** | 152us | 73% less |
| Qwen-7B MLP | 4096 | 11008 | 418us | **96us** | 153us | 73% less |
| Qwen-72B hidden | 8192 | 8192 | 799us | 266us | **265us** | 73% less |
| Qwen-72B MLP | 8192 | 29568 | 1603us | **375us** | 454us | 73% less |

| Kernel | Description | Use Case |
|--------|-------------|----------|
| **BF16 GEMV** | Custom BF16 kernel optimized for M=1 | Speed priority |
| **NVF4 GEMV** | 4-bit NVF4 weights with block scaling | Memory priority (73% reduction) |

> **Note:** For large K (8192+), NVF4 matches BF16 speed while using 73% less memory. Ideal for memory-constrained LLM inference.
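
For example, a single decode step is just an M=1 matmul; a minimal sketch, assuming the generic `gpk.matmul` entry point dispatches to the GEMV path for M=1 (as the zero-copy-view commit suggests):

```python
import numpy as np
import pygpukit as gpk

# One decode step: the activation has a single row (M=1)
x = gpk.from_numpy(np.random.randn(1, 4096).astype(np.float32))     # (M=1, K)
w = gpk.from_numpy(np.random.randn(4096, 4096).astype(np.float32))  # (K, N)

y = gpk.matmul(x, w)  # M=1 takes the custom GEMV kernel, not cuBLASLt
```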

### NVF4-BF16 GEMM Performance (RTX 5090, SM120a)

4-bit NVF4 GEMM with BF16 I/O using CUTLASS block-scaled tensor operations:

| Matrix Size | TFLOPS (median) | TFLOPS (max) | Time (ms) |
|-------------|-----------------|--------------|-----------|
| 4096×4096 | 53 | 55 | 2.6 |
| 8192×8192 | 141 | 143 | 7.8 |
| 12288×12288 | 201 | 216 | 18.5 |
| 16384×16384 | **246** | **252** | 35.8 |

> **Note:** GPU-side BF16→NVF4 quantization with unit scaling. No host-device copies. Ideal for memory-bound LLM inference with 4x bandwidth reduction vs BF16.

---

## Installation
@@ -695,6 +810,7 @@ PyGPUkit/
| **v0.2.10** | **Dynamic cuBLASLt loading**, CUDA Graph optimizations, descriptor caching |
| **v0.2.11** | **Batch decode** (6.8x speedup), Decode Strategy framework, Driver API async, Dual CUDA builds, RTX 5090 (SM120) |
| **v0.2.12** | **Advanced audio processing** (ISTFT, Griffin-Lim, HPSS, CQT, pitch detection, time stretch) |
| **v0.2.15** | **FP8 I/O GEMM** (blockwise scaling), Pure NVF4 (446 TFLOPS), New math ops (sin, cos, sqrt, rsqrt, abs, neg, clamp, where, sigmoid, tanh, argmax, min, sum_axis) |

### Planned
