fix(w8a16): correct MMA A-fragment register mapping + MoE improvements by m96-chan · Pull Request #125 · m96-chan/PyGPUkit

m96-chan · 2025-12-29T21:14:46Z

Summary

fix(w8a16): Correct MMA A-fragment register mapping for m16n8k16 instruction
- Fixed bit extraction bug: used (lane % 4) when should use (lane / 4) % 4 for upper 2 lanes
- MoE model now produces correct output
perf(gemv): Add optimized BF16 GEMV kernel with B[N,K] layout
- Warp-level reduction, shared memory, vectorized loads
- ~90% bandwidth utilization
feat(gemv): Add accurate FP8/FP8 GEMV kernel with <0.5% error
test(moe): Add MoE inference test for various prompt lengths

Test plan

MoE model produces non-garbage output
BF16 GEMV achieves ~90% bandwidth
FP8 GEMV accurate kernel maintains <0.5% error
Pre-commit checks pass (ruff lint, mypy)

🤖 Generated with Claude Code

Benchmark results (RTX 5090, M x 4096 x 14336): - M=1024: 162.7 TFLOPS (v5) vs 135.2 TFLOPS (v2) = +20% improvement - M=8192: 254.5 TFLOPS (v5) vs 253.0 TFLOPS (v2) = +0.6% Key optimization: Cache scale factor buffers to avoid per-call allocation overhead. Uses same CUTLASS configuration as v2 but with persistent buffers. New kernels: - v5: 128x128x128 tile with cached scales (best for small/large M) - v6: 128x256x64 tile with cached scales - v7: 256x128x64 tile with cached scales - v8: 128x128x64 tile with cached scales 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements accurate W8A8 GEMV kernel targeting <0.5% error: - Smaller scale blocks: 32 elements (vs 128 in fast version) - Kahan summation for reduced accumulation error - Double precision accumulator option Files added: - fp8_accurate.cuh: Kernel definitions with KahanAccumulator - fp8_accurate.cu: Launch functions - test_fp8_accurate_gemv.py: Accuracy verification tests Benchmark (K=4096, N=4096, scale=1.0): - Fast kernel: 0.17% error - Accurate kernel: 0.17% error - Both pass <0.5% target Note: Real accuracy improvement requires per-block quantization in actual LLM inference where scale factors vary per block. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Rewrote fp8_accurate kernel to use same optimized structure as fast version: - 128-bit vector loads (uint4) for 16 FP8 values at once - 4 independent accumulators to hide FMA latency - __ldg() for cached global memory reads - Removed Kahan summation (overhead not justified) - Removed double precision accumulator Only difference from fast version: SCALE_BLOCK_SIZE=32 (vs 128) Results (RTX 5090): ┌──────────┬──────────┬────────────┬────────────────┬──────────┐ │ K │ N │ Fast (us) │ Accurate (us) │ Slowdown │ ├──────────┼──────────┼────────────┼────────────────┼──────────┤ │ 4096 │ 4096 │ 28.7 │ 27.3 │ 0.95x │ │ 4096 │ 14336 │ 42.3 │ 42.0 │ 0.99x │ │ 14336 │ 4096 │ 46.7 │ 46.3 │ 0.99x │ └──────────┴──────────┴────────────┴────────────────┴──────────┘ Accuracy: 0.17% relative error (target: <0.5%) ✓ Slowdown: 0.95-0.99x (target: 1.5-2x) ✓ Previous version was 18-37x slower due to inefficient loop structure. Issue #123 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- W8A16 kernel expects [N/128, K/128] blockwise scales, not [N] per-row - Fixed check_rel_error.py to use correct scale format - Updated README: W8A16 error ~12% -> ~6% (with correct scales) Measured errors (vs FP32): - BF16: 0.63% - W8A16: 5.64% (was 12% with wrong scales) - W8A8: 9.15% 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- New bf16_opt.cuh/cu with warp-level K reduction - B[N,K] row-major layout for coalesced memory access - Shared memory for activation broadcast - 128-bit vectorized loads (8 BF16 per load) - 4 FP32 accumulators to hide FMA latency Benchmark (RTX 5090, SM120): K=4096, N=4096: 64.8us -> 11.7us (5.54x speedup) K=4096, N=14336: 125.7us -> 71.4us (1.76x speedup) K=14336, N=4096: 411.8us -> 74.0us (5.57x speedup) Correctness: ~0.3% error vs FP32 reference 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Use gemv_bf16_opt_sm120 (B[N,K] layout) for M=1 decode - Fallback to old gemv_bf16 (B[K,N]) if SM < 80 - No transpose needed: use self.weight directly Performance: 5.5x faster GEMV for LLM inference 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fixed the W8A16 GEMM scalar fallback kernel returning half values (K>=32) or zeros (K<32) due to uninitialized FP8->F32 LUT. Root cause: The FP8 LUT was defined as __device__ __constant__ with initializer in a header file (fp8.cuh). When included in multiple .cu files, CUDA linker didn't properly merge symbols, causing W8A16 GEMM to read uninitialized values. Fix: Use runtime initialization (cudaMemcpyToSymbol) like grouped_gemm, with a local LUT copy in w8a16_gemm.cu. Changes: - Added local g_fp8_lut[256] in w8a16_gemm.cu - Added pygpukit_w8a16_gemm_init_lut() for runtime initialization - Added Python binding for init function - Updated matmul.py to call w8a16_gemm_init_lut() before W8A16 GEMM 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Temporary workaround to debug MoE model output issues. System prompt will be re-enabled after root cause is identified. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The W8A16 TensorCore kernel was producing garbage output for M >= 16 due to incorrect A fragment register mapping. MMA m16n8k16 expects: reg[0] = rows 0-7, cols 0-1 reg[1] = rows 8-15, cols 0-1 reg[2] = rows 0-7, cols 8-9 reg[3] = rows 8-15, cols 8-9 The bug was swapping registers 1 and 2: - OLD: row = groupID + 8 * (p >> 1), col = tid*2 + (p & 1) * 8 - NEW: row = groupID + 8 * (p & 1), col = tid*2 + (p >> 1) * 8 Verified with Qwen3-30B-A3B-Instruct-2507-FP8 MoE model. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Tests FP8 MoE model (Qwen3-30B-A3B) with prompts of different token counts to verify W8A16 GEMM works correctly for both M < 16 (scalar) and M >= 16 (TensorCore) paths. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- grouped_gemm.cu: improved kernel configuration - moe_kernels.cuh, topk_kernels.cuh: minor fixes - layers.py, model.py: MoE inference improvements - chat_cli_moe.py: minor update - test_fp8_accurate_gemv.py: test improvements 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add GEMV bandwidth utilization table (BF16: 98-101% peak) - Add v0.2.18 What's New section - Update roadmap with v0.2.18 BF16 GEMV with B[N,K] layout achieves near-peak bandwidth: - 2048x8192: 1763 GB/s (98%) - 4096x14336: 1810 GB/s (101%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add pytest skip markers for: - tokenizers package (not in CI deps) - Model files at F:/LLM/ (local only) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add pytest skip marker for native module availability. These tests require CUDA hardware. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Re-run comprehensive GEMV benchmarks on RTX 5090: - GEMV Latency by Layer: Updated all kernel timings - Comprehensive GEMV Benchmark: Updated gate_proj results - Performance by Layer Type: Updated speedup ratios Key improvements: - W8A16 now as fast as W8A8 for most sizes (optimized kernel) - FP8/FP8 (W8A8) achieves 6-24x speedup over BF16 - Int4 excels at very large K dimensions (29568+) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Use optimized BF16 GEMV (B[N,K] layout) as standard - BF16 now matches W8A8 speed for small-to-medium sizes (31us vs 31us @ 4096x4096) - Update all GEMV benchmark tables with fresh measurements - Add all 6 kernel columns: BF16, W8A16, W8A8, W4A16, W4A4, Int4 - Match main branch README format exactly Key results (RTX 5090): - BF16: 31-324us (2x faster with B[N,K] layout) - W8A8: 31-204us (fastest for most sizes) - Int4: 33-125us (fastest for large K dimensions) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

BREAKING CHANGE: gemv_bf16() now expects B[N,K] layout instead of B[K,N] - Use optimized gemv_bf16_opt_sm120 kernel with warp-level reduction - B[N,K] layout provides better memory coalescing (2x faster) - Remove alpha/beta parameters (not supported by optimized kernel) - Update docstring to reflect new layout Migration: If you have weights in [K,N] format, transpose them: b_new = b.T # [K,N] -> [N,K] 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove pygpukit_gemv_bf16() and pygpukit_gemv_bf16_auto() from nvf4.cu - Remove extern declaration from ops_bindings.cpp - Remove unnecessary include of bf16_cutlass.cuh from nvf4.cu The optimized gemv_bf16_opt_sm120 with B[N,K] layout is now the only BF16 GEMV kernel exposed to Python. The old kernel with B[K,N] layout is retained in bf16_cutlass.cuh for internal C++ tests only. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan and others added 21 commits December 28, 2025 21:58

Merge main: fix GEMV error rates and add rel-error-check skill

e88f282

style: format test_moe_inference.py with ruff

f105ab2

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

test: skip MoE inference test in CI

a3043d6

Add pytest skip markers for: - tokenizers package (not in CI deps) - Model files at F:/LLM/ (local only) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

test: skip FP8 accurate GEMV tests in CI

8746f98

Add pytest skip marker for native module availability. These tests require CUDA hardware. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: use correct function name has_native_module

801b900

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan merged commit a456151 into main Dec 30, 2025
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(w8a16): correct MMA A-fragment register mapping + MoE improvements#125

fix(w8a16): correct MMA A-fragment register mapping + MoE improvements#125
m96-chan merged 21 commits intomainfrom
feature/v0.2.18

m96-chan commented Dec 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

m96-chan commented Dec 29, 2025

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant