fix(w8a16): correct MMA A-fragment register mapping + MoE improvements #125
Merged
Conversation
Benchmark results (RTX 5090, M x 4096 x 14336):
- M=1024: 162.7 TFLOPS (v5) vs 135.2 TFLOPS (v2) = +20% improvement
- M=8192: 254.5 TFLOPS (v5) vs 253.0 TFLOPS (v2) = +0.6%

Key optimization: cache the scale-factor buffers to avoid per-call allocation overhead. Uses the same CUTLASS configuration as v2, but with persistent buffers (see the caching sketch below).

New kernels:
- v5: 128x128x128 tile with cached scales (best for small/large M)
- v6: 128x256x64 tile with cached scales
- v7: 256x128x64 tile with cached scales
- v8: 128x128x64 tile with cached scales
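The caching idea is roughly the following; a minimal host-side sketch assuming a simple size-keyed cache (the function and map names are illustrative, not the actual implementation):

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

// Size-keyed cache of device-side scale buffers: allocate once per shape,
// then reuse on every later call instead of paying cudaMalloc each time.
static std::map<size_t, float*> g_scale_cache;

float* get_cached_scale_buffer(size_t bytes) {
    auto it = g_scale_cache.find(bytes);
    if (it != g_scale_cache.end()) return it->second;  // reuse persistent buffer
    float* buf = nullptr;
    cudaMalloc(&buf, bytes);                           // paid only on first use
    g_scale_cache.emplace(bytes, buf);
    return buf;
}
```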
Implements an accurate W8A8 GEMV kernel targeting <0.5% error:
- Smaller scale blocks: 32 elements (vs 128 in the fast version)
- Kahan summation for reduced accumulation error (sketch below)
- Double-precision accumulator option

Files added:
- fp8_accurate.cuh: kernel definitions with KahanAccumulator
- fp8_accurate.cu: launch functions
- test_fp8_accurate_gemv.py: accuracy verification tests

Benchmark (K=4096, N=4096, scale=1.0):
- Fast kernel: 0.17% error
- Accurate kernel: 0.17% error
- Both pass the <0.5% target

Note: the real accuracy improvement requires per-block quantization in actual LLM inference, where scale factors vary per block.
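For reference, compensated summation as described above looks roughly like this; a minimal sketch of the idea behind KahanAccumulator, not the file's exact code:

```cuda
// Kahan (compensated) summation: track the low-order bits lost by each
// float addition in a correction term c and feed them back the next step.
struct KahanAccumulator {
    float sum = 0.0f;
    float c   = 0.0f;  // running compensation for lost low-order bits

    __device__ __forceinline__ void add(float x) {
        float y = x - c;    // re-apply previously lost bits
        float t = sum + y;  // low bits of y may be rounded away here
        c = (t - sum) - y;  // recover exactly what was rounded away
        sum = t;
    }
};
```

Note that this only works if the compiler is not allowed to reassociate the arithmetic; fast-math style flags would optimize the correction term away.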
Rewrote the fp8_accurate kernel to use the same optimized structure as the fast version:
- 128-bit vector loads (uint4) for 16 FP8 values at once
- 4 independent accumulators to hide FMA latency
- __ldg() for cached global memory reads
- Removed Kahan summation (overhead not justified)
- Removed the double-precision accumulator

The only difference from the fast version is SCALE_BLOCK_SIZE=32 (vs 128). A sketch of the shared loop structure follows this message.

Results (RTX 5090):

| K     | N     | Fast (us) | Accurate (us) | Slowdown |
|-------|-------|-----------|---------------|----------|
| 4096  | 4096  | 28.7      | 27.3          | 0.95x    |
| 4096  | 14336 | 42.3      | 42.0          | 0.99x    |
| 14336 | 4096  | 46.7      | 46.3          | 0.99x    |

- Accuracy: 0.17% relative error (target: <0.5%) ✓
- Slowdown: 0.95-0.99x (target: 1.5-2x) ✓

The previous version was 18-37x slower due to an inefficient loop structure. Issue #123.
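An illustrative sketch of that loop structure, under assumed names (dot_fp8_row, g_fp8_lut for the FP8 decode table, as elsewhere in the repo) and with the per-block scale application omitted:

```cuda
#include <cstdint>

__device__ __constant__ float g_fp8_lut[256];  // FP8 -> F32 decode table

// Per-thread partial dot product over one FP8 weight row. Each uint4 load
// pulls 16 FP8 bytes; four accumulator chains keep the FMA pipe busy.
// Assumes b_row is 16-byte aligned and K is a multiple of 16.
__device__ float dot_fp8_row(const uint8_t* b_row, const float* a, int K) {
    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
    const uint4* b_vec = reinterpret_cast<const uint4*>(b_row);
    for (int k = threadIdx.x; k < K / 16; k += blockDim.x) {
        uint4 v = __ldg(&b_vec[k]);  // 128-bit read through the RO cache
        const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
#pragma unroll
        for (int i = 0; i < 4; ++i) {
            acc0 += g_fp8_lut[p[4 * i + 0]] * a[16 * k + 4 * i + 0];
            acc1 += g_fp8_lut[p[4 * i + 1]] * a[16 * k + 4 * i + 1];
            acc2 += g_fp8_lut[p[4 * i + 2]] * a[16 * k + 4 * i + 2];
            acc3 += g_fp8_lut[p[4 * i + 3]] * a[16 * k + 4 * i + 3];
        }
    }
    return acc0 + acc1 + acc2 + acc3;  // combined across the warp afterwards
}
```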
- The W8A16 kernel expects [N/128, K/128] blockwise scales, not [N] per-row scales (indexing sketch below)
- Fixed check_rel_error.py to use the correct scale format
- Updated README: W8A16 error ~12% -> ~6% (with correct scales)

Measured errors (vs FP32):
- BF16: 0.63%
- W8A16: 5.64% (was 12% with wrong scales)
- W8A8: 9.15%
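Concretely, with the [N/128, K/128] layout a weight element (n, k) picks up its scale like this; a hypothetical helper just to make the indexing explicit:

```cuda
// One FP32 scale per 128x128 tile of B, row-major over the tile grid.
__device__ __forceinline__ float block_scale(const float* scales,
                                             int n, int k, int K) {
    return scales[(n / 128) * (K / 128) + (k / 128)];
}
```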
- New bf16_opt.cuh/cu with warp-level K reduction (shuffle sketch below)
- B[N,K] row-major layout for coalesced memory access
- Shared memory for activation broadcast
- 128-bit vectorized loads (8 BF16 values per load)
- 4 FP32 accumulators to hide FMA latency

Benchmark (RTX 5090, SM120):
- K=4096, N=4096: 64.8us -> 11.7us (5.54x speedup)
- K=4096, N=14336: 125.7us -> 71.4us (1.76x speedup)
- K=14336, N=4096: 411.8us -> 74.0us (5.57x speedup)

Correctness: ~0.3% error vs FP32 reference
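The warp-level K reduction is the standard shuffle tree: lanes stride over K accumulating partials, then five shuffle steps collapse the 32 per-lane sums. A minimal sketch:

```cuda
// Combine 32 per-lane partial sums; lane 0 ends up holding the row's result.
__device__ __forceinline__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}
```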
- Use gemv_bf16_opt_sm120 (B[N,K] layout) for M=1 decode
- Fall back to the old gemv_bf16 (B[K,N]) if SM < 80
- No transpose needed: use self.weight directly

Performance: 5.5x faster GEMV for LLM inference
Fixed the W8A16 GEMM scalar fallback kernel returning half values (K>=32) or zeros (K<32) due to an uninitialized FP8->F32 LUT.

Root cause: the FP8 LUT was defined as __device__ __constant__ with an initializer in a header file (fp8.cuh). When the header was included in multiple .cu files, the CUDA linker didn't properly merge the symbols, so the W8A16 GEMM read uninitialized values.

Fix: use runtime initialization (cudaMemcpyToSymbol), like grouped_gemm, with a local LUT copy in w8a16_gemm.cu. A sketch of the pattern follows this message.

Changes:
- Added a local g_fp8_lut[256] in w8a16_gemm.cu
- Added pygpukit_w8a16_gemm_init_lut() for runtime initialization
- Added a Python binding for the init function
- Updated matmul.py to call w8a16_gemm_init_lut() before W8A16 GEMM
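A minimal sketch of the fix, assuming the standard OCP E4M3 encoding; the decoder below is illustrative, not the repo's exact host code:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cmath>

// LUT now lives in this translation unit only, with no header initializer,
// and is filled once at runtime before the first W8A16 GEMM call.
__device__ __constant__ float g_fp8_lut[256];

// Host-side decode of one FP8 byte (assumes E4M3: 1 sign, 4 exp, 3 mantissa,
// bias 7; exp=15 with mantissa=7 is NaN, there is no infinity).
static float fp8_e4m3_to_float(uint8_t v) {
    int sign = (v >> 7) & 1;
    int exp  = (v >> 3) & 0xF;
    int man  = v & 0x7;
    float mag;
    if (exp == 0)                   mag = (man / 8.0f) * exp2f(1 - 7);   // subnormal
    else if (exp == 15 && man == 7) mag = NAN;
    else                            mag = (1.0f + man / 8.0f) * exp2f(exp - 7);
    return sign ? -mag : mag;
}

void pygpukit_w8a16_gemm_init_lut() {
    float host_lut[256];
    for (int i = 0; i < 256; ++i)
        host_lut[i] = fp8_e4m3_to_float(static_cast<uint8_t>(i));
    cudaMemcpyToSymbol(g_fp8_lut, host_lut, sizeof(host_lut));
}
```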
Temporary workaround to debug MoE model output issues. The system prompt will be re-enabled after the root cause is identified.
The W8A16 TensorCore kernel was producing garbage output for M >= 16 due to an incorrect A-fragment register mapping.

MMA m16n8k16 expects:
- reg[0] = rows 0-7, cols 0-1
- reg[1] = rows 8-15, cols 0-1
- reg[2] = rows 0-7, cols 8-9
- reg[3] = rows 8-15, cols 8-9

The bug was swapping registers 1 and 2:
- OLD: row = groupID + 8 * (p >> 1), col = tid*2 + (p & 1) * 8
- NEW: row = groupID + 8 * (p & 1), col = tid*2 + (p >> 1) * 8

Verified with the Qwen3-30B-A3B-Instruct-2507-FP8 MoE model. The corrected addressing is sketched below.
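For reference, the corrected mapping in context; a minimal sketch of loading the four A registers from a 16x16 half tile in shared memory (names like smem_a and lda are illustrative, and lda is assumed even so the 32-bit reads stay aligned):

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Per-lane A-fragment load for mma.sync m16n8k16 (f16). Each of the four
// 32-bit registers packs two adjacent halves. With the corrected mapping,
// p & 1 selects the row half and p >> 1 the column half, so across the
// warp: reg 0 -> rows 0-7 / cols 0-7, reg 1 -> rows 8-15 / cols 0-7,
// reg 2 -> rows 0-7 / cols 8-15, reg 3 -> rows 8-15 / cols 8-15.
__device__ void load_a_frag(uint32_t a_frag[4], const __half* smem_a, int lda) {
    const int lane    = threadIdx.x & 31;
    const int groupID = lane >> 2;  // row group 0-7
    const int tid     = lane & 3;   // column pair within the group
#pragma unroll
    for (int p = 0; p < 4; ++p) {
        int row = groupID + 8 * (p & 1);   // was: 8 * (p >> 1)  (the bug)
        int col = tid * 2 + (p >> 1) * 8;  // was: (p & 1) * 8
        a_frag[p] = *reinterpret_cast<const uint32_t*>(&smem_a[row * lda + col]);
    }
}
```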
Tests the FP8 MoE model (Qwen3-30B-A3B) with prompts of different token counts to verify that W8A16 GEMM works correctly for both the M < 16 (scalar) and M >= 16 (TensorCore) paths.
- grouped_gemm.cu: improved kernel configuration
- moe_kernels.cuh, topk_kernels.cuh: minor fixes
- layers.py, model.py: MoE inference improvements
- chat_cli_moe.py: minor update
- test_fp8_accurate_gemv.py: test improvements
- Add GEMV bandwidth utilization table (BF16: 98-101% of peak)
- Add v0.2.18 What's New section
- Update roadmap with v0.2.18

BF16 GEMV with B[N,K] layout achieves near-peak bandwidth:
- 2048x8192: 1763 GB/s (98%)
- 4096x14336: 1810 GB/s (101%)
Add pytest skip markers for:
- the tokenizers package (not in CI deps)
- model files at F:/LLM/ (local only)
Add a pytest skip marker for native module availability. These tests require CUDA hardware.
Re-run comprehensive GEMV benchmarks on RTX 5090:
- GEMV Latency by Layer: updated all kernel timings
- Comprehensive GEMV Benchmark: updated gate_proj results
- Performance by Layer Type: updated speedup ratios

Key improvements:
- W8A16 is now as fast as W8A8 for most sizes (optimized kernel)
- FP8/FP8 (W8A8) achieves a 6-24x speedup over BF16
- Int4 excels at very large K dimensions (29568+)
- Use the optimized BF16 GEMV (B[N,K] layout) as the standard path
- BF16 now matches W8A8 speed for small-to-medium sizes (31us vs 31us @ 4096x4096)
- Update all GEMV benchmark tables with fresh measurements
- Add all 6 kernel columns: BF16, W8A16, W8A8, W4A16, W4A4, Int4
- Match the main branch README format exactly

Key results (RTX 5090):
- BF16: 31-324us (2x faster with B[N,K] layout)
- W8A8: 31-204us (fastest for most sizes)
- Int4: 33-125us (fastest for large K dimensions)
BREAKING CHANGE: gemv_bf16() now expects B[N,K] layout instead of B[K,N].

- Use the optimized gemv_bf16_opt_sm120 kernel with warp-level reduction
- B[N,K] layout provides better memory coalescing (2x faster)
- Remove the alpha/beta parameters (not supported by the optimized kernel)
- Update the docstring to reflect the new layout

Migration: if you have weights in [K,N] format, transpose them:

    b_new = b.T  # [K,N] -> [N,K]
- Remove pygpukit_gemv_bf16() and pygpukit_gemv_bf16_auto() from nvf4.cu
- Remove the extern declaration from ops_bindings.cpp
- Remove the unnecessary include of bf16_cutlass.cuh from nvf4.cu

The optimized gemv_bf16_opt_sm120 with B[N,K] layout is now the only BF16 GEMV kernel exposed to Python. The old kernel with B[K,N] layout is retained in bf16_cutlass.cuh for internal C++ tests only.
Summary
The kernel used `(lane % 4)` when it should use `(lane / 4) % 4` for the upper 2 lanes.

Test plan