feat: v0.2.16 - SM120 GEMV/GEMM optimization and comprehensive benchmarks#121
Merged
feat: v0.2.16 - SM120 GEMV/GEMM optimization and comprehensive benchmarks#121
Conversation
High-performance Python compiler with GPU support from Exaloop. Potential future collaboration opportunity. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 1 implementation for Issue #110: - Add MoE CUDA kernels (topk, softmax, permutation, gather, scatter) - Add MoELayer Python class with router and expert FFN dispatch - Extend ModelSpec with MoE fields (moe_gate, expert_*_proj, is_moe) - Add MIXTRAL_SPEC for Mixtral 8x7B model detection - Extend TransformerConfig with num_experts, num_experts_per_tok - Add load_mixtral_from_safetensors() loader - Add pybind11 bindings for all MoE ops Tested: All MoE kernels and MoELayer integration tests pass on SM 120a 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Example CLI chat for MoE models:
- Mixtral-Instruct chat template formatting
- Streaming UTF-8 output with byte decoder
- M=1 decode with KV cache
- Auto-detection of MoE models via ModelSpec
Usage:
python examples/chat_cli_moe.py \
--model /path/to/model.safetensors.index.json \
--tokenizer /path/to/tokenizer.json
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Thinking model chat CLI with:
- <think>...</think> block parsing and display
- Streaming output with thinking/answer separation
- Thinking display toggle (/think command)
- Auto-detect model and tokenizer paths
- Recommended params: temp=0.6, top_k=20, top_p=0.95
Usage:
python examples/chat_cli_thinking.py \
--model F:/LLM/Qwen3-4B-Thinking-2507
Tested with Qwen3-4B-Thinking-2507 on RTX 5090
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --cuda-graph flag for reduced kernel launch overhead - Add decode_one_token() helper to dispatch Graph/Non-Graph decode - Display CUDA Graph status in chat UI 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Float8E4M3, Float8E5M2 to Rust Dtype enum - Add FP8 dequantization with block-wise scaling (Python) - Add QWEN3_MOE_SPEC for Qwen3 MoE models - Update detect_model_spec to detect Qwen3-MoE architecture - Support both num_experts and num_local_experts config keys Enables loading FP8 quantized models like Qwen3-30B-A3B-Instruct-FP8. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GPUArray uses _native, not _array. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --cuda-graph flag for reduced kernel launch overhead - Add decode_one_token() helper to dispatch Graph/Non-Graph decode - Display CUDA Graph status in chat UI 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
W8A16 GEMV for FP8 E4M3 quantized LLM weights: - FP8 E4M3 lookup table in constant memory - Block-wise scale factor handling (128x128) - Online dequantization during compute (no pre-dequant) - Memory savings: 31GB FP8 stays at 31GB Components: - native/ops/gemv/gemv_fp8.cuh: FP8 GEMV CUDA kernel - LinearFP8 layer with M=1 GEMV optimization - Python API: gemv_fp8_bf16, fp8_init_lut, fp8_get_sizes - transpose() now supports uint8 for FP8 weights This enables Qwen3-30B-A3B-FP8 inference in 32GB VRAM. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Attention/MLP now accept Linear or LinearFP8 directly - loader.load_linear() returns LinearFP8 for FP8 weights - FP8 weights stay as uint8, no memory-doubling dequant - MLP skips fused gate_up for FP8 (can't concat uint8) - transpose() now supports uint8 for FP8 weight transpose This enables loading Qwen3-30B-A3B-FP8 (31GB) in 32GB VRAM. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update MoELayer to accept LinearFP8 expert weights - Update Attention to skip fused QKV projection for FP8 - Update forward_fixed_cache methods to handle FP8 separately - Update loader to use load_linear for MoE expert weights - Enable Qwen3-30B-A3B-FP8 (31GB) loading without dequantization Test results (Qwen3-30B-A3B-FP8, RTX 5090): - 48 layers, 128 experts/layer - All attention and expert weights loaded as LinearFP8 - Hidden states: 6148ms per token - Logits shape: (1, 151936) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename Linear class to LinearBF16 to match LinearFP8 naming convention - Add backward compatibility alias: Linear = LinearBF16 - Update all type annotations to use LinearBF16 - Export both LinearBF16 and LinearFP8 from __init__.py This makes the naming explicit about weight data type: - LinearBF16: BF16/FP16 weights with BF16 GEMV - LinearFP8: FP8 E4M3 weights with online dequantization 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 9 skills: build, benchmark, lint, typecheck, test, precommit, check-all, chat-test, kernel-dev - Add 5 subagents: kernel-reviewer, perf-analyzer, api-designer, commit-helper, doc-generator - Unify SM 120 -> 120a (required for RTX 5090) - Remove build_cuda13.bat (Git Bash only) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Core principles and project philosophy - Accept/reject criteria for contributions - Architectural invariants (layer model, Rust components) - Performance and safety rules - Development workflow and commit format - Review criteria and automatic rejection rules 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
README.md: - Add CONTRIBUTING.md link and quick start guide - Add .claude/ directory to project structure CLAUDE.md: - Add Claude Code Configuration section - Document 9 skills and 5 subagents 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CLAUDE.md:
- Add MatMul Kernel Structure section
- Path convention: {gemm|gemv}/{input}/{output}/{arch}/{compute}_{suffix}.cu
- Examples for BF16, FP8, NVF4, TF32 kernels
kernel-dev skill:
- Update file locations with new structure
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change benchmark examples from RTX 3090 Ti to RTX 5090 - Update performance targets in skills and agents - Keep RTX 3090 Ti as secondary reference 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Configure 4 MCP servers for enhanced development: - memory: persist benchmark results and decisions across sessions - sqlite: track benchmark history in database - sequential-thinking: structured problem solving for kernel optimization - git: advanced git operations 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Serena provides LSP-based code understanding: - Semantic code retrieval and editing - Support for 30+ languages (Python, C++, Rust, etc.) - IDE-like capabilities for the LLM 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- build.sh now saves logs to .claude/logs/build/
- Log format: build_sm{SM}_cuda{VERSION}_{TIMESTAMP}.log
- Output to both console and log file via tee
- Auto-cleanup keeps last 10 logs
- Add build-log skill for log analysis
- Update .serena/project.yml with initial_prompt
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Reorganize matmul folder: gemm/[input]/[output]/[arch]/<kernel>.cu - Move gemv from ops/gemv/ to ops/matmul/gemv/bf16/bf16/sm120/ - Split heavy .cuh files into .cuh (declarations) + .cu (implementations): - f32_ampere.cuh -> f32_ampere.cuh + f32_ampere.cu - nvf4.cuh -> nvf4.cuh + nvf4_kernels.cu - fp8.cuh -> fp8.cuh + fp8_kernels.cu - Convert FP8 E4M3 LUT from runtime to compile-time initialization - Remove deprecated fp8_init_lut() function Build verified: 18 matmul/gemv functions, 28 source files Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update CLAUDE.md with LLM models storage location - Add usage example for model loading - Update .serena/project.yml initial_prompt 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- chat_cli_moe.py: Add multi-model chat template support - Add --chat-template arg (qwen, mistral, llama2, llama3, chatml) - Auto-detect template from model spec name - Support multiple EOS tokens (</s>, <|im_end|>, <|eot_id|>) - layers.py: Fix MoE expert output collection - GPUArray.__getitem__ returns copy, not view - copy_to to slice was ineffective - Changed to list-based collection with CPU concat - matmul.py: Remove fp8_init_lut native call - LUT is defined as __device__ __constant__ in C++ - Initialized at compile time, no runtime init needed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Add TensorCore GEMM for FP8 weight x BF16 activation (W8A16 format): - New kernel: native/ops/matmul/gemm/fp8/bf16/sm120/w8a16_gemm.cu - Uses mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 - FP8 weights dequantized on-the-fly during shared memory load - Block-wise scaling (128x128 blocks) supported LinearFP8 now uses W8A16 GEMM for M>1 instead of CPU dequantization: - M=1: FP8 GEMV (unchanged) - M>1: W8A16 GEMM (new, more efficient for MoE batches) API additions: - w8a16_gemm_sm120(A, B_fp8, B_scale) -> C - gemv_fp8_bf16_batched(A, B_fp8, B_scale) -> C (Python wrapper) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Python version badge - CUDA 13.x badge - SM architectures badge (80/86/89/90/100/120a) - GitHub stars badge - Downloads badge - Code style (ruff) badge - Update .serena/project.yml with ignored_paths 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add concat_axis0 uint8 kernel for FP8 weight stacking - Add memcpy_device_to_device_offset for efficient GPU memory copy - Implement grouped GEMM kernel for MoE (disabled, needs debugging) - Add grouped_gemm_fp8_bf16 Python wrapper - Prepare MoELayer for grouped GEMM optimization Performance: Prefill 31.8s -> 21.0s, Decode 0.6 -> 1.0 tok/s 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fixed critical bug in grouped GEMM where all rows in a CUDA block used the same expert's weights. Now uses per-row expert IDs for correct expert routing. Changes: - Rewrote grouped_gemm.cu with v2 API using row_expert_ids - Added expand_expert_offsets kernel to convert offsets to row IDs - Added grouped_gemm_fp8_bf16_v2 binding and Python wrapper - Updated MoELayer to use v2 API Performance (Qwen3-30B-A3B MoE, RTX 5090): - Fallback path: ~8.5s prefill - Grouped GEMM v2: ~4-5s prefill (2x faster) - Same output quality (top tokens match) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GEMV FP8 Optimization: - Replace old gemv_fp8_bf16 with optimized B[N,K] layout version - Use vectorized uint4 loads, warp-level reduction, shared memory - Achieve 3-9x speedup over BF16 GEMV - Decode throughput: 3.2 -> 4.2 tok/s (+31%) W8A16 GEMM Optimization (prefill): - Switch to FP8 TensorCore MMA (m16n8k32, 2x throughput vs BF16) - Fast BF16->FP8 quantization via bit manipulation (no frexpf) - Transpose B to [N,K] layout for col-major MMA access - Coalesced global memory loads for B matrix Benchmark results (RTX 5090): - M=512: 102 -> 143 TFLOPS (+40%) - M=1024: 137 -> 175 TFLOPS (+27%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add fp8_cutlass_v2.cu with template-based GEMM kernel - Add bench_fp8_fp8_gemm.py for benchmarking - Fix lambda capture error in ops_bindings.cpp FP8 GEMM Tuning Findings (RTX 5090 SM120): - Only 128x128x128 tile supported (CUTLASS SM120 constraint) - Extended K (256/512) causes "Stages >= 2" shared memory overflow - M/N < 128 causes "Cooperative kernel >= 128" error - Ping-pong schedule NOT supported for FP8 blockwise scaling - Realistic FP8 ceiling: ~500 TFLOPS (not 1200+ which is NVF4/sparse) Benchmark Results: - M=128: 47 TFLOPS (~9% of ceiling) - M=1024: 134 TFLOPS (~27% of ceiling) - M=4096: 202 TFLOPS (~40% of ceiling) - M=8192: 226 TFLOPS (~45% of ceiling) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Key optimizations: - Use fast FP8xFP8 GEMM (239 TFLOPS) internally with type conversions - Combined temp buffer allocation reduces cudaMalloc overhead - Cached thread-local scale buffers avoid repeated allocations - Use D_fp8 for both C and D (beta=0 optimization) Benchmark results (RTX 5090, SM120): - M=4096: 181.2 TFLOPS (2.02x vs blockwise) - M=8192: 212.1 TFLOPS (2.00x vs blockwise) - Peak efficiency: 84.8% of pure FP8xFP8 ceiling (239.5 TFLOPS) Added w8a16_optimized_sm120 Python binding for the optimized path. Overhead reduced from 31.8% (2043us) to 15.2% (720us). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SM120 (Blackwell GeForce) does NOT have native Int8 TensorCore support. Only SM100/SM101/SM110 have tcgen05.mma.kind::i8. This implementation uses FP8 TensorCore as an approximation: 1. Convert Int8 inputs to FP8 E4M3 2. Run fast FP8xFP8 GEMM with BF16 output (avoids saturation) 3. Convert BF16 to Int32/Int8 Benchmark results (RTX 5090, M=8192, K=4096, N=14336): - Int8->Int32: 135.2 TFLOPS - Int8->Int8: 140.2 TFLOPS - Correctness: PASS (3.5% precision loss from FP8 approximation) API: - native.int8_gemm_available() -> bool - native.int8_gemm_int32_sm120(A, B, D, scale_A, scale_B, descale_D) - native.int8_gemm_int8_sm120(A, B, D, scale_A, scale_B, descale_D) :robot: Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SM120 (RTX 5090) has no native signed Int4 TensorCore support. This implementation unpacks Int4 to Int8, then uses FP8 TensorCore for computation. Pipeline: Int4 -> Int8 (unpack) -> FP8 -> TensorCore -> BF16 -> Int32/Int8 Benchmark results (RTX 5090, LLM shapes): - M=128: 6.4 TFLOPS - M=1024: 41.7 TFLOPS - M=8192: 122.4 TFLOPS Correctness: - Small values (-2 to 2): 0.00% error - Full Int4 range (-8 to 7): 0.11% error Files: - native/ops/matmul/gemm/int4/int4/sm120/int4_via_int8.cu - Python bindings for int4_gemm_int32_sm120 and int4_gemm_int8_sm120 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Warp-level reduction Int4 GEMV for single-token decode in LLM inference. Uses shared memory for activation vector, warp-level shuffle reduction. Benchmark results (RTX 5090, LLM shapes): - K=4096, N=14336: 2.51 TFLOPS (46.7 us) - K=8192, N=28672: 4.81 TFLOPS (97.7 us) Correctness: 0% error (exact integer math) Note: GEMV is memory-bandwidth bound, TFLOPS is lower than GEMM. Vectorized kernel has a bug (disabled for now). Files: - native/ops/matmul/gemv/int4/int4/sm120/int4_gemv.cuh - native/ops/matmul/gemv/int4/int4/sm120/int4_gemv.cu - Python bindings for int4_gemv_int32_sm120 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SM120 (RTX 5090) does NOT have native Int8 TensorCore MMA instructions. This kernel uses CUDA cores with vectorized dp4a (dot product of 4 Int8 values). Benchmark results (RTX 5090): - M=128: 32.31 TFLOPS - M=512: 40.30 TFLOPS - M=4096: 43.51 TFLOPS - M=8192: 42.85 TFLOPS Correctness: PASS (0% error - exact Int32 accumulation) dp4a: D = A.x*B.x + A.y*B.y + A.z*B.z + A.w*B.w + C where A, B are int8x4 packed in uint32, C and D are int32 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Refactor W8A16 GEMM kernel (simplify code structure) - Add W8A16 CUTLASS benchmark and correctness tests - Fix FP8 and Int8 GEMM benchmark imports - Minor LLM decode/layers cleanup 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
RTX 5090 (SM120a, CUDA 13.1) performance: Standard Precision (8192x8192): - FP32: 80 TFLOPS - TF32: 87 TFLOPS - FP16: 170 TFLOPS - BF16: 173 TFLOPS Quantized GEMM (M=8192, K=4096, N=14336): - FP8xFP8: 217 TFLOPS - W8A16: 50 TFLOPS - Int8 (via FP8): 142 TFLOPS - Int8 (dp4a): 44 TFLOPS (exact) - Int4 (via Int8): 121 TFLOPS NVF4 GEMM: - 8192x8192: 261 TFLOPS - 16384x16384: 398 TFLOPS 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GEMV Performance (RTX 5090, SM120a, M=1): | Layer | BF16 | FP8 | NVF4 | Int4 | |-------|------|-----|------|------| | Qwen-7B hidden (4096x4096) | 98 us | 32 us | 140 us | 31 us | | Qwen-7B MLP up (4096x14336) | 154 us | 44 us | 141 us | 47 us | | Qwen-7B MLP down (14336x4096) | 432 us | 47 us | 404 us | 58 us | | Qwen-72B hidden (8192x8192) | 262 us | 49 us | 252 us | 51 us | | Qwen-72B MLP up (8192x29568) | 356 us | 179 us | 436 us | 112 us | | Qwen-72B MLP down (29568x8192) | 863 us | - | 1393 us | 129 us | Key findings: - FP8 GEMV: 3-9x faster than BF16, 50% memory - Int4 GEMV: Best for very large K (29568+) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add new GEMV kernel that uses FP8 for both activation and weights. Key advantage over W8A16: shared memory is K bytes instead of 2*K, enabling support for K up to 48K without overflow. Benchmark (RTX 5090): | Layer | K | N | W8A16 | FP8/FP8/FP8 | |--------------------|-------|-------|--------|-------------| | Qwen-7B hidden | 4096 | 4096 | 29 us | 31 us | | Qwen-7B MLP up | 4096 | 14336 | 44 us | 43 us | | Qwen-7B MLP down | 14336 | 4096 | 48 us | 49 us | | Qwen-72B hidden | 8192 | 8192 | 46 us | 47 us | | Qwen-72B MLP up | 8192 | 29568 | 178 us | 178 us | | Qwen-72B MLP down | 29568 | 8192 | FAIL | 223 us | W8A16 fails at K=29568 (smem=59KB>48KB), FP8/FP8 handles it. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add optimized kernel variant with: - 128-bit vector loads (uint4, 16 FP8 values at once) - __ldg() for cached global memory reads - 4 independent accumulators to hide FMA latency - Aggressive loop unrolling Benchmark (RTX 5090): | Layer | K | N | Before | After | Speedup | |--------------------|-------|-------|--------|--------|---------| | Qwen-7B hidden | 4096 | 4096 | 31 us | 30 us | 1.02x | | Qwen-7B MLP up | 4096 | 14336 | 44 us | 44 us | 0.99x | | Qwen-7B MLP down | 14336 | 4096 | 51 us | 48 us | 1.06x | | Qwen-72B hidden | 8192 | 8192 | 50 us | 47 us | 1.05x | | Qwen-72B MLP up | 8192 | 29568 | 179 us | 178 us | 1.00x | | Qwen-72B MLP down | 29568 | 8192 | 223 us | 189 us | 1.17x | 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Both activation (A) and weight (B) are NVF4 quantized, reducing shared memory usage from K*2 bytes (W4A16) to K/2 bytes. Supports K up to ~90K without shared memory overflow. Benchmark (RTX 5090): - K=4096, N=4096: 65 us (1.7x faster than W4A16) - K=29568, N=8192: 959 us (1.5x faster than W4A16) Note: Still slower than FP8/FP8 due to column-major B layout causing non-coalesced memory access. Row-major optimization TBD. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Change B matrix layout from column-major [K/2, N] to row-major [N, K/2] for coalesced memory access. Threads now read contiguous K elements instead of strided N elements. Changes: - Add quantize_bf16_to_nvf4_rowmajor() for row-major output - Update gemv_nvf4_pure_kernel to use B_row pointer indexing - Update bindings to expect [N, K/2] shape Benchmark (RTX 5090, K=3584, N=18944): | Layout | Time | Bandwidth | |--------------|--------|-----------| | Column-major | 908 us | 40 GB/s | | Row-major | 304 us | 119 GB/s | | Speedup | 3.0x | | Comparison with other kernels: - NVF4/NVF4 row-major: 304 us - W4A16: 119 us (2.5x faster) - FP8/FP8: 19 us (15x faster) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change kernel design from warp-level reduction to W4A16-style: - 1 thread handles 1 output element (no warp reduction) - Pre-scaled LUT in registers for B dequantization - Reduced block count: N/256 instead of N/8 Benchmark (RTX 5090, K=3584, N=18944): | Kernel | Time | Bandwidth | Speedup | |---------------|--------|-----------|---------| | Old (warp) | 304 us | 119 GB/s | 1.00x | | New (1t=1out) | 219 us | 165 GB/s | 1.39x | | W4A16 | 104 us | - | 2.93x | NVF4/NVF4 is slower than W4A16 due to 2x scale decoding and 2x LUT lookups per element. Trade-off: 4x smaller A memory. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document why W4A16 is faster than NVF4/NVF4 despite both using 4-bit: - W4A16: 1x dequant (B only), A is BF16 (free conversion) - NVF4/NVF4: 2x dequant (A + B), 2x scale loads, 2x LUT lookups Benchmark (RTX 5090, K=3584, N=18944): - W4A16: 104 us - NVF4/NVF4: 219 us (2.1x slower) This follows PyGPUkit's "explicit over implicit" philosophy. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Benchmark all GEMV kernels on RTX 5090 (SM120a): - BF16: baseline (119 us) - FP8/FP8: 6.2x faster (19 us) - best for SM120 - NVF4/BF16 (W4A16): 1.12x faster (106 us) - NVF4/NVF4: 0.55x (217 us) - memory priority Added tests/bench_all_gemv.py for reproducible benchmarks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cleanup: Remove redundant/slower kernel implementations Removed files: - fp8_kernels.cu: Basic FP8 GEMV ([K,N] layout) - replaced by fp8_opt_kernels.cu ([N,K] layout) - int8_via_fp8.cu: Int8 GEMM via FP8 approximation - not exposed in Python bindings Updated: - CMakeLists.txt: Remove deleted files from build - ops_bindings.cpp: Remove bindings for deleted functions - nvf4.cu: Remove C API for basic FP8 GEMV - fp8.cuh: Keep only FP8_E4M3_LUT and helpers (used by optimized kernel) Retained optimized versions: - fp8_opt_kernels.cu: FP8 GEMV with [N,K] layout (6-22x faster than BF16) - int8_native.cu: Native Int8 GEMM using dp4a (exact computation) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests require GPU native module which is not available in CI. Added pytest skipif marker to skip tests gracefully. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
memory_kernels.cuh uses uint8_t without including cstdint, causing CI build failure on Linux. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
GEMV Benchmark (RTX 5090, SM120a)
New Files
native/ops/matmul/gemv/fp8/fp8/- FP8/FP8 GEMVnative/ops/matmul/gemv/nvf4/nvf4/- NVF4/NVF4 GEMVnative/ops/matmul/gemm/int4/- Int4 GEMMnative/ops/matmul/gemm/int8/- Int8 GEMMnative/ops/moe/- MoE kernelstests/bench_all_gemv.py- Comprehensive GEMV benchmarkTest plan
🤖 Generated with Claude Code