

@mikepapadim mikepapadim commented Dec 3, 2025

Summary

Implements fused dequantize-and-compute patterns for quantized matrix-vector operations,
eliminating intermediate memory round-trips during inference.

Changes

  • Fused Dequantization: Dequantize weights directly in registers before compute,
    avoiding the previous dequantize → store → load → compute pipeline
  • Optimized SGEMV Kernels: Improved memory coalescing and compute utilization
    for the memory-bound decode phase
  • SiLU-GLU Fusion: Combined activation and gating into a single kernel pass

Benchmarks (Llama 3.2 1B FP16)

GPU         Before      After       Speedup
RTX 3070    52 tok/s    62 tok/s    +19%
RTX 4090    66 tok/s    86 tok/s    +30%

Why This Works

Single-token generation (the decode phase) is dominated by matrix-vector
operations, which are memory-bandwidth bound. Fusing dequantization with
compute hides the dequantization overhead by keeping the dequantized weights
in registers rather than writing them back to memory between operations.
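A minimal plain-Java sketch of the two pipelines follows. It is not the PR's kernel code. The layout is assumed for illustration: a Q8_0-style format with one scale per block of 32 signed 8-bit values, scales held as `float` rather than FP16, and `cols` a multiple of 32. The class and method names are hypothetical. A local variable stands in for a GPU register.

```java
// Illustrative sketch only: plain Java on the CPU, not the PR's TornadoVM kernels.
// Assumed layout: one float scale per block of 32 int8 weights, row-major matrix.
final class FusedDequantSketch {
    static final int BLOCK = 32; // values per quantization block (assumption)

    // Unfused pipeline: dequantize -> store -> load -> compute.
    static float[] dequantizeThenMatVec(byte[] q, float[] scales, float[] x,
                                        int rows, int cols) {
        float[] w = new float[rows * cols]; // full dequantized copy written to memory
        for (int i = 0; i < w.length; i++) {
            w[i] = q[i] * scales[i / BLOCK];
        }
        float[] y = new float[rows];
        for (int r = 0; r < rows; r++) {
            float acc = 0f;
            for (int c = 0; c < cols; c++) {
                acc += w[r * cols + c] * x[c]; // weights are read back here
            }
            y[r] = acc;
        }
        return y;
    }

    // Fused pipeline: each weight is dequantized into a local and used at once.
    static float[] fusedMatVec(byte[] q, float[] scales, float[] x,
                               int rows, int cols) {
        float[] y = new float[rows];
        for (int r = 0; r < rows; r++) {
            float acc = 0f;
            for (int c = 0; c < cols; c++) {
                int i = r * cols + c;
                float w = q[i] * scales[i / BLOCK]; // never stored to an array
                acc += w * x[c];
            }
            y[r] = acc;
        }
        return y;
    }
}
```

In this sketch the unfused path writes 4 bytes per weight and reads them back, on top of reading the 1-byte quantized value. The fused path reads only the quantized value and its block scale, which is the traffic reduction the paragraph above describes.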

…optimized matrix-vector kernels, and SiLU-GLU activation
@mikepapadim mikepapadim changed the title from "Implement FP16 support in TornadoVM by introducing HalfFloat arrays, …" to "Implement deq and compute pattern for SGEEMs" Dec 3, 2025
…ce overhead, improve cache utilization, and update task graph setup to integrate fused kernel.
…k graph to integrate `ropeRotationWithCacheCopy` kernel, and remove redundant kernels (`rope` and `copyToCaches`).
…ids, and deprecate redundant tasks in FP16 layer.
…r grid assignments, and enhance attention and FFN block configurations.
…e kernel setup, and enhance FP16 task processing.
…rrays and `mapContextWithQuantizeLogits` kernel, enhancing FP16 computation capabilities
…tailed data flow, task breakdown, and fusion points
…incorporate fused RMS normalization, gate, and up-projection
…FFN task graphs by removing deprecated tasks, consolidating RMS normalization and FFN operations into `rms_ffn_gate_up`.
…FN layers to optimize worker grid configuration.
…tmul`, and `fusedRmsNormQKVMatmul`.

Refactor workers and task graphs to utilize new computations and streamline layer configurations for improved performance and reduced memory transfers.
…te Q/K RMSNorm into a single operation. Cleanup deprecated workers, update task names, and streamline layer configuration.
…e task graphs with fused kernels, reorganize attention and FFN block mapping, and integrate final normalization for non-NVIDIA devices. Add detailed Transformer layer task flow documentation.
…fixes, improved attention computation logic, and optimized handling of large models. Update task graph to revert to `processHeadsFlashAttention` for compatibility.
…it TaskGraph type with `var`, streamline task graph configuration by removing unused temp variables.
…ith fused kernels, update worker grid configurations, and streamline data transfer logic.
…consolidate Q/K/V bias addition into a single operation, and update worker grid configurations. Streamline attention block with optimized task mapping and detailed layer flow documentation.
@mikepapadim mikepapadim requested review from Copilot and orionpapadakis and removed request for Copilot and orionpapadakis December 4, 2025 20:28
mikepapadim and others added 14 commits December 8, 2025 17:32
…g unnecessary `temp` buffers, consolidate transfer flow, and adjust formatting for better readability.
…idate data transfer logic, replace repetitive patterns with reusable methods, and enhance code readability with improved formatting and comments.
… condense parameter lists, streamline formatting, optimize fused operations, and introduce Q8 fused methods for enhanced performance.
…amline task graph setup, enhance fused operations for Q8 quantization, and improve code readability with consistent formatting and comments.
…n kernel arguments in Qwen3 Q8_0 FFN layers.
… the new `fusedRmsNormQKVMatmulQ8_0` kernel in Qwen3 Q8_0 FFN layers.
…le task using `fusedRmsNormFFNGateUpQ8_0` in Qwen3 Q8_0 FFN layers.
…ssociated worker grids in Qwen2 Q8_0 FFN layers.
…pdate worker grids in Qwen2 Q8_0 FFN layers.
…le task and update worker grids in Qwen2 Q8_0 FFN layers.
@mikepapadim mikepapadim merged commit a968bae into main Dec 11, 2025
4 checks passed
