[FP16] Improved performance by fusing dequantize with compute in kernels: 20-30% Inference Speedup #78
Merged
Commits
…optimized matrix-vector kernels, and SiLU-GLU activation
…6 task graph setup
…ce overhead, improve cache utilization, and update task graph setup to integrate fused kernel.
…k graph to integrate `ropeRotationWithCacheCopy` kernel, and remove redundant kernels (`rope` and `copyToCaches`).
…eprecate `mapContext` and `quantizeXb`
…ids, and deprecate redundant tasks in FP16 layer.
…r grid assignments, and enhance attention and FFN block configurations.
…e kernel setup, and enhance FP16 task processing.
…rrays and `mapContextWithQuantizeLogits` kernel, enhancing FP16 computation capabilities
…tailed data flow, task breakdown, and fusion points
…incorporate fused RMS normalization, gate, and up-projection
…FFN task graphs by removing deprecated tasks, consolidating RMS normalization and FFN operations into `rms_ffn_gate_up`.
…FN layers to optimize worker grid configuration.
…tmul`, and `fusedRmsNormQKVMatmul`. Refactor workers and task graphs to utilize new computations and streamline layer configurations for improved performance and reduced memory transfers.
….java into feat/deq-n-compute
…te Q/K RMSNorm into a single operation. Cleanup deprecated workers, update task names, and streamline layer configuration.
…e task graphs with fused kernels, reorganize attention and FFN block mapping, and integrate final normalization for non-NVIDIA devices. Add detailed Transformer layer task flow documentation.
…fixes, improved attention computation logic, and optimized handling of large models. Update task graph to revert to `processHeadsFlashAttention` for compatibility.
…it TaskGraph type with `var`, streamline task graph configuration by removing unused temp variables.
…ith fused kernels, update worker grid configurations, and streamline data transfer logic.
…consolidate Q/K/V bias addition into a single operation, and update worker grid configurations. Streamline attention block with optimized task mapping and detailed layer flow documentation.
…g unnecessary `temp` buffers, consolidate transfer flow, and adjust formatting for better readability.
…idate data transfer logic, replace repetitive patterns with reusable methods, and enhance code readability with improved formatting and comments.
… condense parameter lists, streamline formatting, optimize fused operations, and introduce Q8 fused methods for enhanced performance.
…amline task graph setup, enhance fused operations for Q8 quantization, and improve code readability with consistent formatting and comments.
…n kernel arguments in Qwen3 Q8_0 FFN layers.
… the new `fusedRmsNormQKVMatmulQ8_0` kernel in Qwen3 Q8_0 FFN layers.
…msNorm` kernel in Qwen3 Q8_0 FFN layers.
…le task using `fusedRmsNormFFNGateUpQ8_0` in Qwen3 Q8_0 FFN layers.
…ssociated worker grids in Qwen2 Q8_0 FFN layers.
…pdate worker grids in Qwen2 Q8_0 FFN layers.
…le task and update worker grids in Qwen2 Q8_0 FFN layers.
…Qwen2 Q8_0 FFN layers and update worker grids.
…0 FFN layers and update worker grids.
…onding worker grid in Qwen3 Q8_0 FFN layers.
…Qwen3 Q8_0 FFN layers and update worker grids.
…0 FFN layers and update worker grids.
…3 Q8_0 layers and improve code readability.
… layers and update worker grids.
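Several of the later commits above introduce Q8_0 variants of the fused kernels (e.g. `fusedRmsNormQKVMatmulQ8_0`, `fusedRmsNormFFNGateUpQ8_0`). The sketch below shows what a fused Q8_0 dot product can look like; it assumes GGUF-style Q8_0 blocks (32 int8 weights sharing one FP16 scale), and all names are illustrative, not the PR's actual API:

```java
public class FusedQ8Sketch {
    static final int Q8_0_BLOCK = 32; // GGUF Q8_0: 32 int8 quants per FP16 scale

    // Fused Q8_0 dot product: the scale is applied in-register, once per
    // block, so no dequantized FP32 row is ever written to memory.
    // Assumes cols is a multiple of Q8_0_BLOCK. Requires Java 20+ for
    // Float.float16ToFloat.
    static float dotQ8_0(byte[] quants, short[] scalesFp16, float[] x, int cols) {
        float acc = 0f;
        for (int b = 0; b < cols / Q8_0_BLOCK; b++) {
            float scale = Float.float16ToFloat(scalesFp16[b]);
            float blockAcc = 0f;
            for (int i = 0; i < Q8_0_BLOCK; i++) {
                int c = b * Q8_0_BLOCK + i;
                blockAcc += quants[c] * x[c]; // weight stays quantized here
            }
            acc += scale * blockAcc; // dequantize once per block of 32
        }
        return acc;
    }
}
```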
Summary
Implements fused dequantize-and-compute patterns for quantized matrix-vector operations, eliminating intermediate memory round-trips during inference.
Changes
- Fuse dequantization directly into the quantized matrix-vector kernels, avoiding the previous dequantize → store → load → compute pipeline (see the sketch after this list).
- Optimize the matrix-vector kernels for the memory-bound decode phase.
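A minimal before/after sketch of the fused pattern, written as plain-Java kernel bodies (the real kernels run under TornadoVM, whose setup is omitted); `matvecUnfused`/`matvecFused` and their data layouts are illustrative, not the PR's actual kernels. `Float.float16ToFloat` requires Java 20+.

```java
public class FusedMatvecSketch {

    // Before: pass 1 dequantizes the FP16 row into a scratch buffer (store),
    // pass 2 reads it back for the dot product (load). Every weight makes an
    // extra round-trip through memory between the two passes.
    static void matvecUnfused(short[] wFp16, float[] x, float[] out,
                              float[] scratch, int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                scratch[c] = Float.float16ToFloat(wFp16[r * cols + c]);
            }
            float acc = 0f;
            for (int c = 0; c < cols; c++) {
                acc += scratch[c] * x[c];
            }
            out[r] = acc;
        }
    }

    // After: each weight is converted in-register inside the dot-product
    // loop, so the dequantized value never touches memory.
    static void matvecFused(short[] wFp16, float[] x, float[] out,
                            int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            float acc = 0f;
            for (int c = 0; c < cols; c++) {
                acc += Float.float16ToFloat(wFp16[r * cols + c]) * x[c];
            }
            out[r] = acc;
        }
    }
}
```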
Benchmarks (Llama 3.2 1B FP16)
Why This Works
Single-token generation is memory-bandwidth bound: decode-time matrix-vector operations stream every weight exactly once, so runtime is dominated by bytes moved rather than FLOPs. Fusing dequantization with compute hides the quantization overhead by keeping dequantized values in registers instead of writing them back to memory between operations.
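A back-of-envelope traffic model makes the saving concrete. The assumption that the unfused path materializes a full FP32 copy of the weights in global memory is mine, not stated in the PR; the matrix size is illustrative:

```java
public class TrafficModel {
    public static void main(String[] args) {
        long n = 2048, m = 2048;      // one illustrative FP16 weight matrix
        long fp16Read  = 2L * n * m;  // both paths read the FP16 weights once
        long fp32Write = 4L * n * m;  // unfused only: store dequantized FP32 copy
        long fp32Read  = 4L * n * m;  // unfused only: matmul reads the copy back
        System.out.printf("unfused: %d MiB, fused: %d MiB%n",
                (fp16Read + fp32Write + fp32Read) >> 20, fp16Read >> 20);
        // -> unfused: 40 MiB, fused: 8 MiB, roughly 5x less weight traffic
        // per matvec in a regime where bytes moved set the runtime.
    }
}
```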