[FP16] Improved performance by fusing dequantize with compute in kernels: 20-30% Inference Speedup #78
Merged
Commits
…optimized matrix-vector kernels, and SiLU-GLU activation
…6 task graph setup
…ce overhead, improve cache utilization, and update task graph setup to integrate fused kernel.
…k graph to integrate `ropeRotationWithCacheCopy` kernel, and remove redundant kernels (`rope` and `copyToCaches`).
…eprecate `mapContext` and `quantizeXb`
…ids, and deprecate redundant tasks in FP16 layer.
…r grid assignments, and enhance attention and FFN block configurations.
…e kernel setup, and enhance FP16 task processing.
…rrays and `mapContextWithQuantizeLogits` kernel, enhancing FP16 computation capabilities
…tailed data flow, task breakdown, and fusion points
…incorporate fused RMS normalization, gate, and up-projection
…FFN task graphs by removing deprecated tasks, consolidating RMS normalization and FFN operations into `rms_ffn_gate_up`.
…FN layers to optimize worker grid configuration.
…tmul`, and `fusedRmsNormQKVMatmul`. Refactor workers and task graphs to utilize new computations and streamline layer configurations for improved performance and reduced memory transfers.
….java into feat/deq-n-compute
…te Q/K RMSNorm into a single operation. Cleanup deprecated workers, update task names, and streamline layer configuration.
…e task graphs with fused kernels, reorganize attention and FFN block mapping, and integrate final normalization for non-NVIDIA devices. Add detailed Transformer layer task flow documentation.
…fixes, improved attention computation logic, and optimized handling of large models. Update task graph to revert to `processHeadsFlashAttention` for compatibility.
…it TaskGraph type with `var`, streamline task graph configuration by removing unused temp variables.
…ith fused kernels, update worker grid configurations, and streamline data transfer logic.
…consolidate Q/K/V bias addition into a single operation, and update worker grid configurations. Streamline attention block with optimized task mapping and detailed layer flow documentation.
…g unnecessary `temp` buffers, consolidate transfer flow, and adjust formatting for better readability.
…idate data transfer logic, replace repetitive patterns with reusable methods, and enhance code readability with improved formatting and comments.
… condense parameter lists, streamline formatting, optimize fused operations, and introduce Q8 fused methods for enhanced performance.
…amline task graph setup, enhance fused operations for Q8 quantization, and improve code readability with consistent formatting and comments.
…n kernel arguments in Qwen3 Q8_0 FFN layers.
… the new `fusedRmsNormQKVMatmulQ8_0` kernel in Qwen3 Q8_0 FFN layers.
…msNorm` kernel in Qwen3 Q8_0 FFN layers.
…le task using `fusedRmsNormFFNGateUpQ8_0` in Qwen3 Q8_0 FFN layers.
…ssociated worker grids in Qwen2 Q8_0 FFN layers.
…pdate worker grids in Qwen2 Q8_0 FFN layers.
…le task and update worker grids in Qwen2 Q8_0 FFN layers.
…Qwen2 Q8_0 FFN layers and update worker grids.
…0 FFN layers and update worker grids.
…onding worker grid in Qwen3 Q8_0 FFN layers.
…Qwen3 Q8_0 FFN layers and update worker grids.
…0 FFN layers and update worker grids.
…3 Q8_0 layers and improve code readability.
… layers and update worker grids.
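Several of the later commits above introduce Q8_0 variants of the fused kernels (e.g. `fusedRmsNormQKVMatmulQ8_0`, `fusedRmsNormFFNGateUpQ8_0`). The sketch below shows what a fused Q8_0 dot product can look like; it assumes GGUF-style Q8_0 blocks (32 int8 weights sharing one FP16 scale), and all names are illustrative, not the PR's actual API:

```java
public class FusedQ8Sketch {
    static final int Q8_0_BLOCK = 32; // GGUF Q8_0: 32 int8 quants per FP16 scale

    // Fused Q8_0 dot product: the scale is applied in-register, once per
    // block, so no dequantized FP32 row is ever written to memory.
    // Assumes cols is a multiple of Q8_0_BLOCK. Requires Java 20+ for
    // Float.float16ToFloat.
    static float dotQ8_0(byte[] quants, short[] scalesFp16, float[] x, int cols) {
        float acc = 0f;
        for (int b = 0; b < cols / Q8_0_BLOCK; b++) {
            float scale = Float.float16ToFloat(scalesFp16[b]);
            float blockAcc = 0f;
            for (int i = 0; i < Q8_0_BLOCK; i++) {
                int c = b * Q8_0_BLOCK + i;
                blockAcc += quants[c] * x[c]; // weight stays quantized here
            }
            acc += scale * blockAcc; // dequantize once per block of 32
        }
        return acc;
    }
}
```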
Summary
Implements fused dequantize-and-compute patterns for quantized matrix-vector operations, eliminating intermediate memory round-trips during inference.
Changes
- Fuse dequantization directly into the quantized matrix-vector kernels, avoiding the previous dequantize → store → load → compute pipeline (see the sketch after this list).
- Optimize the matrix-vector kernels for the memory-bound decode phase.
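A minimal before/after sketch of the fused pattern, written as plain-Java kernel bodies (the real kernels run under TornadoVM, whose setup is omitted); `matvecUnfused`/`matvecFused` and their data layouts are illustrative, not the PR's actual kernels. `Float.float16ToFloat` requires Java 20+.

```java
public class FusedMatvecSketch {

    // Before: pass 1 dequantizes the FP16 row into a scratch buffer (store),
    // pass 2 reads it back for the dot product (load). Every weight makes an
    // extra round-trip through memory between the two passes.
    static void matvecUnfused(short[] wFp16, float[] x, float[] out,
                              float[] scratch, int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                scratch[c] = Float.float16ToFloat(wFp16[r * cols + c]);
            }
            float acc = 0f;
            for (int c = 0; c < cols; c++) {
                acc += scratch[c] * x[c];
            }
            out[r] = acc;
        }
    }

    // After: each weight is converted in-register inside the dot-product
    // loop, so the dequantized value never touches memory.
    static void matvecFused(short[] wFp16, float[] x, float[] out,
                            int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            float acc = 0f;
            for (int c = 0; c < cols; c++) {
                acc += Float.float16ToFloat(wFp16[r * cols + c]) * x[c];
            }
            out[r] = acc;
        }
    }
}
```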
Benchmarks (Llama 3.2 1B FP16)
Why This Works
Single-token generation is memory-bandwidth bound: decode-time matrix-vector operations stream every weight exactly once, so runtime is dominated by bytes moved rather than FLOPs. Fusing dequantization with compute hides the quantization overhead by keeping dequantized values in registers instead of writing them back to memory between operations.
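A back-of-envelope traffic model makes the saving concrete. The assumption that the unfused path materializes a full FP32 copy of the weights in global memory is mine, not stated in the PR; the matrix size is illustrative:

```java
public class TrafficModel {
    public static void main(String[] args) {
        long n = 2048, m = 2048;      // one illustrative FP16 weight matrix
        long fp16Read  = 2L * n * m;  // both paths read the FP16 weights once
        long fp32Write = 4L * n * m;  // unfused only: store dequantized FP32 copy
        long fp32Read  = 4L * n * m;  // unfused only: matmul reads the copy back
        System.out.printf("unfused: %d MiB, fused: %d MiB%n",
                (fp16Read + fp32Write + fp32Read) >> 20, fp16Read >> 20);
        // -> unfused: 40 MiB, fused: 8 MiB, roughly 5x less weight traffic
        // per matvec in a regime where bytes moved set the runtime.
    }
}
```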