[Vulkan] Add native decode hot path for attention and KV cache update #16
Description
Problem
Decode still pays too much generic-primitive overhead around attention and KV cache maintenance. Queue and scheduler fixes alone will not close the decode gap while the backend keeps routing these operations through generic hot-path kernels and generic KV update flows.
This is not a duplicate of #6. Issue #6 focuses on matmul tuning tables and vendor-specific kernel selection. This issue focuses on the decode hot path itself: attention plus KV cache update/append behavior.
Why This Matters
The reference runtimes both invest heavily in inference-shaped hot paths:
- ggml-vulkan has substantial attention specialization and decode-oriented KV behavior
- Zinc hand-codes the token loop around attention, KV write, and immediate consumption
MLX needs a native decode hot path rather than paying repeated generic primitive overhead around these operations.
Tasks
- Add a native or fused decode attention path for the common autoregressive token case
- Add a native KV cache append/update path optimized for decode
- Keep small latency-sensitive decode updates on the compute queue unless bulk transfer overlap is measurably better
- Benchmark Qwen3 decode before/after and validate generation correctness
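To make the target behavior concrete, here is a minimal sketch of what a fused decode step computes: append the new token's K/V entry into a preallocated cache in place, then attend the single query token over the causal prefix, with GQA head grouping. This is illustrative numpy, not the MLX/Vulkan implementation; `decode_step` and all shapes are hypothetical names chosen for this example.

```python
import numpy as np

def decode_step(q, k_new, v_new, k_cache, v_cache, pos):
    """One autoregressive decode step (illustrative, single batch).

    q:               (n_q_heads, head_dim)          query for the current token
    k_new / v_new:   (n_kv_heads, head_dim)         K/V for the current token
    k_cache/v_cache: (n_kv_heads, max_seq, head_dim) preallocated caches
    pos:             current token index; cache already holds `pos` entries
    """
    n_q_heads, head_dim = q.shape
    n_kv_heads = k_cache.shape[0]
    group = n_q_heads // n_kv_heads  # GQA: query heads per KV head

    # In-place KV append -- the "native KV cache update" that a fused path
    # performs directly instead of round-tripping through generic copy primitives.
    k_cache[:, pos] = k_new
    v_cache[:, pos] = v_new

    out = np.empty_like(q)
    scale = 1.0 / np.sqrt(head_dim)
    for h in range(n_q_heads):
        kv_h = h // group                    # map query head to its KV head
        keys = k_cache[kv_h, : pos + 1]      # causal prefix, (pos+1, head_dim)
        vals = v_cache[kv_h, : pos + 1]
        scores = keys @ q[h] * scale         # (pos+1,)
        scores -= scores.max()               # numerically stable softmax
        w = np.exp(scores)
        w /= w.sum()
        out[h] = w @ vals                    # (head_dim,)
    return out
```

A fused kernel would do the append and the attention in one dispatch on the compute queue, which is what removes the copy/sync boundaries the acceptance criteria call out.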
Acceptance Criteria
- Qwen3 decode throughput improves materially
- Decode traces show fewer copy/sync boundaries around attention + KV work
- No correctness regressions on causal/GQA decode shapes
References
- mlx-vulkan-reference-conclusions.md
- references/ggml-vulkan-findings.md
- references/zinc-findings.md
- references/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp
- references/llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn*.comp
- references/zinc/src/compute/attention.zig
- references/zinc/src/compute/forward.zig (attention + KV write sequencing)