[Vulkan] Add native decode hot path for attention and KV cache update #16

@goniz

Problem

Decode still spends too much time in generic primitive execution around attention and KV cache maintenance. Queue and scheduler fixes alone will not close the decode performance gap while the backend keeps running generic kernels and generic KV update flows on the hot path.

This is not a duplicate of #6. Issue #6 focuses on matmul tuning tables and vendor-specific kernel selection. This issue focuses on the decode hot path itself: attention plus KV cache update/append behavior.

Why This Matters

The reference runtimes both invest heavily in inference-shaped hot paths:

  • ggml-vulkan has substantial attention specialization and decode-oriented KV behavior
  • Zinc hand-codes the token loop around attention, KV write, and immediate consumption

MLX needs a native decode hot path rather than paying repeated generic primitive overhead around these operations.

Tasks

  • Add a native or fused decode attention path for the common autoregressive token case
  • Add a native KV cache append/update path optimized for decode
  • Keep small latency-sensitive decode updates on the compute queue unless bulk transfer overlap is measurably better
  • Benchmark Qwen3 decode before/after and validate generation correctness
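
As a point of reference for the first two tasks, here is a minimal CPU sketch in plain Python of what one decode step computes: an in-place KV append into a preallocated cache, followed by single-token attention with grouped-query head mapping. It is not the Vulkan implementation and not MLX API. `KVCache`, `decode_attention`, and `decode_step` are hypothetical names, and the nested-list layout `[kv_head][position][dim]` is an assumption chosen for readability. No causal mask appears because, in the single-token case, the new query attends to every cached position including its own.

```python
import math


class KVCache:
    """Hypothetical preallocated K/V storage, laid out [kv_head][position][dim]."""

    def __init__(self, n_kv_heads, max_len, head_dim):
        self.k = [[[0.0] * head_dim for _ in range(max_len)] for _ in range(n_kv_heads)]
        self.v = [[[0.0] * head_dim for _ in range(max_len)] for _ in range(n_kv_heads)]
        self.length = 0
        self.max_len = max_len

    def append(self, k_new, v_new):
        """Write one token's K/V rows in place; k_new and v_new are [kv_head][dim]."""
        if self.length >= self.max_len:
            raise ValueError("KV cache is full")
        pos = self.length
        for h, (k_row, v_row) in enumerate(zip(k_new, v_new)):
            # Slice assignment overwrites the existing slot, so nothing is reallocated.
            self.k[h][pos][:] = k_row
            self.v[h][pos][:] = v_row
        self.length = pos + 1


def decode_attention(q, cache):
    """Single-token attention. q is [q_head][dim]; returns [q_head][dim]."""
    n_q_heads = len(q)
    n_kv_heads = len(cache.k)
    assert n_q_heads % n_kv_heads == 0, "GQA needs q heads divisible by kv heads"
    group = n_q_heads // n_kv_heads
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for qh in range(n_q_heads):
        kvh = qh // group  # GQA: consecutive query heads share one KV head
        keys = cache.k[kvh][:cache.length]
        vals = cache.v[kvh][:cache.length]
        scores = [scale * sum(a * b for a, b in zip(q[qh], k)) for k in keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, vals)) / denom
            for d in range(head_dim)
        ])
    return out


def decode_step(q, k_new, v_new, cache):
    """Append the new token's K/V, then attend over the updated cache."""
    cache.append(k_new, v_new)
    return decode_attention(q, cache)
```

A fused path would produce the same result as `decode_step` while avoiding the intermediate copies and synchronization between the append and the attention read, so a sketch like this can serve as a small correctness oracle for causal/GQA decode shapes.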

Acceptance Criteria

  • Qwen3 decode throughput (tokens/s) improves materially over the pre-change baseline on the same hardware and model configuration
  • Decode traces show fewer copy/sync boundaries around attention + KV work
  • No correctness regressions on causal/GQA decode shapes
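
One way the throughput and correctness criteria could be checked is sketched below in plain Python. `outputs_match` and `tokens_per_second` are hypothetical helpers, and the tolerance values and warmup count are assumptions rather than project thresholds. The timing helper expects a callable that runs exactly one decode step and blocks until the result is ready. With a lazily evaluated backend, that callable has to force evaluation itself, or the measurement covers only graph construction.

```python
import math
import time


def outputs_match(ref, got, rtol=1e-3, atol=1e-3):
    """Compare two [head][dim] outputs elementwise within assumed tolerances."""
    if len(ref) != len(got):
        return False
    for ref_row, got_row in zip(ref, got):
        if len(ref_row) != len(got_row):
            return False
        for r, g in zip(ref_row, got_row):
            if not math.isclose(g, r, rel_tol=rtol, abs_tol=atol):
                return False
    return True


def tokens_per_second(step_fn, n_tokens, warmup=2):
    """Time n_tokens calls of step_fn after a few untimed warmup calls."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_tokens):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")
```

Running the same pair of checks before and after the change, on identical prompts and decode lengths, gives a before/after comparison that can be attached to the issue.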

References

  • mlx-vulkan-reference-conclusions.md
  • references/ggml-vulkan-findings.md
  • references/zinc-findings.md
  • references/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • references/llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/flash_attn*.comp
  • references/zinc/src/compute/attention.zig
  • references/zinc/src/compute/forward.zig (attention + KV write sequencing)
