[Vulkan] Batch transformer decode into token-level execution regions #14

@goniz

Problem

During transformer decode, the Vulkan backend still behaves like a generic primitive scheduler: it makes submit and barrier decisions at the level of individual primitives. Even after the multi-queue work, each decoded token still incurs too many submits and barriers, which leaves substantial CPU-side overhead and keeps the backend from matching inference-oriented runtimes.

This is not a duplicate of #1. Issue #1 is about queue plumbing and timeline semaphore support. This issue is about the shape of decode execution on top of that plumbing.

Why This Matters

Both reference runtimes reduce token latency by recording much larger regions of decode work before each submit:

  • ggml-vulkan chunks graph execution into larger submission regions
  • Zinc tries to record most of one token into one command buffer and one submit

MLX needs a similar decode-oriented execution path instead of fine-grained, primitive-level submit decisions.

Tasks

  • Add a decode-oriented recording mode that batches most of one token into one or a few compute submission regions
  • Only flush at explicit phase boundaries, readbacks, or hard cross-queue dependencies
  • Track submits/token, barriers/token, and transfer submits/token in bounded profiling output
  • Validate generation correctness on Qwen3/Llama decode workloads after batching changes
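
To make the first three tasks concrete, here is a minimal sketch of the flush policy and the per-token counters. It is not tied to the real MLX or Vulkan API: `DecodeRecorder`, `FlushReason`, and `DecodeStats` are hypothetical names, dispatches are only counted rather than recorded into a real command buffer, and charging one transfer submit per readback is an assumption made for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical reasons for ending a submission region (not an MLX API).
enum class FlushReason { PhaseBoundary, Readback, CrossQueueDependency };

// Per-run counters; the ratios are what bounded profiling output would report.
struct DecodeStats {
  std::uint64_t tokens = 0;
  std::uint64_t submits = 0;
  std::uint64_t barriers = 0;
  std::uint64_t transfer_submits = 0;

  double per_token(std::uint64_t n) const {
    return tokens ? static_cast<double>(n) / static_cast<double>(tokens) : 0.0;
  }
  double submits_per_token() const { return per_token(submits); }
  double barriers_per_token() const { return per_token(barriers); }
  double transfer_submits_per_token() const { return per_token(transfer_submits); }

  // One fixed-size line per report, so the output stays bounded.
  std::string summary() const {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "submits/token=%.2f barriers/token=%.2f transfer/token=%.2f",
                  submits_per_token(), barriers_per_token(),
                  transfer_submits_per_token());
    return std::string(buf);
  }
};

// Stand-in for a decode-oriented recording mode: dispatches accumulate in an
// open region and are submitted only when flush() is called explicitly.
class DecodeRecorder {
 public:
  // Append one compute dispatch to the open region. A barrier is counted only
  // if the dispatch depends on earlier work inside the same region.
  void record_dispatch(bool depends_on_previous) {
    if (depends_on_previous && pending_ > 0) ++stats_.barriers;
    ++pending_;
  }

  // Submit the open region. An empty region costs no submit.
  void flush(FlushReason reason) {
    if (pending_ == 0) return;
    ++stats_.submits;
    // Assumption: a readback also costs one submit on a transfer queue.
    if (reason == FlushReason::Readback) ++stats_.transfer_submits;
    pending_ = 0;
  }

  // Close out a token: anything still pending is submitted as a final region.
  void end_token() {
    flush(FlushReason::PhaseBoundary);
    ++stats_.tokens;
  }

  const DecodeStats& stats() const { return stats_; }

 private:
  DecodeStats stats_;
  std::uint64_t pending_ = 0;  // dispatches in the currently open region
};
```

A caller would invoke `record_dispatch` for each primitive in the token, call `flush` only at a phase boundary, readback, or hard cross-queue dependency, and call `end_token` once per token. Comparing `stats().summary()` before and after the batching change gives the submits/token and barriers/token numbers the acceptance criteria refer to.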

Acceptance Criteria

  • Submits per token drop materially on Qwen3 decode
  • Barrier count per token drops materially on Qwen3 decode
  • No generation correctness regressions on GPU

References

  • mlx-vulkan-reference-conclusions.md
  • references/ggml-vulkan-findings.md
  • references/zinc-findings.md
  • references/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp (graph chunking / larger execution regions)
  • references/zinc/src/compute/forward.zig (decode step recorded as one long token execution region)
