[Vulkan] Batch transformer decode into token-level execution regions #14
Open
Problem
The Vulkan backend still behaves too much like a generic primitive scheduler during transformer decode. Even after the multi-queue work, decode still makes submit and barrier decisions at per-primitive granularity, which leaves substantial CPU-side overhead per token and prevents the backend from matching inference-oriented runtimes.
This is not a duplicate of #1. Issue #1 is about queue plumbing and timeline semaphore support. This issue is about the shape of decode execution on top of that plumbing.
Why This Matters
The reference runtimes both reduce token latency by recording much larger regions of decode work before submit:
- ggml-vulkan chunks graph execution into larger submission regions
- Zinc tries to record most of one token into one command buffer and one submit
MLX needs a similar decode-oriented execution path rather than making fine-grained primitive-level submit decisions.
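To make the shape of that execution path concrete, here is a minimal sketch of the flush policy this issue is asking for. All names here are invented for illustration (this is not MLX's actual API): dispatches keep accumulating into one open recording region, and only a small set of hard boundaries forces a submit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical event kinds seen while recording one token of decode.
// (Illustrative only; the real backend would derive these from the graph.)
enum class Event : uint8_t {
  ComputeDispatch,  // ordinary kernel in the token's decode graph
  PhaseBoundary,    // explicit phase boundary, e.g. end of token
  Readback,         // CPU needs results (e.g. logits for sampling)
  CrossQueueDep,    // hard dependency on work in another queue family
};

// Only the hard boundaries listed in the Tasks section force a flush.
bool should_flush(Event e) {
  switch (e) {
    case Event::PhaseBoundary:
    case Event::Readback:
    case Event::CrossQueueDep:
      return true;
    case Event::ComputeDispatch:
      return false;  // keep batching into the current region
  }
  return false;
}

// Counts submits a token's event stream would produce under this policy:
// each flush covers the entire batched region, however many dispatches
// it contains, instead of one submit per primitive.
int count_submits(const std::vector<Event>& events) {
  int submits = 0;
  bool pending = false;
  for (Event e : events) {
    if (e == Event::ComputeDispatch) {
      pending = true;               // record into the open region
    } else if (should_flush(e) && pending) {
      submits += 1;                 // one submit for the whole region
      pending = false;
    }
  }
  if (pending) submits += 1;        // final flush at end of token
  return submits;
}
```

Under this policy a token with a hundred kernels and one logits readback costs one submit, where a per-primitive scheduler would pay a hundred.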
Tasks
- Add a decode-oriented recording mode that batches most of one token into one or a few compute submission regions
- Only flush at explicit phase boundaries, readbacks, or hard cross-queue dependencies
- Track `submits/token`, `barriers/token`, and `transfer submits/token` in bounded profiling output
- Validate generation correctness on Qwen3/Llama decode workloads after batching changes
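A sketch of what the bounded profiling output could track per token. Field and type names are illustrative, not MLX's actual ones; the point is that the counters live in a fixed-size ring so profiling stays O(1) in memory regardless of sequence length.

```cpp
#include <cstdint>

// Per-token counters matching the metrics named in the Tasks section.
struct DecodeTokenStats {
  uint32_t submits = 0;           // compute submissions this token
  uint32_t barriers = 0;          // pipeline barriers recorded this token
  uint32_t transfer_submits = 0;  // transfer-queue submissions this token
};

// Bounded window over recent tokens (hypothetical name and window size).
struct DecodeProfile {
  static constexpr int kWindow = 64;  // bounded history
  DecodeTokenStats ring[kWindow]{};
  int count = 0;  // tokens recorded, saturating at kWindow
  int next = 0;   // ring write index

  void record(const DecodeTokenStats& s) {
    ring[next] = s;
    next = (next + 1) % kWindow;
    if (count < kWindow) ++count;
  }

  double avg_submits_per_token() const {
    if (count == 0) return 0.0;
    uint64_t total = 0;
    for (int i = 0; i < count; ++i) total += ring[i].submits;
    return static_cast<double>(total) / count;
  }
};
```

The acceptance criteria below ("submits per token drop materially") would then be a before/after comparison of `avg_submits_per_token()` on the same Qwen3 decode workload.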
Acceptance Criteria
- Submits per token drop materially on Qwen3 decode
- Barrier count per token drops materially on Qwen3 decode
- No generation correctness regressions on GPU
References
- mlx-vulkan-reference-conclusions.md
- references/ggml-vulkan-findings.md
- references/zinc-findings.md
- references/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp (graph chunking / larger execution regions)
- references/zinc/src/compute/forward.zig (decode step recorded as one long token execution region)