[Vulkan] Batch transformer decode into token-level execution regions #14
Open
Problem
The Vulkan backend still behaves too much like a generic primitive scheduler during transformer decode. Even after the multi-queue work, decode still makes submit and barrier decisions at per-primitive granularity, which leaves substantial CPU-side overhead per token and prevents the backend from matching inference-oriented runtimes.
This is not a duplicate of #1. Issue #1 is about queue plumbing and timeline semaphore support. This issue is about the shape of decode execution on top of that plumbing.
Why This Matters
The reference runtimes both reduce token latency by recording much larger regions of decode work before submit:
- ggml-vulkan chunks graph execution into larger submission regions
- Zinc tries to record most of one token into one command buffer and one submit
MLX needs a similar decode-oriented execution path rather than making fine-grained primitive-level submit decisions.
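To make the shape of that execution path concrete, here is a minimal sketch of the flush policy this issue is asking for. All names here are invented for illustration (this is not MLX's actual API): dispatches keep accumulating into one open recording region, and only a small set of hard boundaries forces a submit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical event kinds seen while recording one token of decode.
// (Illustrative only; the real backend would derive these from the graph.)
enum class Event : uint8_t {
  ComputeDispatch,  // ordinary kernel in the token's decode graph
  PhaseBoundary,    // explicit phase boundary, e.g. end of token
  Readback,         // CPU needs results (e.g. logits for sampling)
  CrossQueueDep,    // hard dependency on work in another queue family
};

// Only the hard boundaries listed in the Tasks section force a flush.
bool should_flush(Event e) {
  switch (e) {
    case Event::PhaseBoundary:
    case Event::Readback:
    case Event::CrossQueueDep:
      return true;
    case Event::ComputeDispatch:
      return false;  // keep batching into the current region
  }
  return false;
}

// Counts submits a token's event stream would produce under this policy:
// each flush covers the entire batched region, however many dispatches
// it contains, instead of one submit per primitive.
int count_submits(const std::vector<Event>& events) {
  int submits = 0;
  bool pending = false;
  for (Event e : events) {
    if (e == Event::ComputeDispatch) {
      pending = true;               // record into the open region
    } else if (should_flush(e) && pending) {
      submits += 1;                 // one submit for the whole region
      pending = false;
    }
  }
  if (pending) submits += 1;        // final flush at end of token
  return submits;
}
```

Under this policy a token with a hundred kernels and one logits readback costs one submit, where a per-primitive scheduler would pay a hundred.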
Tasks
- Add a decode-oriented recording mode that batches most of one token into one or a few compute submission regions
- Only flush at explicit phase boundaries, readbacks, or hard cross-queue dependencies
- Track `submits/token`, `barriers/token`, and `transfer submits/token` in bounded profiling output
- Validate generation correctness on Qwen3/Llama decode workloads after batching changes
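A sketch of what the bounded profiling output could track per token. Field and type names are illustrative, not MLX's actual ones; the point is that the counters live in a fixed-size ring so profiling stays O(1) in memory regardless of sequence length.

```cpp
#include <cstdint>

// Per-token counters matching the metrics named in the Tasks section.
struct DecodeTokenStats {
  uint32_t submits = 0;           // compute submissions this token
  uint32_t barriers = 0;          // pipeline barriers recorded this token
  uint32_t transfer_submits = 0;  // transfer-queue submissions this token
};

// Bounded window over recent tokens (hypothetical name and window size).
struct DecodeProfile {
  static constexpr int kWindow = 64;  // bounded history
  DecodeTokenStats ring[kWindow]{};
  int count = 0;  // tokens recorded, saturating at kWindow
  int next = 0;   // ring write index

  void record(const DecodeTokenStats& s) {
    ring[next] = s;
    next = (next + 1) % kWindow;
    if (count < kWindow) ++count;
  }

  double avg_submits_per_token() const {
    if (count == 0) return 0.0;
    uint64_t total = 0;
    for (int i = 0; i < count; ++i) total += ring[i].submits;
    return static_cast<double>(total) / count;
  }
};
```

The acceptance criteria below ("submits per token drop materially") would then be a before/after comparison of `avg_submits_per_token()` on the same Qwen3 decode workload.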
Acceptance Criteria
- Submits per token drop materially on Qwen3 decode
- Barrier count per token drops materially on Qwen3 decode
- No generation correctness regressions on GPU
References
- mlx-vulkan-reference-conclusions.md
- references/ggml-vulkan-findings.md
- references/zinc-findings.md
- references/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp (graph chunking / larger execution regions)
- references/zinc/src/compute/forward.zig (decode step recorded as one long token execution region)