
feat: GPU as async dequant coprocessor for CPU-primary inference #49

@homeofe

Summary

On systems where the GPU is underutilized during CPU-primary inference (e.g. Apple Silicon unified memory, or a discrete GPU sitting mostly idle), it may be worth routing KV-cache dequantization through the GPU on demand while the CPU handles the rest of attention + FFN.

Motivation

TurboQuant's WHT (Walsh-Hadamard transform) rotation and centroid lookup are GPU-friendly by design. On CPU, the dequant overhead largely cancels out the memory-bandwidth savings. But if a GPU is already present and idle, it could act as a dedicated dequant coprocessor:

  • KV-cache stays in system RAM (compressed, turbo2/3/4 format)
  • GPU is called on-demand for WHT rotation + dequant batches
  • CPU handles the rest (attention dot-product, FFN, sampling)
  • GPU returns to idle between decode steps

Why this might work

Apple Silicon (unified memory): CPU and GPU share the same physical RAM pool. No PCIe transfer needed; the GPU kernel can operate directly on the compressed KV buffer via Metal shared heap. Overhead is purely kernel launch latency (~5-20µs), which should amortize well at 32K+ context lengths.
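As a rough sanity check on that amortization claim, here is a back-of-envelope model. All the numbers (head count, head dimension, bitwidth, effective GPU bandwidth, launch latency) are assumptions for illustration, not measurements:

```python
# Back-of-envelope: what fraction of per-step GPU time is kernel launch
# overhead, assuming ONE batched launch dequantizes the whole compressed
# K cache? All parameter defaults are illustrative assumptions.

def dequant_overhead_fraction(context_len, n_heads=32, head_dim=128,
                              bits_per_weight=2, gpu_bw_gbs=200.0,
                              launch_us=10.0):
    """Launch overhead as a fraction of total per-step GPU time."""
    compressed_bytes = context_len * n_heads * head_dim * bits_per_weight / 8
    dequant_us = compressed_bytes / (gpu_bw_gbs * 1e3)  # GB/s -> bytes/us
    return launch_us / (launch_us + dequant_us)

for n in (512, 4096, 32768):
    print(n, round(dequant_overhead_fraction(n), 3))
```

Under these assumptions the launch overhead dominates at short contexts (roughly 80% at 512 tokens) but drops below 10% by 32K, which is consistent with the intuition above.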

Discrete GPU: PCIe bandwidth (~32 GB/s) is a real bottleneck, but at long contexts Sparse V already skips a large fraction of V dequants. With batching across multiple decode steps or heads, the transfer cost per token should be manageable.
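To make "manageable" concrete, a simple transfer model (again, all parameters are illustrative assumptions, including the Sparse V keep rate):

```python
# Estimated PCIe time per decode step for the discrete-GPU case:
# compressed V rows uploaded, dequantized fp16 rows returned, keeping
# only `sparse_keep` of rows (Sparse V skips the rest). Assumed numbers.

def pcie_transfer_us_per_step(context_len, n_heads=32, head_dim=128,
                              bits_in=2, sparse_keep=0.1, pcie_gbs=32.0):
    rows = context_len * n_heads * sparse_keep
    up = rows * head_dim * bits_in / 8   # compressed rows to GPU
    down = rows * head_dim * 2           # fp16 results back to CPU
    return (up + down) / (pcie_gbs * 1e3)  # GB/s -> bytes/us

print(round(pcie_transfer_us_per_step(32768)))
```

With these numbers the transfer cost at 32K context comes out to roughly 1 ms per step, small relative to a CPU decode step that takes tens of milliseconds at that context length. Note that in this model the fp16 return traffic dominates the compressed upload, so keeping the dequantized output on-GPU (or returning partial dot products instead) would be the bigger lever.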

Proposed approach

  1. Shared/zero-copy buffer: CUDA Unified Memory or Metal shared heap for the compressed KV tensor
  2. Async GPU kernel dispatch: enqueue dequant for the next few K/V rows while CPU processes current attention scores
  3. Batching: group multiple head dequants into a single kernel launch to amortize overhead
  4. Sparse V integration: skip GPU dispatch entirely for tokens where attention weight < 1e-6 (already implemented in Sparse V kernel)
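Step 2 above can be sketched as a toy simulation: a worker thread stands in for the GPU dequant kernel, and dequant of the next block overlaps with CPU attention on the current one. Function names, shapes, and the int8/scale format are illustrative only; a real implementation would be a Metal/CUDA kernel plus event waits rather than Python futures.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def gpu_dequant(block_q, scale):
    # Stand-in for the GPU dequant kernel (WHT rotation + lookup omitted).
    return block_q.astype(np.float32) * scale

def cpu_attention(q, k_block):
    # Stand-in for CPU-side attention scores over one dequantized K block.
    return q @ k_block.T

rng = np.random.default_rng(0)
q = rng.standard_normal(128).astype(np.float32)
blocks = [rng.integers(-8, 8, size=(64, 128), dtype=np.int8) for _ in range(4)]

scores = []
with ThreadPoolExecutor(max_workers=1) as gpu:   # the "GPU" queue
    pending = gpu.submit(gpu_dequant, blocks[0], 0.1)  # prefetch first block
    for i in range(len(blocks)):
        k = pending.result()                     # wait for current dequant
        if i + 1 < len(blocks):
            pending = gpu.submit(gpu_dequant, blocks[i + 1], 0.1)  # enqueue next
        scores.append(cpu_attention(q, k))       # overlaps with GPU work

print(len(scores), scores[0].shape)
```

The double-buffering pattern is the point here: the CPU never blocks on dequant except for the very first block, which is exactly the behavior a ggml-level implementation would need from the backend's async graph.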

Expected benefit

  • Keeps GPU power draw low (idle most of the time, burst on KV access)
  • Unlocks turbo2/turbo3 quality on CPU-primary setups that previously saw no speed gain
  • Particularly attractive for Apple Silicon where the unified memory model makes this near-zero-cost architecturally

Open questions

  • Minimum context length where kernel launch overhead amortizes?
  • Can ggml's async compute graph accommodate a mixed CPU/GPU dispatch without a full backend rewrite?
  • Is Metal command buffer chaining expressive enough for this pattern, or does it need a custom scheduler?

Happy to discuss further or prototype a benchmark if helpful.
