feat: GPU as async dequant coprocessor for CPU-primary inference #49
Summary
On systems where the GPU is underutilized during CPU-primary inference (e.g. Apple Silicon unified memory, or a discrete GPU sitting mostly idle), it may be worth routing KV-cache dequantization through the GPU on demand while the CPU handles the rest of the decode step (attention dot-products, FFN, sampling).
Motivation
TurboQuant's WHT (Walsh-Hadamard transform) rotation and centroid lookup are GPU-optimized by design. On CPU, the dequant overhead largely cancels out the memory bandwidth savings. But if a GPU is already present and idle, it could act as a dedicated dequant coprocessor:
- KV-cache stays in system RAM (compressed, turbo2/3/4 format)
- GPU is called on-demand for WHT rotation + dequant batches
- CPU handles the rest (attention dot-product, FFN, sampling)
- GPU returns to idle between decode steps
Why this might work
Apple Silicon (unified memory): CPU and GPU share the same physical RAM pool. No PCIe transfer needed; the GPU kernel can operate directly on the compressed KV buffer via Metal shared heap. Overhead is purely kernel launch latency (~5-20µs), which should amortize well at 32K+ context lengths.
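A quick back-of-envelope check of the amortization claim. All numbers here are assumptions, not measurements: ~10µs launch latency, ~400 GB/s effective GPU read bandwidth over the compressed buffer, a 3-bit layout, and hypothetical head counts; the helper itself is illustrative.

```python
# Assumed (not measured): launch latency and effective dequant bandwidth.
LAUNCH_US = 10.0        # kernel launch latency, microseconds
GPU_READ_GBPS = 400.0   # effective GPU read bandwidth on compressed KV

def launch_overhead_fraction(context_len, n_kv_heads=8, head_dim=128,
                             bits_per_weight=3):
    """Fraction of one K+V dequant pass spent on kernel launch."""
    # compressed bytes touched per decode step (K and V rows)
    nbytes = 2 * context_len * n_kv_heads * head_dim * bits_per_weight / 8
    kernel_us = nbytes / (GPU_READ_GBPS * 1e3)  # 1 GB/s == 1e3 bytes/us
    return LAUNCH_US / (LAUNCH_US + kernel_us)

print(f"{launch_overhead_fraction(4096):.0%}")   # short context: launch-dominated
print(f"{launch_overhead_fraction(32768):.0%}")  # long context: well amortized
```

Under these assumptions, launch latency eats over half the pass at 4K context but falls to well under 20% at 32K, consistent with the 32K+ amortization estimate above.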
Discrete GPU: PCIe bandwidth (~32 GB/s) is a real bottleneck, but at long contexts Sparse V already skips a large fraction of V dequants. With batching across multiple decode steps or heads, the transfer cost per token should be manageable.
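A sizing helper for the discrete-GPU case (again, assumed numbers and a hypothetical function): per decode step, compressed K/V rows go up over PCIe and dequantized fp16 rows come back for CPU-side attention. The fp16 return path dominates, which is exactly why the Sparse V skip fraction matters here.

```python
PCIE_GBPS = 32.0  # assumed effective PCIe bandwidth

def pcie_us_per_step(context_len, n_kv_heads=8, head_dim=128,
                     bits_per_weight=3, v_skip_frac=0.9):
    """Rough PCIe time per decode step, in microseconds."""
    row_vals = context_len * n_kv_heads * head_dim
    # upload: compressed K rows, plus only the V rows Sparse V keeps
    up = (row_vals + row_vals * (1 - v_skip_frac)) * bits_per_weight / 8
    # download: dequantized fp16 K rows, plus the kept fp16 V rows
    down = (row_vals + row_vals * (1 - v_skip_frac)) * 2
    return (up + down) / (PCIE_GBPS * 1e3)  # 1 GB/s == 1e3 bytes/us
```

Because the fp16 download is the dominant term, raising `v_skip_frac` (or moving to unified memory, where the round trip disappears) is what makes this path viable rather than batching alone.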
Proposed approach
- Shared/zero-copy buffer: CUDA Unified Memory or Metal shared heap for the compressed KV tensor
- Async GPU kernel dispatch: enqueue dequant for the next few K/V rows while CPU processes current attention scores
- Batching: group multiple head dequants into a single kernel launch to amortize overhead
- Sparse V integration: skip GPU dispatch entirely for tokens where attention weight < 1e-6 (already implemented in Sparse V kernel)
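The async-dispatch-plus-batching pattern above can be sketched as follows, with a single worker thread standing in for the GPU queue (a CUDA stream or Metal command buffer in the real thing). All names are illustrative; `dequant_batch` and `cpu_attend` are placeholders, not real kernels.

```python
from concurrent.futures import ThreadPoolExecutor

# stand-in for a CUDA stream / Metal command queue
gpu_queue = ThreadPoolExecutor(max_workers=1)

def dequant_batch(rows):
    # placeholder for the GPU WHT-rotation + dequant kernel
    return [r * 2 for r in rows]

def cpu_attend(rows):
    # placeholder for the CPU-side attention dot products
    return rows

def decode_step(kv_rows, batch=4):
    """Enqueue dequant of the next batch while the CPU consumes the
    current one, so launch latency hides behind CPU compute."""
    batches = [kv_rows[i:i + batch] for i in range(0, len(kv_rows), batch)]
    in_flight = gpu_queue.submit(dequant_batch, batches[0])
    out = []
    for nxt in batches[1:]:
        ready = in_flight.result()                        # current batch done
        in_flight = gpu_queue.submit(dequant_batch, nxt)  # kick off the next
        out.extend(cpu_attend(ready))                     # CPU overlaps GPU
    out.extend(cpu_attend(in_flight.result()))
    return out
```

The single-worker queue preserves submission order, mirroring in-order GPU command submission; the Sparse V check would sit just before `gpu_queue.submit`, dropping batches whose attention weights fall below the 1e-6 threshold.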
Expected benefit
- Keeps GPU power draw low (idle most of the time, burst on KV access)
- Unlocks turbo2/turbo3 quality on CPU-primary setups that previously saw no speed gain
- Particularly attractive for Apple Silicon where the unified memory model makes this near-zero-cost architecturally
Open questions
- Minimum context length where kernel launch overhead amortizes?
- Can ggml's async compute graph accommodate a mixed CPU/GPU dispatch without a full backend rewrite?
- Is Metal command buffer chaining expressive enough for this pattern, or does it need a custom scheduler?
Happy to discuss further or prototype a benchmark if helpful.