feat: GPU as async dequant coprocessor for CPU-primary inference #49
Summary
On systems where the GPU is underutilized during CPU-primary inference (e.g. Apple Silicon unified memory, or a discrete GPU sitting mostly idle), it may be worth routing KV-cache dequantization through the GPU on demand while the CPU handles the rest of the decode step (attention dot-products, FFN, sampling).
Motivation
TurboQuant's WHT (Walsh-Hadamard transform) rotation and centroid lookup are GPU-optimized by design. On CPU, the dequant overhead largely cancels out the memory bandwidth savings. But if a GPU is already present and idle, it could act as a dedicated dequant coprocessor:
- KV-cache stays in system RAM (compressed, turbo2/3/4 format)
- GPU is called on-demand for WHT rotation + dequant batches
- CPU handles the rest (attention dot-product, FFN, sampling)
- GPU returns to idle between decode steps
Why this might work
Apple Silicon (unified memory): CPU and GPU share the same physical RAM pool. No PCIe transfer needed; the GPU kernel can operate directly on the compressed KV buffer via Metal shared heap. Overhead is purely kernel launch latency (~5-20µs), which should amortize well at 32K+ context lengths.
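A quick back-of-envelope check of the amortization claim. All numbers here are assumptions, not measurements: ~10µs launch latency, ~400 GB/s effective GPU read bandwidth over the compressed buffer, a 3-bit layout, and hypothetical head counts; the helper itself is illustrative.

```python
# Assumed (not measured): launch latency and effective dequant bandwidth.
LAUNCH_US = 10.0        # kernel launch latency, microseconds
GPU_READ_GBPS = 400.0   # effective GPU read bandwidth on compressed KV

def launch_overhead_fraction(context_len, n_kv_heads=8, head_dim=128,
                             bits_per_weight=3):
    """Fraction of one K+V dequant pass spent on kernel launch."""
    # compressed bytes touched per decode step (K and V rows)
    nbytes = 2 * context_len * n_kv_heads * head_dim * bits_per_weight / 8
    kernel_us = nbytes / (GPU_READ_GBPS * 1e3)  # 1 GB/s == 1e3 bytes/us
    return LAUNCH_US / (LAUNCH_US + kernel_us)

print(f"{launch_overhead_fraction(4096):.0%}")   # short context: launch-dominated
print(f"{launch_overhead_fraction(32768):.0%}")  # long context: well amortized
```

Under these assumptions, launch latency eats over half the pass at 4K context but falls to well under 20% at 32K, consistent with the 32K+ amortization estimate above.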
Discrete GPU: PCIe bandwidth (~32 GB/s) is a real bottleneck, but at long contexts Sparse V already skips a large fraction of V dequants. With batching across multiple decode steps or heads, the transfer cost per token should be manageable.
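A sizing helper for the discrete-GPU case (again, assumed numbers and a hypothetical function): per decode step, compressed K/V rows go up over PCIe and dequantized fp16 rows come back for CPU-side attention. The fp16 return path dominates, which is exactly why the Sparse V skip fraction matters here.

```python
PCIE_GBPS = 32.0  # assumed effective PCIe bandwidth

def pcie_us_per_step(context_len, n_kv_heads=8, head_dim=128,
                     bits_per_weight=3, v_skip_frac=0.9):
    """Rough PCIe time per decode step, in microseconds."""
    row_vals = context_len * n_kv_heads * head_dim
    # upload: compressed K rows, plus only the V rows Sparse V keeps
    up = (row_vals + row_vals * (1 - v_skip_frac)) * bits_per_weight / 8
    # download: dequantized fp16 K rows, plus the kept fp16 V rows
    down = (row_vals + row_vals * (1 - v_skip_frac)) * 2
    return (up + down) / (PCIE_GBPS * 1e3)  # 1 GB/s == 1e3 bytes/us
```

Because the fp16 download is the dominant term, raising `v_skip_frac` (or moving to unified memory, where the round trip disappears) is what makes this path viable rather than batching alone.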
Proposed approach
- Shared/zero-copy buffer: CUDA Unified Memory or Metal shared heap for the compressed KV tensor
- Async GPU kernel dispatch: enqueue dequant for the next few K/V rows while CPU processes current attention scores
- Batching: group multiple head dequants into a single kernel launch to amortize overhead
- Sparse V integration: skip GPU dispatch entirely for tokens where attention weight < 1e-6 (already implemented in Sparse V kernel)
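The async-dispatch-plus-batching pattern above can be sketched as follows, with a single worker thread standing in for the GPU queue (a CUDA stream or Metal command buffer in the real thing). All names are illustrative; `dequant_batch` and `cpu_attend` are placeholders, not real kernels.

```python
from concurrent.futures import ThreadPoolExecutor

# stand-in for a CUDA stream / Metal command queue
gpu_queue = ThreadPoolExecutor(max_workers=1)

def dequant_batch(rows):
    # placeholder for the GPU WHT-rotation + dequant kernel
    return [r * 2 for r in rows]

def cpu_attend(rows):
    # placeholder for the CPU-side attention dot products
    return rows

def decode_step(kv_rows, batch=4):
    """Enqueue dequant of the next batch while the CPU consumes the
    current one, so launch latency hides behind CPU compute."""
    batches = [kv_rows[i:i + batch] for i in range(0, len(kv_rows), batch)]
    in_flight = gpu_queue.submit(dequant_batch, batches[0])
    out = []
    for nxt in batches[1:]:
        ready = in_flight.result()                        # current batch done
        in_flight = gpu_queue.submit(dequant_batch, nxt)  # kick off the next
        out.extend(cpu_attend(ready))                     # CPU overlaps GPU
    out.extend(cpu_attend(in_flight.result()))
    return out
```

The single-worker queue preserves submission order, mirroring in-order GPU command submission; the Sparse V check would sit just before `gpu_queue.submit`, dropping batches whose attention weights fall below the 1e-6 threshold.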
Expected benefit
- Keeps GPU power draw low (idle most of the time, burst on KV access)
- Unlocks turbo2/turbo3 quality on CPU-primary setups that previously saw no speed gain
- Particularly attractive for Apple Silicon where the unified memory model makes this near-zero-cost architecturally
Open questions
- Minimum context length where kernel launch overhead amortizes?
- Can ggml's async compute graph accommodate a mixed CPU/GPU dispatch without a full backend rewrite?
- Is Metal command buffer chaining expressive enough for this pattern, or does it need a custom scheduler?
Happy to discuss further or prototype a benchmark if helpful.