UMA-Native KV-Cache Benchmarks on M4 Pro 64GB – kv4 Outperforms Unquantized #3134
ainatechnology started this conversation in **Show and tell**
I've been researching UMA-native approaches to LLM inference on Apple Silicon, specifically focusing on KV-cache memory behavior and quantization trade-offs. With 40 years in software architecture but coming fresh to ML internals, I used AI assistants (Claude, Gemini) extensively for implementation – but the research direction, experimental design, and interpretation are mine.
I'm sharing these results because I found surprisingly little empirical data on actual KV-cache memory scaling and quantization performance on Apple Silicon, despite it being the key constraint for long-context inference on consumer hardware.
Hardware & Software

Apple M4 Pro with 64 GB unified memory; inference via mlx-lm (`generate_step`).
Key Finding: kv4 Is Free (or Faster)
At 4096 generated tokens, KV-cache quantization via `kv_bits=4` in `generate_step` shows:

Phi-3.5-mini (MHA, 32 KV-heads):
GQA is the key factor: Qwen2.5's 4:1 GQA ratio (8 KV-heads vs. 40 attention heads) halves the KV-cache per token relative to Phi-3.5-mini's full MHA, even though Qwen is roughly 4x the model size.
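The per-token arithmetic behind that claim can be sketched in a few lines. The layer counts, KV-head counts, and head dimensions below come from the published model configurations; treat them as assumptions if your checkpoint differs:

```python
# Back-of-envelope KV-cache sizing: bytes per generated token.
# KV cache stores one K and one V vector per layer per KV head.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV-cache size (K and V, fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Phi-3.5-mini: full MHA (32 KV heads), 32 layers, head_dim 96 (assumed config)
phi = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=96)

# Qwen2.5-14B: 4:1 GQA (8 KV heads vs 40 query heads), 48 layers, head_dim 128
qwen = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)

print(f"Phi-3.5-mini: {phi / 1024:.0f} KiB/token")   # 384 KiB/token
print(f"Qwen2.5-14B:  {qwen / 1024:.0f} KiB/token")  # 192 KiB/token
```

Despite the extra layers and wider heads, Qwen2.5-14B lands at exactly half the per-token footprint, because the 4:1 GQA ratio dominates.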
Observations
- **kv4 should arguably be the default** for Apple Silicon inference. It provides 3.2x more context capacity at zero or negative performance cost.
- **SSD offloading is unnecessary** for models up to 14B on 64GB hardware, even at extreme context lengths. The memory headroom is enormous.
- **GQA architecture matters more than model size** for KV-cache scaling. When evaluating models for long-context local inference, KV-head count is the critical parameter.
- **The bottleneck at long context is compute (O(n²) attention), not memory.** At 32k tokens with Qwen2.5-14B, we're using only 14GB of 64GB while TPS has already dropped 30%.
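A rough sizing sketch supports the memory side of the last observation. The ~4.5 effective bits per element is an assumption (4-bit payload in groups of 64, with an fp16 scale and bias per group, MLX-style affine quantization); the actual overhead depends on the group size in use:

```python
# kv4 cache footprint for Qwen2.5-14B at growing context lengths,
# showing why memory headroom is not the binding constraint.

KV_ELEMS_PER_TOKEN = 2 * 48 * 8 * 128      # K and V, 48 layers, 8 KV heads, head_dim 128
EFFECTIVE_BITS = 4 + (2 * 16) / 64         # 4-bit payload + fp16 scale/bias per 64-elem group

for n_tokens in (4_096, 32_768, 131_072):
    gib = n_tokens * KV_ELEMS_PER_TOKEN * EFFECTIVE_BITS / 8 / 2**30
    print(f"{n_tokens:>7} tokens -> {gib:5.2f} GiB kv4 cache")
```

Even at 128k context, the quantized cache stays under 7 GiB; attention FLOPs per decoded token, by contrast, grow linearly with context, so throughput degrades long before memory runs out.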
Questions for the Community
Note on Methodology
Full transparency: The benchmark script, memory monitoring, and analysis were developed in collaboration with AI assistants (primarily Claude). I designed the experiments, ran them on my hardware, verified the results, and drew the conclusions – but the implementation was AI-assisted. The benchmark code and raw JSON data are available on request.
*Related: This partially answers the unanswered question in [#1808] about running Qwen2.5 with 128k context on MLX – on M4 Pro 64GB with kv4, the theoretical limit for Qwen2.5-14B is ~800k tokens.*
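A back-of-envelope check on that ceiling. Both inputs here are assumptions rather than measurements: ~48 GiB of headroom left after 4-bit weights and the OS on a 64 GB machine, and ~4.5 effective bits per cached element for kv4:

```python
# Rough ceiling on kv4 context length for Qwen2.5-14B on a 64 GB machine.

KV_ELEMS_PER_TOKEN = 2 * 48 * 8 * 128            # 48 layers, 8 KV heads, head_dim 128
BYTES_PER_TOKEN = KV_ELEMS_PER_TOKEN * 4.5 / 8   # ~54 KiB/token at kv4 (assumed 4.5 eff. bits)
HEADROOM = 48 * 2**30                            # assumed free memory after weights + OS

max_tokens = HEADROOM / BYTES_PER_TOKEN
print(f"~{max_tokens / 1e3:.0f}k tokens")
```

This lands in the same ballpark as the quoted ~800k; the exact figure shifts with the quantization group size and how much headroom you reserve for the OS and activations.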