UMA-Native KV-Cache Benchmarks on M4 Pro 64GB – kv4 Outperforms Unquantized #3134
ainatechnology started this conversation in **Show and tell**
I've been researching UMA-native approaches to LLM inference on Apple Silicon, specifically focusing on KV-cache memory behavior and quantization trade-offs. With 40 years in software architecture but coming fresh to ML internals, I used AI assistants (Claude, Gemini) extensively for implementation – but the research direction, experimental design, and interpretation are mine.
I'm sharing these results because I found surprisingly little empirical data on actual KV-cache memory scaling and quantization performance on Apple Silicon, despite it being the key constraint for long-context inference on consumer hardware.
Hardware & Software

Apple M4 Pro with 64 GB unified memory; inference via mlx-lm (`generate_step`).
Key Finding: kv4 Is Free (or Faster)
At 4096 generated tokens, KV-cache quantization via `kv_bits=4` in `generate_step` shows:

Phi-3.5-mini (MHA, 32 KV-heads):
GQA is the key factor: Qwen2.5's 4:1 GQA ratio (8 KV-heads vs. 40 attention heads) halves the KV-cache per token relative to Phi-3.5-mini's full MHA, even though Qwen is roughly 4x the model size.
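The per-token arithmetic behind that claim can be sketched in a few lines. The layer counts, KV-head counts, and head dimensions below come from the published model configurations; treat them as assumptions if your checkpoint differs:

```python
# Back-of-envelope KV-cache sizing: bytes per generated token.
# KV cache stores one K and one V vector per layer per KV head.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV-cache size (K and V, fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Phi-3.5-mini: full MHA (32 KV heads), 32 layers, head_dim 96 (assumed config)
phi = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=96)

# Qwen2.5-14B: 4:1 GQA (8 KV heads vs 40 query heads), 48 layers, head_dim 128
qwen = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)

print(f"Phi-3.5-mini: {phi / 1024:.0f} KiB/token")   # 384 KiB/token
print(f"Qwen2.5-14B:  {qwen / 1024:.0f} KiB/token")  # 192 KiB/token
```

Despite the extra layers and wider heads, Qwen2.5-14B lands at exactly half the per-token footprint, because the 4:1 GQA ratio dominates.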
Observations
- **kv4 should arguably be the default** for Apple Silicon inference. It provides 3.2x more context capacity at zero or negative performance cost.
- **SSD offloading is unnecessary** for models up to 14B on 64GB hardware, even at extreme context lengths. The memory headroom is enormous.
- **GQA architecture matters more than model size** for KV-cache scaling. When evaluating models for long-context local inference, KV-head count is the critical parameter.
- **The bottleneck at long context is compute (O(n²) attention), not memory.** At 32k tokens with Qwen2.5-14B, we're using only 14GB of 64GB while TPS has already dropped 30%.
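A rough sizing sketch supports the memory side of the last observation. The ~4.5 effective bits per element is an assumption (4-bit payload in groups of 64, with an fp16 scale and bias per group, MLX-style affine quantization); the actual overhead depends on the group size in use:

```python
# kv4 cache footprint for Qwen2.5-14B at growing context lengths,
# showing why memory headroom is not the binding constraint.

KV_ELEMS_PER_TOKEN = 2 * 48 * 8 * 128      # K and V, 48 layers, 8 KV heads, head_dim 128
EFFECTIVE_BITS = 4 + (2 * 16) / 64         # 4-bit payload + fp16 scale/bias per 64-elem group

for n_tokens in (4_096, 32_768, 131_072):
    gib = n_tokens * KV_ELEMS_PER_TOKEN * EFFECTIVE_BITS / 8 / 2**30
    print(f"{n_tokens:>7} tokens -> {gib:5.2f} GiB kv4 cache")
```

Even at 128k context, the quantized cache stays under 7 GiB; attention FLOPs per decoded token, by contrast, grow linearly with context, so throughput degrades long before memory runs out.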
Questions for the Community
Note on Methodology
Full transparency: The benchmark script, memory monitoring, and analysis were developed in collaboration with AI assistants (primarily Claude). I designed the experiments, ran them on my hardware, verified the results, and drew the conclusions – but the implementation was AI-assisted. The benchmark code and raw JSON data are available on request.
*Related: This partially answers the unanswered question in [#1808] about running Qwen2.5 with 128k context on MLX – on M4 Pro 64GB with kv4, the theoretical limit for Qwen2.5-14B is ~800k tokens.*
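A back-of-envelope check on that ceiling. Both inputs here are assumptions rather than measurements: ~48 GiB of headroom left after 4-bit weights and the OS on a 64 GB machine, and ~4.5 effective bits per cached element for kv4:

```python
# Rough ceiling on kv4 context length for Qwen2.5-14B on a 64 GB machine.

KV_ELEMS_PER_TOKEN = 2 * 48 * 8 * 128            # 48 layers, 8 KV heads, head_dim 128
BYTES_PER_TOKEN = KV_ELEMS_PER_TOKEN * 4.5 / 8   # ~54 KiB/token at kv4 (assumed 4.5 eff. bits)
HEADROOM = 48 * 2**30                            # assumed free memory after weights + OS

max_tokens = HEADROOM / BYTES_PER_TOKEN
print(f"~{max_tokens / 1e3:.0f}k tokens")
```

This lands in the same ballpark as the quoted ~800k; the exact figure shifts with the quantization group size and how much headroom you reserve for the OS and activations.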