Systematic inference benchmarks: 5 models × 6 quants × 7 context lengths on M3 Ultra #3209
guruswami-ai started this conversation in Show and tell
Overview
We ran systematic inference benchmarks across 5 models, 6 quantizations, and 7 context lengths on Mac Studio M3 Ultra (512 GB). 276 benchmark runs total. Sharing the raw data and findings since we couldn't find comparable context-scaling data published for Apple Silicon.
Hardware: Mac Studio M3 Ultra (512 GB unified memory, ~800 GB/s bandwidth)
Software: MLX + mlx-lm (JACCL build), macOS 26.3
Method:
mlx_lm.benchmark, 3 trials per config, batch=1, 256 generation tokens, median TPS reported.

Key Findings
1. Context length has a much larger impact on TPS than quantization
This was the most surprising result. At 128K context, Q2 is only 1.7× faster than F16 — down from 4.6× at 1K. The KV cache (always FP16) dominates memory bandwidth at long context, making model weight quantization increasingly irrelevant.
Qwen 32B — TPS grid:
At 1K: Q2 is 4.6× F16. At 128K: Q2 is 1.7× F16. All quants converge because ~70%+ of bandwidth is consumed reading the FP16 KV cache.
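A back-of-envelope bandwidth model illustrates why the quants converge. The layer/head numbers below are assumptions for a Qwen-32B-class GQA architecture (64 layers, 8 KV heads, head dim 128), not values taken from the benchmark harness:

```python
def kv_cache_bytes(context, layers=64, kv_heads=8, head_dim=128, fp16=2):
    # K and V tensors, always FP16, across all layers
    return 2 * layers * kv_heads * head_dim * fp16 * context

def bytes_per_token(params, weight_bits, context):
    # Each generated token streams all weights plus the whole KV cache
    return params * weight_bits / 8 + kv_cache_bytes(context)

def q2_vs_f16_speedup(params, context):
    # Bandwidth-bound ceiling on the Q2-over-F16 speedup
    return bytes_per_token(params, 16, context) / bytes_per_token(params, 2, context)
```

Under these assumptions the ceiling falls from roughly 7.8× at 1K tokens to roughly 2.3× at 128K; the measured speedups (4.6× and 1.7×) sit below that ceiling, but the convergence trend is the same.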
2. TTFT is compute-bound and quantization-independent
Prefill (prompt processing) time is determined by FLOPs, not memory bandwidth. Quantization does not help:
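A hedged sketch of the compute-bound scaling, using the TTFT ≈ 2 × params × context / TFLOPS rule of thumb given below; the ~40% utilization factor is my assumption (peak TFLOPS is never fully achieved), not a measured value:

```python
def ttft_seconds(params, context_tokens, tflops=54e12, mfu=0.40):
    """Prefill time = total FLOPs / achieved FLOP throughput.
    There is no weight-precision term, so quantization cannot help."""
    flops = 2 * params * context_tokens
    return flops / (tflops * mfu)
```

At this assumed utilization, `ttft_seconds(405e9, 16384)` comes out to roughly 10 minutes, and doubling the context doubles the wait.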
Llama 405B TTFT — nearly identical across quants:
Formula:
TTFT ≈ 2 × params × context_tokens / TFLOPS. M3 Ultra delivers ~54 TFLOPS FP16. At 405B × 16K tokens, that's ~10 minutes regardless of quantization.

3. MoE changes everything
Mixtral 8x7B (47B total, 12.9B active) vs Qwen 32B (32B dense):
2.2× faster generation AND 2.2× faster prefill despite being a "larger" model. Only 12.9B active parameters are read per token.
Mixtral 8x7B — TPS grid:
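The generation-side speedup follows directly from bytes read per token. A sketch of the bandwidth-bound TPS ceiling, assuming Q4 weights and the ~800 GB/s figure above:

```python
def tps_ceiling(active_params, weight_bits=4, bandwidth=800e9):
    # Short-context ceiling: bandwidth / bytes of weights read per token
    return bandwidth / (active_params * weight_bits / 8)

mixtral = tps_ceiling(12.9e9)  # MoE: only active experts are streamed
qwen32 = tps_ceiling(32e9)     # dense: every weight, every token
```

The ceiling ratio is 32/12.9 ≈ 2.5×, in line with the measured ~2.2×.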
4. Dense models >100B are impractical for interactive use
Llama 405B — TPS grid:
F16 (810 GB) doesn't fit. Q8 at 32K was OOM-killed. Even Q2, at 5 TPS, is only marginally reading speed. TTFT at any context above 4K means minutes of waiting.
5. Kimi K2.5 (1T MoE, 32B active)
Only INT4 quantization is available from source. The 612 GB model requires the full 512 GB node. 11 TPS at short context is usable but not fast.
6. Prompt TPS degrades at long context (quadratic attention)
Even prefill slows down at longer sequences:
(Qwen 32B data. O(n²) attention dominates at long sequences.)
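The degradation is what a two-term FLOP model predicts: weight matmuls scale linearly with prompt length while attention scales quadratically. Hidden size, layer count, and utilization below are my assumptions for a Qwen-32B-class model, not harness values:

```python
def prefill_flops(params, n, d_model=5120, layers=64):
    linear = 2 * params * n                  # weight matmuls: O(n)
    attention = 4 * layers * d_model * n**2  # QK^T and AV: O(n^2)
    return linear, attention

def prompt_tps(params, n, tflops=54e12, mfu=0.40):
    # Prompt tokens processed per second of prefill
    linear, attention = prefill_flops(params, n)
    return n / ((linear + attention) / (tflops * mfu))
```

Under these assumptions attention is negligible at 1K but roughly 2.7× the linear term at 128K, cutting prompt TPS severalfold.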
7. 512 GB is overkill for single-model inference
The practical sweet spot (~30B Q4) uses 17-27 GB — 5% of available memory. A 64 GB M4 Pro Mac Mini would deliver identical TPS since performance is bandwidth-limited, not capacity-limited. The 512 GB is valuable for research, multi-model serving, fine-tuning, and embedding workloads — not single-model inference.
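The arithmetic behind the sweet-spot claim, counting weights only (runtime and KV-cache overhead push the observed 17-27 GB figure above this):

```python
def weights_gb(params, weight_bits):
    # Raw weight footprint in GB at a given bit width
    return params * weight_bits / 8 / 1e9

q4_30b = weights_gb(30e9, 4)  # 15 GB of weights
share = q4_30b / 512          # under 3% of a 512 GB node
```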
Methodology Notes
Raw Data
All 276 benchmark results are available as JSONL at guruswami-ai/chakra/benchmarks/results.
Happy to answer questions or run additional configs if there are gaps the community wants filled. Distributed (TP2/TP4 over TB5 RDMA) and batch-size benchmarks are next.