Systematic inference benchmarks: 5 models × 6 quants × 7 context lengths on M3 Ultra #3209
guruswami-ai started this conversation in Show and tell
Overview
We ran systematic inference benchmarks across 5 models, 6 quantizations, and 7 context lengths on Mac Studio M3 Ultra (512 GB). 276 benchmark runs total. Sharing the raw data and findings since we couldn't find comparable context-scaling data published for Apple Silicon.
Hardware: Mac Studio M3 Ultra (512 GB unified memory, ~800 GB/s bandwidth)
Software: MLX + mlx-lm (JACCL build), macOS 26.3
Method:
mlx_lm.benchmark, 3 trials per config, batch=1, 256 generation tokens, median TPS reported.

Key Findings
1. Context length has a much larger impact on TPS than quantization
This was the most surprising result. At 128K context, Q2 is only 1.7× faster than F16 — down from 4.6× at 1K. The KV cache (always FP16) dominates memory bandwidth at long context, making model weight quantization increasingly irrelevant.
Qwen 32B — TPS grid:
At 1K: Q2 is 4.6× F16. At 128K: Q2 is 1.7× F16. All quants converge because ~70%+ of bandwidth is consumed reading the FP16 KV cache.
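A back-of-envelope bandwidth model illustrates why the quants converge. The layer/head numbers below are assumptions for a Qwen-32B-class GQA architecture (64 layers, 8 KV heads, head dim 128), not values taken from the benchmark harness:

```python
def kv_cache_bytes(context, layers=64, kv_heads=8, head_dim=128, fp16=2):
    # K and V tensors, always FP16, across all layers
    return 2 * layers * kv_heads * head_dim * fp16 * context

def bytes_per_token(params, weight_bits, context):
    # Each generated token streams all weights plus the whole KV cache
    return params * weight_bits / 8 + kv_cache_bytes(context)

def q2_vs_f16_speedup(params, context):
    # Bandwidth-bound ceiling on the Q2-over-F16 speedup
    return bytes_per_token(params, 16, context) / bytes_per_token(params, 2, context)
```

Under these assumptions the ceiling falls from roughly 7.8× at 1K tokens to roughly 2.3× at 128K; the measured speedups (4.6× and 1.7×) sit below that ceiling, but the convergence trend is the same.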
2. TTFT is compute-bound and quantization-independent
Prefill (prompt processing) time is determined by FLOPs, not memory bandwidth. Quantization does not help:
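A hedged sketch of the compute-bound scaling, using the TTFT ≈ 2 × params × context / TFLOPS rule of thumb given below; the ~40% utilization factor is my assumption (peak TFLOPS is never fully achieved), not a measured value:

```python
def ttft_seconds(params, context_tokens, tflops=54e12, mfu=0.40):
    """Prefill time = total FLOPs / achieved FLOP throughput.
    There is no weight-precision term, so quantization cannot help."""
    flops = 2 * params * context_tokens
    return flops / (tflops * mfu)
```

At this assumed utilization, `ttft_seconds(405e9, 16384)` comes out to roughly 10 minutes, and doubling the context doubles the wait.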
Llama 405B TTFT — nearly identical across quants:
Formula:
TTFT ≈ 2 × params × context_tokens / TFLOPS. M3 Ultra delivers ~54 TFLOPS FP16. At 405B × 16K tokens, that's ~10 minutes regardless of quantization.

3. MoE changes everything
Mixtral 8x7B (47B total, 12.9B active) vs Qwen 32B (32B dense):
2.2× faster generation AND 2.2× faster prefill despite being a "larger" model. Only 12.9B active parameters are read per token.
Mixtral 8x7B — TPS grid:
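The generation-side speedup follows directly from bytes read per token. A sketch of the bandwidth-bound TPS ceiling, assuming Q4 weights and the ~800 GB/s figure above:

```python
def tps_ceiling(active_params, weight_bits=4, bandwidth=800e9):
    # Short-context ceiling: bandwidth / bytes of weights read per token
    return bandwidth / (active_params * weight_bits / 8)

mixtral = tps_ceiling(12.9e9)  # MoE: only active experts are streamed
qwen32 = tps_ceiling(32e9)     # dense: every weight, every token
```

The ceiling ratio is 32/12.9 ≈ 2.5×, in line with the measured ~2.2×.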
4. Dense models >100B are impractical for interactive use
Llama 405B — TPS grid:
F16 (810 GB) doesn't fit. Q8 at 32K was OOM-killed. Even Q2, at 5 TPS, is only marginally reading speed. TTFT at any context above 4K means minutes of waiting.
5. Kimi K2.5 (1T MoE, 32B active)
Only INT4 quantization is available from source. The 612 GB model requires the full 512 GB node. 11 TPS at short context is usable but not fast.
6. Prompt TPS degrades at long context (quadratic attention)
Even prefill slows down at longer sequences:
(Qwen 32B data. O(n²) attention dominates at long sequences.)
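The degradation is what a two-term FLOP model predicts: weight matmuls scale linearly with prompt length while attention scales quadratically. Hidden size, layer count, and utilization below are my assumptions for a Qwen-32B-class model, not harness values:

```python
def prefill_flops(params, n, d_model=5120, layers=64):
    linear = 2 * params * n                  # weight matmuls: O(n)
    attention = 4 * layers * d_model * n**2  # QK^T and AV: O(n^2)
    return linear, attention

def prompt_tps(params, n, tflops=54e12, mfu=0.40):
    # Prompt tokens processed per second of prefill
    linear, attention = prefill_flops(params, n)
    return n / ((linear + attention) / (tflops * mfu))
```

Under these assumptions attention is negligible at 1K but roughly 2.7× the linear term at 128K, cutting prompt TPS severalfold.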
7. 512 GB is overkill for single-model inference
The practical sweet spot (~30B Q4) uses 17-27 GB — 5% of available memory. A 64 GB M4 Pro Mac Mini would deliver identical TPS since performance is bandwidth-limited, not capacity-limited. The 512 GB is valuable for research, multi-model serving, fine-tuning, and embedding workloads — not single-model inference.
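The arithmetic behind the sweet-spot claim, counting weights only (runtime and KV-cache overhead push the observed 17-27 GB figure above this):

```python
def weights_gb(params, weight_bits):
    # Raw weight footprint in GB at a given bit width
    return params * weight_bits / 8 / 1e9

q4_30b = weights_gb(30e9, 4)  # 15 GB of weights
share = q4_30b / 512          # under 3% of a 512 GB node
```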
Methodology Notes
Raw Data
All 276 benchmark results are available as JSONL at guruswami-ai/chakra/benchmarks/results.
Happy to answer questions or run additional configs if there are gaps the community wants filled. Distributed (TP2/TP4 over TB5 RDMA) and batch-size benchmarks are next.