Published: 290 MLX inference benchmarks + 43 perplexity measurements across 10 models on M3 Ultra cluster #3300
guruswami-ai started this conversation in Show and tell
Update (March 25): The repo has grown significantly since the initial post. New content:
- Pipeline parallelism patches offered to mlx-lm (discussion #1051). PP2 loses 4% generation TPS where TP2 loses 31% on Qwen 32B, with working implementations for Llama, Qwen2, and Mixtral; a conceptual sketch of the layer partitioning follows below.
- RDMA documentation linked on mlx-lm issue #955: 6 documented TB5 failure modes with fixes.
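For readers who have not followed the PP discussion: conceptually, pipeline parallelism splits the decoder layers into contiguous blocks, one block per node, so activations cross the wire only once per stage per token. The sketch below is illustrative only, not the actual patch from discussion #1051, and the 64-layer count is an assumed figure for a Qwen 32B-class model.

```python
# Illustrative layer partitioning for pipeline parallelism (NOT the
# mlx-lm patch from discussion #1051).
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous blocks of decoder layers to pipeline stages,
    spreading any remainder over the earliest stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        count = base + (1 if s < extra else 0)
        stages.append(range(start, start + count))
        start += count
    return stages

# PP2 on an assumed 64-layer model: each node runs half the layers and
# hands activations forward once per token, which is why PP has fewer
# sync points than TP (where every layer's matmuls need a collective).
print(partition_layers(64, 2))  # [range(0, 32), range(32, 64)]
```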
Following up on my earlier posts (#2990, #3209) — the full benchmark dataset is now published as a standalone repo with methodology, findings, and cross-platform comparison:
guruswami-ai/mlx-benchmarks
What's new since the last posts
Key findings
- Single-node generation hits 95-111% of the theoretical bandwidth limit. MLX extracts nearly everything the silicon offers.
- `TPS = bandwidth / model_size` predicts measured results within 5%.
- MoE models break the naive bandwidth model by ~3×: Mixtral is predicted at 23 TPS using total params (47B) but measured at 69 TPS. You must use the active parameter count (13B); see the worked example after this list.
- TP scaling efficiency is quant-dependent. Q8 TP2 retains 93% of single-node TPS; Q2 TP2 retains only 52%. Larger model shards per node scale better.
- PP is gentler than TP on generation. PP2 loses 4% gen TPS vs TP2 losing 31% (Qwen 32B Q4). PP has fewer sync points.
- Kimi K2.5 (1T params) runs at 16 TPS on TP4: interactive speed on a trillion-parameter model, on four Mac Studios.
- DeepSeek V3 (671B) fits on a single M3 Ultra: 380 GB at Q4, 20.2 TPS. No multi-node setup needed.
- Q5 beats Q8 on perplexity for Llama 8B and Mixtral 8x7B. Quantisation acts as regularisation (the perplexity metric itself is sketched after this list).
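To make the bandwidth model concrete, here is a back-of-envelope sketch of the total-vs-active prediction for Mixtral, plus the DeepSeek footprint arithmetic. The 819 GB/s figure is Apple's nominal M3 Ultra memory bandwidth, and ~4.5 bits per Q4 parameter (4-bit weights plus group-wise scale overhead) is my assumption rather than a number from the repo; the absolute TPS values will differ from the measurements above, but the 47/13 ≈ 3.6× ratio between the two predictions is the same ~3× gap the findings describe.

```python
# Back-of-envelope decode-speed model: generation is memory-bandwidth
# bound, so TPS ~= bandwidth / bytes_read_per_token. Illustrative
# constants; not taken from the benchmark repo.
BANDWIDTH_GB_S = 819.0        # nominal M3 Ultra memory bandwidth (assumed)
BYTES_PER_PARAM_Q4 = 4.5 / 8  # 4-bit weights + per-group scales (assumed)

def predicted_tps(params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """TPS = bandwidth / model_size, sized by the params actually read."""
    return bandwidth_gb_s / (params_b * bytes_per_param)

# Mixtral 8x7B: 47B total params, ~13B active per generated token.
# Sizing by total params under-predicts decode speed by 47/13 ~= 3.6x,
# the same shape as the post's 23 TPS predicted vs 69 TPS measured.
print(predicted_tps(47, BYTES_PER_PARAM_Q4))  # naive: total params
print(predicted_tps(13, BYTES_PER_PARAM_Q4))  # better: active params

# Footprint sanity check: DeepSeek V3's 671B params at ~4.5 bits/param
# is 671e9 * 4.5 / 8 bytes ~= 377 GB, close to the reported ~380 GB Q4
# single-node footprint.
print(671 * 4.5 / 8)  # ~377 (GB)
```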
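For readers unfamiliar with the metric behind the Q5-vs-Q8 finding: perplexity is the exponential of the mean per-token negative log-likelihood, so lower means the model is less surprised by the evaluation text. A minimal sketch of the definition (the repo's actual evaluation harness may differ):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood over tokens). Lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy numbers, not from the repo: if a Q5 model's average NLL on the eval
# set comes out slightly below the Q8 baseline's, its perplexity is lower,
# which is the shape of the Q5-beats-Q8 result reported above.
print(perplexity([2.31, 1.87, 2.05]))  # ~7.98
```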
What the repo includes
This data is also the foundation for a graphical cluster simulator we are building — an interactive tool where you adjust model/quant/topology/context and see real TPS/TTFT/power impact. More on that soon.
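To give a flavour of what the simulator's analytic core could look like (purely a hypothetical sketch on my part, not the tool itself): start from a single-node TPS estimate and scale it by measured retention factors. Only the retention values below come from this post's findings; the function and lookup table are invented for illustration.

```python
# Hypothetical estimator: scale single-node TPS by the multi-node
# retention factors reported in the findings above. The (scheme, nodes,
# quant) keys and everything structural here are illustrative.
RETENTION = {
    ("tp", 2, "q8"): 0.93,  # Q8 TP2 retains 93% of single-node TPS
    ("tp", 2, "q4"): 0.69,  # TP2 loses 31% on Qwen 32B Q4
    ("tp", 2, "q2"): 0.52,  # Q2 TP2 retains only 52%
    ("pp", 2, "q4"): 0.96,  # PP2 loses only 4% gen TPS
}

def cluster_tps(single_node_tps: float, scheme: str, nodes: int,
                quant: str) -> float:
    """Apply a measured retention factor to a single-node estimate."""
    factor = RETENTION.get((scheme, nodes, quant))
    if factor is None:
        raise ValueError("no measured retention factor for this topology")
    return single_node_tps * factor

# A model decoding at 25 TPS on one node would be expected to hold ~24 TPS
# under PP2 but drop to ~17 TPS under TP2 at Q4.
print(cluster_tps(25.0, "pp", 2, "q4"))  # 24.0
print(cluster_tps(25.0, "tp", 2, "q4"))  # 17.25
```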
Feedback welcome. If we got something wrong or you have data from your own hardware, happy to incorporate it.