Published: 290 MLX inference benchmarks + 43 perplexity measurements across 10 models on M3 Ultra cluster #3300
guruswami-ai started this conversation in Show and tell
Update (March 25): The repo has grown significantly since the initial post. New content:
- Pipeline parallelism patches offered to mlx-lm (discussion #1051). PP2 loses 4% generation TPS where TP2 loses 31% on Qwen 32B, with working implementations for Llama, Qwen2, and Mixtral; a conceptual sketch of the layer partitioning follows below.
- RDMA documentation linked on mlx-lm issue #955: 6 documented TB5 failure modes with fixes.
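For readers who have not followed the PP discussion: conceptually, pipeline parallelism splits the decoder layers into contiguous blocks, one block per node, so activations cross the wire only once per stage per token. The sketch below is illustrative only, not the actual patch from discussion #1051, and the 64-layer count is an assumed figure for a Qwen 32B-class model.

```python
# Illustrative layer partitioning for pipeline parallelism (NOT the
# mlx-lm patch from discussion #1051).
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous blocks of decoder layers to pipeline stages,
    spreading any remainder over the earliest stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        count = base + (1 if s < extra else 0)
        stages.append(range(start, start + count))
        start += count
    return stages

# PP2 on an assumed 64-layer model: each node runs half the layers and
# hands activations forward once per token, which is why PP has fewer
# sync points than TP (where every layer's matmuls need a collective).
print(partition_layers(64, 2))  # [range(0, 32), range(32, 64)]
```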
Following up on my earlier posts (#2990, #3209) — the full benchmark dataset is now published as a standalone repo with methodology, findings, and cross-platform comparison:
guruswami-ai/mlx-benchmarks
What's new since the last posts
Key findings
- Single-node generation hits 95-111% of the theoretical bandwidth limit. MLX extracts nearly everything the silicon offers.
- `TPS = bandwidth / model_size` predicts measured results within 5%.
- MoE models break the naive bandwidth model by ~3×: Mixtral is predicted at 23 TPS using total params (47B) but measured at 69 TPS. You must use the active parameter count (13B); see the worked example after this list.
- TP scaling efficiency is quant-dependent. Q8 TP2 retains 93% of single-node TPS; Q2 TP2 retains only 52%. Larger model shards per node scale better.
- PP is gentler than TP on generation. PP2 loses 4% gen TPS vs TP2 losing 31% (Qwen 32B Q4). PP has fewer sync points.
- Kimi K2.5 (1T params) runs at 16 TPS on TP4: interactive speed on a trillion-parameter model, on four Mac Studios.
- DeepSeek V3 (671B) fits on a single M3 Ultra: 380 GB at Q4, 20.2 TPS. No multi-node setup needed.
- Q5 beats Q8 on perplexity for Llama 8B and Mixtral 8x7B. Quantisation acts as regularisation (the perplexity metric itself is sketched after this list).
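To make the bandwidth model concrete, here is a back-of-envelope sketch of the total-vs-active prediction for Mixtral, plus the DeepSeek footprint arithmetic. The 819 GB/s figure is Apple's nominal M3 Ultra memory bandwidth, and ~4.5 bits per Q4 parameter (4-bit weights plus group-wise scale overhead) is my assumption rather than a number from the repo; the absolute TPS values will differ from the measurements above, but the 47/13 ≈ 3.6× ratio between the two predictions is the same ~3× gap the findings describe.

```python
# Back-of-envelope decode-speed model: generation is memory-bandwidth
# bound, so TPS ~= bandwidth / bytes_read_per_token. Illustrative
# constants; not taken from the benchmark repo.
BANDWIDTH_GB_S = 819.0        # nominal M3 Ultra memory bandwidth (assumed)
BYTES_PER_PARAM_Q4 = 4.5 / 8  # 4-bit weights + per-group scales (assumed)

def predicted_tps(params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """TPS = bandwidth / model_size, sized by the params actually read."""
    return bandwidth_gb_s / (params_b * bytes_per_param)

# Mixtral 8x7B: 47B total params, ~13B active per generated token.
# Sizing by total params under-predicts decode speed by 47/13 ~= 3.6x,
# the same shape as the post's 23 TPS predicted vs 69 TPS measured.
print(predicted_tps(47, BYTES_PER_PARAM_Q4))  # naive: total params
print(predicted_tps(13, BYTES_PER_PARAM_Q4))  # better: active params

# Footprint sanity check: DeepSeek V3's 671B params at ~4.5 bits/param
# is 671e9 * 4.5 / 8 bytes ~= 377 GB, close to the reported ~380 GB Q4
# single-node footprint.
print(671 * 4.5 / 8)  # ~377 (GB)
```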
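For readers unfamiliar with the metric behind the Q5-vs-Q8 finding: perplexity is the exponential of the mean per-token negative log-likelihood, so lower means the model is less surprised by the evaluation text. A minimal sketch of the definition (the repo's actual evaluation harness may differ):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood over tokens). Lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy numbers, not from the repo: if a Q5 model's average NLL on the eval
# set comes out slightly below the Q8 baseline's, its perplexity is lower,
# which is the shape of the Q5-beats-Q8 result reported above.
print(perplexity([2.31, 1.87, 2.05]))  # ~7.98
```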
What the repo includes
This data is also the foundation for a graphical cluster simulator we are building — an interactive tool where you adjust model/quant/topology/context and see real TPS/TTFT/power impact. More on that soon.
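To give a flavour of what the simulator's analytic core could look like (purely a hypothetical sketch on my part, not the tool itself): start from a single-node TPS estimate and scale it by measured retention factors. Only the retention values below come from this post's findings; the function and lookup table are invented for illustration.

```python
# Hypothetical estimator: scale single-node TPS by the multi-node
# retention factors reported in the findings above. The (scheme, nodes,
# quant) keys and everything structural here are illustrative.
RETENTION = {
    ("tp", 2, "q8"): 0.93,  # Q8 TP2 retains 93% of single-node TPS
    ("tp", 2, "q4"): 0.69,  # TP2 loses 31% on Qwen 32B Q4
    ("tp", 2, "q2"): 0.52,  # Q2 TP2 retains only 52%
    ("pp", 2, "q4"): 0.96,  # PP2 loses only 4% gen TPS
}

def cluster_tps(single_node_tps: float, scheme: str, nodes: int,
                quant: str) -> float:
    """Apply a measured retention factor to a single-node estimate."""
    factor = RETENTION.get((scheme, nodes, quant))
    if factor is None:
        raise ValueError("no measured retention factor for this topology")
    return single_node_tps * factor

# A model decoding at 25 TPS on one node would be expected to hold ~24 TPS
# under PP2 but drop to ~17 TPS under TP2 at Q4.
print(cluster_tps(25.0, "pp", 2, "q4"))  # 24.0
print(cluster_tps(25.0, "tp", 2, "q4"))  # 17.25
```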
Feedback welcome. If we got something wrong or you have data from your own hardware, happy to incorporate it.