TB5 RDMA Benchmarks: Pipeline Parallelism Nearly Matches Tensor Parallelism on Kimi-K2 (1T) #2990
Update: Context Length Limitations Discovered

During further testing, I discovered a significant limitation with distributed prefill that dramatically restricts usable context length for pipeline parallelism.

The Problem

When testing longer prompts, I hit Metal GPU timeout errors.

Context Length Benchmark Results
Key Findings
Why This Happens

Pipeline parallelism: each stage must complete prefill for all tokens sequentially before passing activations to the next stage. With 5 stages, the cumulative time exceeds Metal's ~60 second command buffer timeout.

Why TP handles longer contexts: tensor parallelism processes prefill in parallel across all nodes simultaneously. All nodes work on the same tokens at once, avoiding the per-stage timeout accumulation.

Known Issue

This is acknowledged by the MLX maintainers:
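A back-of-envelope model of that accumulation (the per-stage prefill throughput of 150 tok/s is an assumed, illustrative number, not a measurement from these benchmarks):

```python
METAL_TIMEOUT_S = 60.0                       # approximate Metal command-buffer limit

def pp_prefill_seconds(tokens, stages, stage_tok_per_s=150.0):
    """PP: stages prefill sequentially, so per-stage times accumulate."""
    return stages * (tokens / stage_tok_per_s)

def tp_prefill_seconds(tokens, tok_per_s=150.0):
    """TP: all nodes work on the same tokens at once; no accumulation."""
    return tokens / tok_per_s

for n in (500, 1500, 4000):
    print(n, pp_prefill_seconds(n, stages=5), tp_prefill_seconds(n))
```

Under these assumed numbers, a 4,000-token prompt through 5 PP stages blows well past the 60 s budget, while the same prompt under TP stays comfortably inside it.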
Workarounds
Future Fix

Chunked prefill (breaking prefill into smaller operations that complete within Metal's timeout) would solve this, but it's not yet implemented in MLX.

Revised Recommendations

Given these findings, the PP vs TP decision depends heavily on your context length requirements:
Key takeaway: For workloads with prompts >1,500 tokens, use TP4 instead of PP5/PP4. The TP4 load time penalty (78s vs 5s) is worth it for longer context support and avoiding the risk of node instability from repeated GPU timeouts.
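The chunked-prefill fix mentioned above can be sketched as follows. Nothing here is MLX API: `process_chunk` is a hypothetical stand-in for one forward pass that extends a KV cache, and the point is simply that each call is a short GPU dispatch, so no single Metal command buffer approaches the ~60 s timeout.

```python
def chunked_prefill(tokens, process_chunk, chunk_size=512):
    """Feed the prompt to the model chunk_size tokens at a time."""
    kv_cache = None
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # each chunk attends to the KV cache built by earlier chunks,
        # so the result is equivalent to one monolithic prefill
        kv_cache = process_chunk(chunk, kv_cache)
    return kv_cache
```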
Kernel Panic Root Cause Analysis

Analysis of 21 kernel panic reports on M3 Ultra Mac Studio revealed:
The crashes are caused by the TB5 RDMA kernel driver. This driver is in very early development and has a null pointer dereference bug that triggers when:
The RDMA Resource Leak Hypothesis

Additionally, RDMA queue pairs and memory registrations appear not to be fully released between inference runs:
Mitigation Options
Bottom line: The TB5 RDMA stack is bleeding-edge (version 0.0.1). These crashes are likely kernel driver bugs. Expect instability until Apple matures this driver.
Preliminary Findings: TB5 Full-Mesh Cluster with MLX Distributed RDMA/JACCL
Hi guys, I thought I would share some preliminary findings in case they're useful, or in case you can spot something I'm doing fundamentally wrong.
Setup
I have built a five-node Thunderbolt 5 (TB5) full-mesh cluster (10× TB5 cables) of M3 Ultra Mac Studios with 512GB RAM each, and set up `mlx.distributed` to explore the cool new RDMA/JACCL features in OS 26.2. I'm purposefully NOT using Exo, but building this from MLX/ml-explore code.

What I'm Testing
I'm now benchmarking different models in variations of mesh/ring for pipeline and tensor parallelism to see if RDMA/JACCL makes a big difference. It does, but not as much as I thought it would.
The models I'm testing on are massive (Kimi-K2-Thinking ~1T), so splitting over two or more 512GB nodes is required at the original quantisation. The catch is that for RDMA/JACCL to work, the model must split evenly across the nodes, or it won't load. Kimi-K2-Thinking (1T MLX Q4) can be split across 2, 3, or 4 nodes in Tensor Parallelism (TP) with RDMA/JACCL. It can happily be split into five nodes for Pipeline Parallelism (PP), though, and run as a ring on 10GbE (slow) or TB5 (much faster).
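The even-split constraint amounts to a simple preflight check. Which dimension must divide cleanly (layers for PP, attention heads or hidden size for TP) depends on the sharding scheme, and the numbers below are illustrative assumptions rather than Kimi-K2's actual config:

```python
def tp_splits_evenly(dim, n_nodes):
    """TP shards a tensor dimension across nodes; it must divide cleanly."""
    return dim % n_nodes == 0

# hypothetical sharded dimension of 12 units: fits 2, 3, or 4 nodes
# but not 5, matching the pattern reported above for TP
print([n for n in (2, 3, 4, 5) if tp_splits_evenly(12, n)])   # [2, 3, 4]
```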
Surprising Discovery
I was surprised by the results, as I expected TP RDMA/JACCL would be 3× faster than a PP ring due to TCP/IP overhead. However, I was operating under the fundamentally mistaken assumption that TCP/IP would be required for the TB5 ring network. It isn't!
RDMA works in ring mode too, so there is no TCP/IP overhead for PP if you set up the TB5 network correctly using `mlx.distributed_config`. The whole "bubble" delay in PP still happens, but the RDMA speeds are so fast, it doesn't seem to impact TPS dramatically.

Kimi-K2-Thinking: PP/TP Benchmark Results
All Configurations Tested
PP4 vs TP4 Head-to-Head
Analysis
Throughput is nearly identical — Only 2.3% difference. RDMA makes the all-reduce communication in TP nearly free, so the theoretical advantage of TP (more parallelism per token) barely materializes.
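To see why ring-topology RDMA makes the all-reduce cheap, here is a toy pure-Python simulation of the standard ring all-reduce (reduce-scatter, then all-gather) that TP-style collectives typically use: with N nodes, each link carries only 2×(N−1) chunk-sized messages per reduction. This illustrates the algorithm in general, not MLX's actual implementation.

```python
def ring_all_reduce(node_data):
    """Toy simulation: each node contributes a vector; all end with the sum."""
    n = len(node_data)
    c = len(node_data[0]) // n                     # chunk length (assumes n divides size)
    buf = [[list(v[i*c:(i+1)*c]) for i in range(n)] for v in node_data]
    # reduce-scatter: n-1 steps; node i forwards chunk (i - s) % n to its neighbour,
    # which adds it into its local copy
    for s in range(n - 1):
        sends = [((i + 1) % n, (i - s) % n, buf[i][(i - s) % n]) for i in range(n)]
        for dst, ch, data in sends:
            buf[dst][ch] = [a + b for a, b in zip(buf[dst][ch], data)]
    # all-gather: n-1 more steps circulate the fully reduced chunks around the ring
    for s in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - s) % n, buf[i][(i + 1 - s) % n]) for i in range(n)]
        for dst, ch, data in sends:
            buf[dst][ch] = list(data)
    return [[x for chunk in b for x in chunk] for b in buf]

# four nodes, node i holds [i, i, i, i]; every node should end with [6, 6, 6, 6]
print(ring_all_reduce([[i] * 4 for i in range(4)]))
```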
PP loads much faster — PP only loads the layers assigned to each rank (61 layers ÷ 4 = ~15 layers/node). TP loads all weights then shards them at runtime.
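The layer-to-rank split described here can be sketched as follows (61 layers and 4 ranks come from the post; the remainder-goes-to-early-ranks policy is an assumption about how the uneven split is assigned):

```python
def pp_layer_ranges(n_layers, n_ranks):
    """Contiguous [start, end) layer ranges, one per pipeline rank."""
    base, extra = divmod(n_layers, n_ranks)
    ranges, start = [], 0
    for r in range(n_ranks):
        end = start + base + (1 if r < extra else 0)  # early ranks absorb the remainder
        ranges.append((start, end))
        start = end
    return ranges

print(pp_layer_ranges(61, 4))   # [(0, 16), (16, 31), (31, 46), (46, 61)]
```

Each rank then only reads its own ~15-16 layers from disk, which is where the load-time advantage over TP comes from.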
PP uses less memory — Each node only holds its pipeline stage. TP holds a slice of every layer, which has more overhead from the sharding metadata.
Recommendations
For 4-node Kimi-K2:
For production inference with model caching, PP4 is likely the better choice since the 2.3% throughput loss is outweighed by the massive load time and memory benefits.
Limitations & Next Steps
These initial benchmarks were run with batch size = 1 and a fixed context length, which may not reveal TP's full potential for throughput-optimized serving scenarios.
Upcoming tests will explore:
Call for Feedback
Either way, if anyone can see an issue with my thinking or the benchmarks I'm getting, please let me know. I am going to set up a series of automated benchmarks to try various combinations of ring, mesh, and input token size on a wide variety of models to see what comes out. Massive amounts of RAM on a cluster node are largely wasted, it seems, even for these massive models.
The Pain of Getting This Working
Getting this working using `mlx.distributed` and a custom MLX build was a nightmare—not because of MLX, but because of the numerous hacks required to get OS 26.2 to stop:

- creating `bridge0` automatically
- `enX` dynamic interface naming

I now have a collection of `dodgy-TB5-mesh-hack.sh` scripts I'll release on GitHub if anyone wants to recreate a crazy full-TB5-mesh AI cluster.

Acknowledgments
The lesson here was the Exo guys did a great job at solving a lot of these issues. They even invented an inter-node P2P all-reduce solution to get around some of the problems I ran into.
However, if you want to run official Apple MLX code on an AI mesh cluster, it is possible. Until better TB5/RDMA/JACCL documentation is available and a bunch of OS patches come out, my `dodgy-TB5-mesh-hack.sh` scripts will have to do.