TB5 RDMA Benchmarks: Pipeline Parallelism Nearly Matches Tensor Parallelism on Kimi-K2 (1T) #2990
Update: Context Length Limitations Discovered

During further testing, I discovered a significant limitation with distributed prefill that dramatically restricts usable context length for pipeline parallelism.

The Problem

When testing longer prompts, I hit Metal GPU timeout errors.

Context Length Benchmark Results
Key Findings
Why This Happens

Pipeline parallelism: each stage must complete prefill for all tokens sequentially before passing activations to the next stage. With 5 stages, the cumulative time exceeds Metal's ~60 second command buffer timeout.

Why TP handles longer contexts: tensor parallelism processes prefill in parallel across all nodes simultaneously. All nodes work on the same tokens at once, avoiding the per-stage timeout accumulation.

Known Issue

This is acknowledged by the MLX maintainers:
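A back-of-envelope model of that accumulation (the per-stage prefill throughput of 150 tok/s is an assumed, illustrative number, not a measurement from these benchmarks):

```python
METAL_TIMEOUT_S = 60.0                       # approximate Metal command-buffer limit

def pp_prefill_seconds(tokens, stages, stage_tok_per_s=150.0):
    """PP: stages prefill sequentially, so per-stage times accumulate."""
    return stages * (tokens / stage_tok_per_s)

def tp_prefill_seconds(tokens, tok_per_s=150.0):
    """TP: all nodes work on the same tokens at once; no accumulation."""
    return tokens / tok_per_s

for n in (500, 1500, 4000):
    print(n, pp_prefill_seconds(n, stages=5), tp_prefill_seconds(n))
```

Under these assumed numbers, a 4,000-token prompt through 5 PP stages blows well past the 60 s budget, while the same prompt under TP stays comfortably inside it.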
Workarounds
Future Fix

Chunked prefill (breaking prefill into smaller operations that complete within Metal's timeout) would solve this, but it's not yet implemented in MLX.

Revised Recommendations

Given these findings, the PP vs TP decision depends heavily on your context length requirements:
Key takeaway: For workloads with prompts >1,500 tokens, use TP4 instead of PP5/PP4. The TP4 load time penalty (78s vs 5s) is worth it for longer context support and avoiding the risk of node instability from repeated GPU timeouts.
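The chunked-prefill fix mentioned above can be sketched as follows. Nothing here is MLX API: `process_chunk` is a hypothetical stand-in for one forward pass that extends a KV cache, and the point is simply that each call is a short GPU dispatch, so no single Metal command buffer approaches the ~60 s timeout.

```python
def chunked_prefill(tokens, process_chunk, chunk_size=512):
    """Feed the prompt to the model chunk_size tokens at a time."""
    kv_cache = None
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # each chunk attends to the KV cache built by earlier chunks,
        # so the result is equivalent to one monolithic prefill
        kv_cache = process_chunk(chunk, kv_cache)
    return kv_cache
```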
Kernel Panic Root Cause Analysis

Analysis of 21 kernel panic reports on M3 Ultra Mac Studio revealed:
The crashes are caused by the TB5 RDMA kernel driver. This driver is in very early development and has a null pointer dereference bug that triggers when:
The RDMA Resource Leak Hypothesis

Additionally, RDMA queue pairs and memory registrations appear not to be fully released between inference runs:
Mitigation Options
Bottom line: The TB5 RDMA stack is bleeding-edge (version 0.0.1). These crashes are likely kernel driver bugs. Expect instability until Apple matures this driver.
Preliminary Findings: TB5 Full-Mesh Cluster with MLX Distributed RDMA/JACCL
Hi guys, I thought I would share some preliminary findings in case they're useful, or in case you can spot something I'm doing fundamentally wrong.
Setup
I have built a five-node Thunderbolt 5 (TB5) full-mesh cluster (10× TB5 cables) of M3 Ultra Mac Studios with 512GB RAM each, and set up `mlx.distributed` to explore the cool new RDMA/JACCL features in OS 26.2. I'm purposefully NOT using Exo, but building this from MLX/ml-explore code.

What I'm Testing
I'm now benchmarking different models in variations of mesh/ring for pipeline and tensor parallelism to see if RDMA/JACCL makes a big difference. It does, but not as much as I thought it would.
The models I'm testing on are massive (Kimi-K2-Thinking ~1T), so splitting over two or more 512GB nodes is required at the original quantisation. The catch is that for RDMA/JACCL to work, the model must split evenly across the nodes, or it won't load. Kimi-K2-Thinking (1T MLX Q4) can be split across 2, 3, or 4 nodes in Tensor Parallelism (TP) with RDMA/JACCL. It can happily be split into five nodes for Pipeline Parallelism (PP), though, and run as a ring on 10GbE (slow) or TB5 (much faster).
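The even-split constraint amounts to a simple preflight check. Which dimension must divide cleanly (layers for PP, attention heads or hidden size for TP) depends on the sharding scheme, and the numbers below are illustrative assumptions rather than Kimi-K2's actual config:

```python
def tp_splits_evenly(dim, n_nodes):
    """TP shards a tensor dimension across nodes; it must divide cleanly."""
    return dim % n_nodes == 0

# hypothetical sharded dimension of 12 units: fits 2, 3, or 4 nodes
# but not 5, matching the pattern reported above for TP
print([n for n in (2, 3, 4, 5) if tp_splits_evenly(12, n)])   # [2, 3, 4]
```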
Surprising Discovery
I was surprised by the results, as I expected TP RDMA/JACCL would be 3× faster than a PP ring due to TCP/IP overhead. However, I was operating under the fundamentally mistaken assumption that TCP/IP would be required for the TB5 ring network. It isn't!
RDMA works in ring mode too, so there is no TCP/IP overhead for PP if you set up the TB5 network correctly using `mlx.distributed_config`. The whole "bubble" delay in PP still happens, but the RDMA speeds are so fast, it doesn't seem to impact TPS dramatically.

Kimi-K2-Thinking: PP/TP Benchmark Results
All Configurations Tested
PP4 vs TP4 Head-to-Head
Analysis
Throughput is nearly identical — Only 2.3% difference. RDMA makes the all-reduce communication in TP nearly free, so the theoretical advantage of TP (more parallelism per token) barely materializes.
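To see why ring-topology RDMA makes the all-reduce cheap, here is a toy pure-Python simulation of the standard ring all-reduce (reduce-scatter, then all-gather) that TP-style collectives typically use: with N nodes, each link carries only 2×(N−1) chunk-sized messages per reduction. This illustrates the algorithm in general, not MLX's actual implementation.

```python
def ring_all_reduce(node_data):
    """Toy simulation: each node contributes a vector; all end with the sum."""
    n = len(node_data)
    c = len(node_data[0]) // n                     # chunk length (assumes n divides size)
    buf = [[list(v[i*c:(i+1)*c]) for i in range(n)] for v in node_data]
    # reduce-scatter: n-1 steps; node i forwards chunk (i - s) % n to its neighbour,
    # which adds it into its local copy
    for s in range(n - 1):
        sends = [((i + 1) % n, (i - s) % n, buf[i][(i - s) % n]) for i in range(n)]
        for dst, ch, data in sends:
            buf[dst][ch] = [a + b for a, b in zip(buf[dst][ch], data)]
    # all-gather: n-1 more steps circulate the fully reduced chunks around the ring
    for s in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - s) % n, buf[i][(i + 1 - s) % n]) for i in range(n)]
        for dst, ch, data in sends:
            buf[dst][ch] = list(data)
    return [[x for chunk in b for x in chunk] for b in buf]

# four nodes, node i holds [i, i, i, i]; every node should end with [6, 6, 6, 6]
print(ring_all_reduce([[i] * 4 for i in range(4)]))
```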
PP loads much faster — PP only loads the layers assigned to each rank (61 layers ÷ 4 = ~15 layers/node). TP loads all weights then shards them at runtime.
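The layer-to-rank split described here can be sketched as follows (61 layers and 4 ranks come from the post; the remainder-goes-to-early-ranks policy is an assumption about how the uneven split is assigned):

```python
def pp_layer_ranges(n_layers, n_ranks):
    """Contiguous [start, end) layer ranges, one per pipeline rank."""
    base, extra = divmod(n_layers, n_ranks)
    ranges, start = [], 0
    for r in range(n_ranks):
        end = start + base + (1 if r < extra else 0)  # early ranks absorb the remainder
        ranges.append((start, end))
        start = end
    return ranges

print(pp_layer_ranges(61, 4))   # [(0, 16), (16, 31), (31, 46), (46, 61)]
```

Each rank then only reads its own ~15-16 layers from disk, which is where the load-time advantage over TP comes from.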
PP uses less memory — Each node only holds its pipeline stage. TP holds a slice of every layer, which has more overhead from the sharding metadata.
Recommendations
For 4-node Kimi-K2:
For production inference with model caching, PP4 is likely the better choice since the 2.3% throughput loss is outweighed by the massive load time and memory benefits.
Limitations & Next Steps
These initial benchmarks were run with batch size = 1 and a fixed context length, which may not reveal TP's full potential for throughput-optimized serving scenarios.
Upcoming tests will explore:
Call for Feedback
Either way, if anyone can see an issue with my thinking or the benchmarks I'm getting, please let me know. I am going to set up a series of automated benchmarks to try various combinations of ring, mesh, and input token size on a wide variety of models to see what comes out. Massive amounts of RAM on a cluster node are largely wasted, it seems, even for these massive models.
The Pain of Getting This Working
Getting this working using `mlx.distributed` and a custom MLX build was a nightmare—not because of MLX, but because of the numerous hacks required to get OS 26.2 to stop:

- creating `bridge0` automatically
- `enX` dynamic interface naming

I now have a collection of `dodgy-TB5-mesh-hack.sh` scripts I'll release on GitHub if anyone wants to recreate a crazy full-TB5-mesh AI cluster.

Acknowledgments
The lesson here was the Exo guys did a great job at solving a lot of these issues. They even invented an inter-node P2P all-reduce solution to get around some of the problems I ran into.
However, if you want to run official Apple MLX code on an AI mesh cluster, it is possible. Until better TB5/RDMA/JACCL documentation is available and a bunch of OS patches come out, my `dodgy-TB5-mesh-hack.sh` scripts will have to do.