Summary
MoE models (Qwen3.5-35B-A3B) generate at 11.7 T/s in mlx-swift-lm vs 85 T/s (MLX-internal) in Python mlx-lm — a 7.3x slowdown. Dense models (Qwen3-8B) show only a 1.33x gap (52.8 vs 70.2 T/s), confirming the issue is MoE-specific.
Environment
- Hardware: Apple M4 Max, 96 GB unified memory
- mlx-swift: 0.30.6
- mlx-swift-lm: commit 11968af (latest HEAD)
- Python mlx-lm: 0.30.7, mlx 0.31.0
- Model: NexVeridian/Qwen3.5-35B-A3B-4bit (MoE, 35B total / 3B active per token)
Reproduction
Swift (11.7 T/s)
let config = ModelConfiguration(id: "NexVeridian/Qwen3.5-35B-A3B-4bit")
let container = try await LLMModelFactory.shared.loadContainer(configuration: config)

let input = UserInput(chat: [
    .system("/no_think\nYou are a helpful assistant."),
    .user("What is the weather like today?"),
])

let lmInput = try await container.prepare(input: input)
let params = GenerateParameters(maxTokens: 128, temperature: 0.7)
let stream = try await container.generate(input: lmInput, parameters: params)

for await generation in stream {
    // info.tokensPerSecond reports ~11.7 T/s
}
Python (85 T/s)
from mlx_lm import load, generate

model, tokenizer = load("NexVeridian/Qwen3.5-35B-A3B-4bit")
output = generate(model, tokenizer, prompt=prompt_text, max_tokens=128, verbose=True)
# Verbose output shows: Generation: 128 tokens, 85.094 tokens-per-sec
Verified measurements
| Model | Type | Python mlx-lm (internal T/s) | Swift mlx-swift-lm (internal T/s) | Ratio |
|---|---|---|---|---|
| Qwen3-8B | Dense | 70.2 | 52.8 | 1.33x |
| Qwen3.5-35B-A3B | MoE | 85.1 | 11.7 | 7.3x |
Python numbers verified across 3 consistent runs (72 T/s wall-time, 85 T/s MLX-internal).
Root cause analysis
We traced the bottleneck to three patterns in the Swift generation loop:
1. Global evalLock serializes all GPU operations
Transforms+Eval.swift:9 — every eval() and asyncEval() acquires a global NSRecursiveLock:
let evalLock = NSRecursiveLock()

public func asyncEval(_ arrays: some Collection<MLXArray>) {
    let vector_array = new_mlx_vector_array(arrays)
    _ = evalLock.withLock {
        mlx_async_eval(vector_array)
    }
    mlx_vector_array_free(vector_array)
}
MoE models perform many more intermediate operations per token (expert gating, routing, multiple expert MLPs, output merging), each acquiring this lock. Dense models do far fewer operations per token, so the lock overhead is proportionally smaller.
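The effect can be sketched with a toy cost model. Every constant below is an illustrative assumption (the op counts come from the rough estimates in this report; the microsecond figures are invented), but it shows why a fixed per-operation cost such as a lock acquisition plus an individual kernel dispatch hurts MoE models much more:

```python
# Toy cost model, not a benchmark. A fixed per-operation cost (lock
# acquisition plus individual kernel dispatch) is amortized over far fewer
# operations per token in a dense model than in an MoE model.
# All constants are illustrative assumptions, not measurements.

DENSE_OPS = 9   # assumed ~8-10 major ops per token (dense)
MOE_OPS = 35    # assumed ~30-40 ops per token (gate + experts + merge)

def overhead_share(ops: int, per_op_us: float = 50.0,
                   compute_us: float = 10_000.0) -> float:
    """Fraction of per-token wall time spent on fixed per-op overhead."""
    return ops * per_op_us / (compute_us + ops * per_op_us)

print(f"dense overhead share: {overhead_share(DENSE_OPS):.1%}")
print(f"MoE overhead share:   {overhead_share(MOE_OPS):.1%}")
```

With these made-up numbers the MoE model spends roughly 3-4x more of its wall time on fixed scheduling overhead; contention on a single lock would compound this further.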
2. Per-token GPU→CPU sync in TokenIterator.next()
Evaluate.swift:451 — extracting the integer token value forces synchronization:
return previousY.tokens.item(Int.self) // Forces GPU→CPU sync every token
Python's generate_step uses mx.async_eval() to pipeline the next token computation while the current token is being extracted, avoiding this serial bottleneck.
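The pipelining pattern can be illustrated without MLX at all. In this pure-Python sketch, `forward_step` and `detokenize` are hypothetical stand-ins for the async forward pass and the per-token CPU-side work; the key point is that step t+1 is submitted before the blocking work for step t, so the two overlap:

```python
# Toy sketch of the pipelined decode loop: kick off the computation for
# step t+1 before doing the blocking per-token work for step t, so they
# overlap instead of running serially. forward_step and detokenize are
# hypothetical stand-ins, not real MLX APIs.
import time
from concurrent.futures import ThreadPoolExecutor

def forward_step(token: int) -> int:
    time.sleep(0.01)           # pretend asynchronous GPU forward pass
    return token + 1

def detokenize(token: int) -> str:
    time.sleep(0.005)          # pretend per-token CPU-side work
    return f"tok{token}"

def generate_pipelined(steps: int = 5) -> list:
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(forward_step, 0)      # step 0 in flight
        for _ in range(steps):
            y = future.result()                    # sync point for step t
            future = pool.submit(forward_step, y)  # step t+1 already running...
            out.append(detokenize(y))              # ...while we handle step t
    return out

print(generate_pipelined())  # -> ['tok1', 'tok2', 'tok3', 'tok4', 'tok5']
```

The Swift loop does the equivalent of calling `future.result()` and only then scheduling the next forward pass, which serializes the two costs.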
3. No stream context for kernel fusion
Python wraps generation in mx.stream() contexts that enable implicit kernel fusion across operations. Swift has no equivalent — each operation is scheduled individually, preventing the MLX compiler from fusing MoE routing operations.
Why MoE is disproportionately affected
- Dense model: ~8-10 major operations per forward pass per token
- MoE model: ~30-40 operations per forward pass per token (expert gate + N expert MLPs + merge)
- Each operation acquires the evalLock and triggers individual GPU scheduling
- The 7.3x gap is on the same order as the ~4x ratio of operations per token (MoE vs dense), with lock contention likely compounding it
Possible fixes
- Remove or weaken evalLock — if mlx_async_eval is thread-safe at the C level, the Swift lock may be unnecessary
- Add stream context support — equivalent to Python's mx.stream() for kernel fusion
- Pipeline token extraction — schedule the next token's computation before extracting the current token's integer value
- Batch expert operations — reduce the number of individual eval calls per MoE forward pass
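As a rough illustration of the last fix, here is a pure-Python sketch (no MLX; `route` and `apply_expert` are hypothetical stand-ins) of grouping tokens by their routed expert so each expert is invoked once per batch rather than once per token:

```python
# Sketch of "batch expert operations": gather token indices by routed
# expert, issue one call per expert for the whole group, then scatter the
# results back to original order. route/apply_expert are hypothetical.
from collections import defaultdict

NUM_EXPERTS = 4

def route(token: int) -> int:
    """Hypothetical top-1 router."""
    return token % NUM_EXPERTS

def apply_expert(expert_id: int, tokens: list) -> list:
    """One fused call per expert; here it just tags tokens with the expert."""
    return [(expert_id, t) for t in tokens]

def batched_moe(tokens: list) -> list:
    groups = defaultdict(list)
    for t in tokens:
        groups[route(t)].append(t)       # gather by expert
    out = {}
    for expert_id, toks in groups.items():
        # One call per expert instead of one call per token:
        for e, t in apply_expert(expert_id, toks):
            out[t] = e
    return [out[t] for t in tokens]      # scatter back to original order

print(batched_moe(list(range(8))))  # -> [0, 1, 2, 3, 0, 1, 2, 3]
```

With top-1 routing this turns O(tokens) eval calls into O(experts) calls per forward pass; the same gather/compute/scatter shape applies to top-k routing with a weighted merge.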
Impact
We use mlx-swift-lm as the inference backend for Fae, a macOS voice assistant. The MoE slowdown prevents us from using Qwen3.5-35B-A3B at its full potential on 64+ GB Apple Silicon machines. At Python-equivalent speeds (~85 T/s), this model would be excellent for real-time voice interaction.