
MoE models 7.3x slower than Python mlx-lm (per-token sync + evalLock) #124

@dirvine

Description


Summary

MoE models (Qwen3.5-35B-A3B) generate at 11.7 T/s in mlx-swift-lm vs 85 T/s (MLX-internal) in Python mlx-lm — a 7.3x slowdown. Dense models (Qwen3-8B) show only a 1.33x gap (52.8 vs 70.2 T/s), confirming the issue is MoE-specific.

Environment

  • Hardware: Apple M4 Max, 96 GB unified memory
  • mlx-swift: 0.30.6
  • mlx-swift-lm: commit 11968af (latest HEAD)
  • Python mlx-lm: 0.30.7, mlx 0.31.0
  • Model: NexVeridian/Qwen3.5-35B-A3B-4bit (MoE, 35B total / 3B active per token)

Reproduction

Swift (11.7 T/s)

let config = ModelConfiguration(id: "NexVeridian/Qwen3.5-35B-A3B-4bit")
let container = try await LLMModelFactory.shared.loadContainer(configuration: config)
let input = UserInput(chat: [
    .system("/no_think\nYou are a helpful assistant."),
    .user("What is the weather like today?"),
])
let lmInput = try await container.prepare(input: input)
let params = GenerateParameters(maxTokens: 128, temperature: 0.7)
let stream = try await container.generate(input: lmInput, parameters: params)
for await generation in stream {
    // info.tokensPerSecond reports ~11.7 T/s
}

Python (85 T/s)

from mlx_lm import load, generate

model, tokenizer = load("NexVeridian/Qwen3.5-35B-A3B-4bit")
output = generate(model, tokenizer, prompt=prompt_text, max_tokens=128, verbose=True)
# Verbose output shows: Generation: 128 tokens, 85.094 tokens-per-sec

Verified measurements

| Model | Type | Python mlx-lm (internal T/s) | Swift mlx-swift-lm (internal T/s) | Ratio |
|---|---|---|---|---|
| Qwen3-8B | Dense | 70.2 | 52.8 | 1.33x |
| Qwen3.5-35B-A3B | MoE | 85.1 | 11.7 | 7.3x |

Python numbers were verified across three consistent runs (~72 T/s wall-clock, ~85 T/s MLX-internal).
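For reference, the ratio column follows directly from the measured T/s figures (values copied from the table above):

```python
# Recompute the slowdown ratios from the measured tokens-per-second figures.
python_tps = {"Qwen3-8B": 70.2, "Qwen3.5-35B-A3B": 85.1}
swift_tps  = {"Qwen3-8B": 52.8, "Qwen3.5-35B-A3B": 11.7}

ratios = {m: python_tps[m] / swift_tps[m] for m in python_tps}
# ratios["Qwen3-8B"] -> ~1.33x, ratios["Qwen3.5-35B-A3B"] -> ~7.3x
```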

Root cause analysis

We traced the bottleneck to three patterns in the Swift generation loop:

1. Global evalLock serializes all GPU operations

Transforms+Eval.swift:9 — every eval() and asyncEval() acquires a global NSRecursiveLock:

let evalLock = NSRecursiveLock()

public func asyncEval(_ arrays: some Collection<MLXArray>) {
    let vector_array = new_mlx_vector_array(arrays)
    _ = evalLock.withLock {
        mlx_async_eval(vector_array)
    }
    mlx_vector_array_free(vector_array)
}

MoE models perform many more intermediate operations per token (expert gating, routing, multiple expert MLPs, output merging), each acquiring this lock. Dense models do far fewer operations per token, so the lock overhead is proportionally smaller.
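To illustrate the scaling argument (not the absolute cost), here is a stdlib-only Python micro-benchmark; the op counts are the rough estimates used later in this report, and a `threading.Lock` stands in for the Swift `evalLock`:

```python
# Illustrative only: when a global lock is acquired once per scheduled op,
# total scheduling overhead grows linearly with the op count, so a model
# issuing ~4x more ops per token pays ~4x more of this fixed cost.
import threading
import timeit

lock = threading.Lock()

def schedule_ops(n_ops: int) -> None:
    """Stand-in for a forward pass dispatching n_ops operations,
    each taking the global lock (mirroring evalLock.withLock)."""
    for _ in range(n_ops):
        with lock:
            pass  # the actual mlx_async_eval dispatch would go here

# ~10 ops/token for a dense model, ~40 for an MoE model (estimates, see below)
t_dense = timeit.timeit(lambda: schedule_ops(10), number=10_000)
t_moe = timeit.timeit(lambda: schedule_ops(40), number=10_000)
# t_moe / t_dense grows roughly in proportion to the op-count ratio
```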

2. Per-token GPU→CPU sync in TokenIterator.next()

Evaluate.swift:451 — extracting the integer token value forces synchronization:

return previousY.tokens.item(Int.self)  // Forces GPU→CPU sync every token

Python's generate_step uses mx.async_eval() to pipeline the next token computation while the current token is being extracted, avoiding this serial bottleneck.
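The pattern can be sketched without MLX: schedule step N+1's compute before blocking on step N's extraction. The functions and timings below are mock stand-ins, not mlx-lm's actual code:

```python
# Sketch of the double-buffering trick mlx-lm's generate_step relies on:
# kick off the *next* token's compute before blocking on the *current*
# token's GPU->CPU extraction. All functions are illustrative mocks.
import time
from concurrent.futures import ThreadPoolExecutor

def forward_pass(state: int) -> int:
    time.sleep(0.01)   # stands in for the async GPU forward pass
    return state + 1

def item(token: int) -> int:
    time.sleep(0.01)   # stands in for the blocking .item() sync
    return token

def generate_serial(n: int) -> list:
    state, out = 0, []
    for _ in range(n):
        state = forward_pass(state)
        out.append(item(state))        # compute and sync strictly alternate
    return out

def generate_pipelined(n: int) -> list:
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(forward_pass, 0)
        for _ in range(n):
            state = fut.result()
            fut = pool.submit(forward_pass, state)  # like mx.async_eval: start step N+1
            out.append(item(state))                 # sync-extract step N meanwhile
    return out
```

Because the mock sleeps release the GIL, the pipelined version overlaps the two phases and produces identical output in roughly half the wall time; mlx-swift-lm could apply the same shape by scheduling step N+1 via asyncEval before calling previousY.tokens.item(Int.self).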

3. No stream context for kernel fusion

Python wraps generation in mx.stream() contexts that enable implicit kernel fusion across operations. Swift has no equivalent — each operation is scheduled individually, preventing the MLX compiler from fusing MoE routing operations.

Why MoE is disproportionately affected

  • Dense model: ~8-10 major operations per forward pass per token
  • MoE model: ~30-40 operations per forward pass (expert gate + N expert MLPs + merge)
  • Each operation hits the evalLock and triggers individual GPU scheduling
  • The ~4x higher op count, compounded by the per-token sync and lock acquisition, plausibly accounts for the 7.3x gap

Possible fixes

  1. Remove or weaken evalLock — if mlx_async_eval is thread-safe at the C level, the Swift lock may be unnecessary
  2. Add stream context support — equivalent to Python's mx.stream() for kernel fusion
  3. Pipeline token extraction — schedule next token computation before extracting current token integer value
  4. Batch expert operations — reduce the number of individual eval calls per MoE forward pass

Impact

We use mlx-swift-lm as the inference backend for Fae, a macOS voice assistant. The MoE slowdown prevents us from using Qwen3.5-35B-A3B at its full potential on 64+ GB Apple Silicon machines. At Python-equivalent speeds (~85 T/s), this model would be excellent for real-time voice interaction.
