
MoE models 7.3x slower than Python mlx-lm (per-token sync + evalLock) #124

@dirvine

Description


Summary

MoE models (Qwen3.5-35B-A3B) generate at 11.7 T/s in mlx-swift-lm vs 85 T/s (MLX-internal) in Python mlx-lm — a 7.3x slowdown. Dense models (Qwen3-8B) show only a 1.33x gap (52.8 vs 70.2 T/s), confirming the issue is MoE-specific.

Environment

  • Hardware: Apple M4 Max, 96 GB unified memory
  • mlx-swift: 0.30.6
  • mlx-swift-lm: commit 11968af (latest HEAD)
  • Python mlx-lm: 0.30.7, mlx 0.31.0
  • Model: NexVeridian/Qwen3.5-35B-A3B-4bit (MoE, 35B total / 3B active per token)

Reproduction

Swift (11.7 T/s)

let config = ModelConfiguration(id: "NexVeridian/Qwen3.5-35B-A3B-4bit")
let container = try await LLMModelFactory.shared.loadContainer(configuration: config)
let input = UserInput(chat: [
    .system("/no_think\nYou are a helpful assistant."),
    .user("What is the weather like today?"),
])
let lmInput = try await container.prepare(input: input)
let params = GenerateParameters(maxTokens: 128, temperature: 0.7)
let stream = try await container.generate(input: lmInput, parameters: params)
for await generation in stream {
    // info.tokensPerSecond reports ~11.7 T/s
}

Python (85 T/s)

from mlx_lm import load, generate

model, tokenizer = load("NexVeridian/Qwen3.5-35B-A3B-4bit")
output = generate(model, tokenizer, prompt=prompt_text, max_tokens=128, verbose=True)
# Verbose output shows: Generation: 128 tokens, 85.094 tokens-per-sec

Verified measurements

| Model | Type | Python mlx-lm (internal T/s) | Swift mlx-swift-lm (internal T/s) | Ratio |
|---|---|---|---|---|
| Qwen3-8B | Dense | 70.2 | 52.8 | 1.33x |
| Qwen3.5-35B-A3B | MoE | 85.1 | 11.7 | 7.3x |

Python numbers were verified across three consistent runs (~72 T/s wall-clock, ~85 T/s MLX-internal).
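For reference, the ratio column follows directly from the measured T/s figures (values copied from the table above):

```python
# Recompute the slowdown ratios from the measured tokens-per-second figures.
python_tps = {"Qwen3-8B": 70.2, "Qwen3.5-35B-A3B": 85.1}
swift_tps  = {"Qwen3-8B": 52.8, "Qwen3.5-35B-A3B": 11.7}

ratios = {m: python_tps[m] / swift_tps[m] for m in python_tps}
# ratios["Qwen3-8B"] -> ~1.33x, ratios["Qwen3.5-35B-A3B"] -> ~7.3x
```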

Root cause analysis

We traced the bottleneck to three patterns in the Swift generation loop:

1. Global evalLock serializes all GPU operations

Transforms+Eval.swift:9 — every eval() and asyncEval() acquires a global NSRecursiveLock:

let evalLock = NSRecursiveLock()

public func asyncEval(_ arrays: some Collection<MLXArray>) {
    let vector_array = new_mlx_vector_array(arrays)
    _ = evalLock.withLock {
        mlx_async_eval(vector_array)
    }
    mlx_vector_array_free(vector_array)
}

MoE models perform many more intermediate operations per token (expert gating, routing, multiple expert MLPs, output merging), each acquiring this lock. Dense models do far fewer operations per token, so the lock overhead is proportionally smaller.
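To illustrate the scaling argument (not the absolute cost), here is a stdlib-only Python micro-benchmark; the op counts are the rough estimates used later in this report, and a `threading.Lock` stands in for the Swift `evalLock`:

```python
# Illustrative only: when a global lock is acquired once per scheduled op,
# total scheduling overhead grows linearly with the op count, so a model
# issuing ~4x more ops per token pays ~4x more of this fixed cost.
import threading
import timeit

lock = threading.Lock()

def schedule_ops(n_ops: int) -> None:
    """Stand-in for a forward pass dispatching n_ops operations,
    each taking the global lock (mirroring evalLock.withLock)."""
    for _ in range(n_ops):
        with lock:
            pass  # the actual mlx_async_eval dispatch would go here

# ~10 ops/token for a dense model, ~40 for an MoE model (estimates, see below)
t_dense = timeit.timeit(lambda: schedule_ops(10), number=10_000)
t_moe = timeit.timeit(lambda: schedule_ops(40), number=10_000)
# t_moe / t_dense grows roughly in proportion to the op-count ratio
```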

2. Per-token GPU→CPU sync in TokenIterator.next()

Evaluate.swift:451 — extracting the integer token value forces synchronization:

return previousY.tokens.item(Int.self)  // Forces GPU→CPU sync every token

Python's generate_step uses mx.async_eval() to pipeline the next token computation while the current token is being extracted, avoiding this serial bottleneck.
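The pattern can be sketched without MLX: schedule step N+1's compute before blocking on step N's extraction. The functions and timings below are mock stand-ins, not mlx-lm's actual code:

```python
# Sketch of the double-buffering trick mlx-lm's generate_step relies on:
# kick off the *next* token's compute before blocking on the *current*
# token's GPU->CPU extraction. All functions are illustrative mocks.
import time
from concurrent.futures import ThreadPoolExecutor

def forward_pass(state: int) -> int:
    time.sleep(0.01)   # stands in for the async GPU forward pass
    return state + 1

def item(token: int) -> int:
    time.sleep(0.01)   # stands in for the blocking .item() sync
    return token

def generate_serial(n: int) -> list:
    state, out = 0, []
    for _ in range(n):
        state = forward_pass(state)
        out.append(item(state))        # compute and sync strictly alternate
    return out

def generate_pipelined(n: int) -> list:
    out = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(forward_pass, 0)
        for _ in range(n):
            state = fut.result()
            fut = pool.submit(forward_pass, state)  # like mx.async_eval: start step N+1
            out.append(item(state))                 # sync-extract step N meanwhile
    return out
```

Because the mock sleeps release the GIL, the pipelined version overlaps the two phases and produces identical output in roughly half the wall time; mlx-swift-lm could apply the same shape by scheduling step N+1 via asyncEval before calling previousY.tokens.item(Int.self).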

3. No stream context for kernel fusion

Python wraps generation in mx.stream() contexts that enable implicit kernel fusion across operations. Swift has no equivalent — each operation is scheduled individually, preventing the MLX compiler from fusing MoE routing operations.

Why MoE is disproportionately affected

  • Dense model: ~8-10 major operations per forward pass per token
  • MoE model: ~30-40 operations per forward pass (expert gate + N expert MLPs + merge)
  • Each operation hits the evalLock and triggers individual GPU scheduling
  • The ~4x higher op count, compounded by the per-token sync and lock acquisition, plausibly accounts for the 7.3x gap

Possible fixes

  1. Remove or weaken evalLock — if mlx_async_eval is thread-safe at the C level, the Swift lock may be unnecessary
  2. Add stream context support — equivalent to Python's mx.stream() for kernel fusion
  3. Pipeline token extraction — schedule next token computation before extracting current token integer value
  4. Batch expert operations — reduce the number of individual eval calls per MoE forward pass

Impact

We use mlx-swift-lm as the inference backend for Fae, a macOS voice assistant. The MoE slowdown prevents us from using Qwen3.5-35B-A3B at its full potential on 64+ GB Apple Silicon machines. At Python-equivalent speeds (~85 T/s), this model would be excellent for real-time voice interaction.
