SIGSEGV in QuantizedMatmul::eval_gpu during long token generation on Mac Studio M2 Ultra

## Description

Sporadic SIGSEGV crashes during quantized inference on Mac Studio M2 Ultra (Mac14,14). The crash occurs in Apple's Metal driver (`AGXMetalG14X`) when MLX's `QuantizedMatmul::eval_gpu` creates a new command encoder while the driver is coalescing a previous compute command buffer.

The crash is **not** caused by concurrent API requests — we serialize all inference requests through an asyncio lock (one request at a time). It appears to be an internal race condition between MLX's GPU dispatch threads and Metal's command buffer management.

## Environment

- **Hardware:** Mac Studio M2 Ultra (76-core GPU, 128GB unified memory, wired to 115GB via `MLX_MAX_TBP_SIZE_GB`)
- **macOS:** 26.3 (25D125)
- **MLX:** 0.31.0
- **Python:** 3.12.12
- **Model:** Qwen3.5-27B-8bit (MLX quantized format)
- **Serving:** vllm-mlx 0.2.6

## Reproduction

Crashes happen during **long token generation** (thousands of tokens, e.g. generating large JSON output). Short responses (< 100 tokens) work reliably. 37 crashes collected over 2 days (March 13-14) during normal single-request use. The server auto-restarts via LaunchAgent after each crash.

```bash
# Typical invocation (via vllm-mlx serve)
vllm-mlx serve ~/ai-models/mlx_models/Qwen3.5-27B-8bit \
    --host 127.0.0.1 --port 8080 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice --mllm --timeout 2400
```

## Crash Details

```
Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0065c62f691dadd0 -> 0x0000462f691dadd0
                   (possible pointer authentication failure)

Faulting Thread: 21 (of 52 total)
Threads with MLX frames: [1, 2, 3, 4, 5, 6, 7, 8, 9, 21, 51] (11 total)
```

### Faulting Thread Stack
```
0: libobjc.A.dylib :: objc_msgSend
1: libobjc.A.dylib :: -[NSObject isMemberOfClass:]
2: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]
3: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer computeCommandEncoderWithConfig:]
4: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer computeCommandEncoderWithDispatchType:]
5: libmlx.dylib :: mlx::core::metal::CommandEncoder::CommandEncoder(mlx::core::metal::DeviceStream&)
6: libmlx.dylib :: mlx::core::metal::Device::get_command_encoder(int)
7: libmlx.dylib :: mlx::core::qmv(...)
8: libmlx.dylib :: mlx::core::QuantizedMatmul::eval_gpu(...)
9: libmlx.dylib :: mlx::core::gpu::eval(mlx::core::array&)
10: libmlx.dylib :: mlx::core::eval_impl(...)
11: libmlx.dylib :: mlx::core::async_eval(...)
```

A second crash on the same day showed `KERN_INVALID_ADDRESS at 0x0000000000000000` (null pointer dereference) on a different thread (28), also originating from MLX GPU evaluation.

## Mitigations Attempted (all failed)

| Mitigation | Result |
|---|---|
| `MLX_NUM_THREADS=1` | Still crashes |
| `TOKENIZERS_PARALLELISM=false` | Still crashes (eliminated 24 idle rayon threads) |
| External request serialization (asyncio Lock) | Already in place — crash is internal |

## Thread Analysis

At crash time, 52 threads were active in the process:

| Threads | Role | State |
|---|---|---|
| T1-T9 | MLX ThreadPool workers | Idle (condition_variable::wait) |
| T10-T20 | Python eval frames | Running |
| **T21** | **MLX GPU dispatch (CRASHED)** | `QuantizedMatmul::eval_gpu` → Metal `tryCoalescingPreviousComputeCommandEncoder` |
| T22-T45 | HuggingFace tokenizers rayon | Idle (sleeping) |
| T46 | IOGPU | `_submitCommandBuffers` |
| T51 | MLX eval_impl | `async_eval` → waiting |

The key observation: **T21 and T46 were both active in GPU command buffer operations simultaneously** — T21 creating a new command encoder while T46 submitted command buffers. The pointer authentication failure suggests a use-after-free or stale reference in the Metal command buffer being coalesced.

## Additional Context

- We also see this in [mlx-examples #866](https://github.com/ml-explore/mlx-examples/issues/866) where M2 Ultra fails while M3 Max succeeds on the same model.
- The crash only occurs during extended quantized matmul operations (long token generation). Short generations are stable.
- Full macOS crash reports (`.ips` files) are available if needed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIGSEGV in QuantizedMatmul::eval_gpu during long token generation on Mac Studio M2 Ultra #3216

Description

Environment

Reproduction

Crash Details

Faulting Thread Stack

Mitigations Attempted (all failed)

Thread Analysis

Additional Context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Mitigation	Result
`MLX_NUM_THREADS=1`	Still crashes
`TOKENIZERS_PARALLELISM=false`	Still crashes (eliminated 24 idle rayon threads)
External request serialization (asyncio Lock)	Already in place — crash is internal

Threads	Role	State
T1-T9	MLX ThreadPool workers	Idle (condition_variable::wait)
T10-T20	Python eval frames	Running
T21	MLX GPU dispatch (CRASHED)	`QuantizedMatmul::eval_gpu` → Metal `tryCoalescingPreviousComputeCommandEncoder`
T22-T45	HuggingFace tokenizers rayon	Idle (sleeping)
T46	IOGPU	`_submitCommandBuffers`
T51	MLX eval_impl	`async_eval` → waiting

SIGSEGV in QuantizedMatmul::eval_gpu during long token generation on Mac Studio M2 Ultra #3216

Description

Description

Environment

Reproduction

Crash Details

Faulting Thread Stack

Mitigations Attempted (all failed)

Thread Analysis

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions