Description
Sporadic SIGSEGV crashes during quantized inference on Mac Studio M2 Ultra (Mac14,14). The crash occurs in Apple's Metal driver (AGXMetalG14X) when MLX's QuantizedMatmul::eval_gpu creates a new command encoder while the driver is coalescing a previous compute command buffer.
The crash is not caused by concurrent API requests — we serialize all inference requests through an asyncio lock (one request at a time). It appears to be an internal race condition between MLX's GPU dispatch threads and Metal's command buffer management.
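For reference, the external serialization described above can be sketched roughly as follows. This is a minimal illustration, not the actual server code: `SerializedRunner` and `_fake_model` are hypothetical names, and the placeholder model call stands in for the real generation path. The lock limits how many requests are in flight; it does not constrain any threads that MLX or Metal use internally.

```python
import asyncio


class SerializedRunner:
    """Hypothetical wrapper: at most one inference call in flight at a time."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.in_flight = 0        # calls currently holding the lock
        self.max_in_flight = 0    # high-water mark, used to check serialization

    async def generate(self, prompt: str) -> str:
        async with self._lock:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
            try:
                # Placeholder for the real blocking model call, run in a
                # worker thread so the event loop stays responsive.
                return await asyncio.to_thread(self._fake_model, prompt)
            finally:
                self.in_flight -= 1

    @staticmethod
    def _fake_model(prompt: str) -> str:
        # Stand-in for generation; NOT an MLX or vllm-mlx API.
        return prompt.upper()


async def demo() -> tuple[list[str], int]:
    runner = SerializedRunner()
    # Three requests arrive "concurrently" but are processed one by one.
    results = await asyncio.gather(*(runner.generate(p) for p in ["a", "b", "c"]))
    return list(results), runner.max_in_flight
```

Running `asyncio.run(demo())` submits three requests at once, and the recorded high-water mark confirms that only one is ever inside the lock.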
Environment
- Hardware: Mac Studio M2 Ultra (76-core GPU, 128GB unified memory, wired to 115GB via MLX_MAX_TBP_SIZE_GB)
- macOS: 26.3 (25D125)
- MLX: 0.31.0
- Python: 3.12.12
- Model: Qwen3.5-27B-8bit (MLX quantized format)
- Serving: vllm-mlx 0.2.6
Reproduction
Crashes happen during long token generation (thousands of tokens, e.g. generating large JSON output). Short responses (< 100 tokens) work reliably. 37 crashes collected over 2 days (March 13-14) during normal single-request use. The server auto-restarts via LaunchAgent after each crash.
# Typical invocation (via vllm-mlx serve)
vllm-mlx serve ~/ai-models/mlx_models/Qwen3.5-27B-8bit \
--host 127.0.0.1 --port 8080 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --mllm --timeout 2400
Crash Details
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0065c62f691dadd0 -> 0x0000462f691dadd0
(possible pointer authentication failure)
Faulting Thread: 21 (of 52 total)
Threads with MLX frames: [1, 2, 3, 4, 5, 6, 7, 8, 9, 21, 51] (11 total)
Faulting Thread Stack
0: libobjc.A.dylib :: objc_msgSend
1: libobjc.A.dylib :: -[NSObject isMemberOfClass:]
2: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]
3: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer computeCommandEncoderWithConfig:]
4: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer computeCommandEncoderWithDispatchType:]
5: libmlx.dylib :: mlx::core::metal::CommandEncoder::CommandEncoder(mlx::core::metal::DeviceStream&)
6: libmlx.dylib :: mlx::core::metal::Device::get_command_encoder(int)
7: libmlx.dylib :: mlx::core::qmv(...)
8: libmlx.dylib :: mlx::core::QuantizedMatmul::eval_gpu(...)
9: libmlx.dylib :: mlx::core::gpu::eval(mlx::core::array&)
10: libmlx.dylib :: mlx::core::eval_impl(...)
11: libmlx.dylib :: mlx::core::async_eval(...)
A second crash on the same day showed KERN_INVALID_ADDRESS at 0x0000000000000000 (null pointer dereference) on a different thread (28), also originating from MLX GPU evaluation.
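The two addresses on the Exception Subtype line of the first crash can be related with a little bit arithmetic. The sketch below assumes the second address is the first one with everything above the low 47 bits cleared. That width is inferred from the two reported values, not taken from any documentation. Non-zero upper bits are what a signed or a corrupted pointer would both look like, so this does not tell the two cases apart.

```python
# Addresses copied from the Exception Subtype line of the first crash.
RAW_ADDRESS = 0x0065C62F691DADD0
REPORTED_STRIPPED = 0x0000462F691DADD0

# Assumption: the stripped form keeps only the low 47 bits of the pointer.
ADDRESS_BITS = 47


def strip_upper_bits(addr: int, bits: int = ADDRESS_BITS) -> int:
    """Clear every bit at position `bits` and above."""
    return addr & ((1 << bits) - 1)


def upper_bits(addr: int, bits: int = ADDRESS_BITS) -> int:
    """Return the bits that strip_upper_bits() would discard."""
    return addr >> bits


if __name__ == "__main__":
    print(hex(strip_upper_bits(RAW_ADDRESS)))
    print(hex(upper_bits(RAW_ADDRESS)))
```

Under that assumption the masked value matches the address the crash report prints after the arrow, and the discarded upper bits are non-zero.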
Mitigations Attempted (all failed)
| Mitigation | Result |
|---|---|
| MLX_NUM_THREADS=1 | Still crashes |
| TOKENIZERS_PARALLELISM=false | Still crashes (eliminated 24 idle rayon threads) |
| External request serialization (asyncio Lock) | Already in place — crash is internal |
Thread Analysis
At crash time, 52 threads were active in the process:
| Threads | Role | State |
|---|---|---|
| T1-T9 | MLX ThreadPool workers | Idle (condition_variable::wait) |
| T10-T20 | Python eval frames | Running |
| T21 | MLX GPU dispatch (CRASHED) | QuantizedMatmul::eval_gpu → Metal tryCoalescingPreviousComputeCommandEncoder |
| T22-T45 | HuggingFace tokenizers rayon | Idle (sleeping) |
| T46 | IOGPU | _submitCommandBuffers |
| T51 | MLX eval_impl | async_eval → waiting |
The key observation: T21 and T46 were both active in GPU command buffer operations simultaneously — T21 creating a new command encoder while T46 submitted command buffers. The pointer authentication failure suggests a use-after-free or stale reference in the Metal command buffer being coalesced.
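A per-thread summary like the one above can be extracted from a crash report with a short script. The sketch below is illustrative only and assumes a particular layout for the .ips file: a one-line JSON header followed by a JSON body with `threads`, `frames`, `imageIndex`, `usedImages`, and a `triggered` flag on the faulting thread. Those field names are assumptions and should be checked against an actual report before relying on the output.

```python
import json


def load_ips(text: str) -> dict:
    """Parse a crash report, assuming line 1 is a header and the rest is JSON."""
    _header, _, body = text.partition("\n")
    return json.loads(body)


def threads_with_image(report: dict, image_name: str) -> list[int]:
    """Return indices of threads having at least one frame in `image_name`."""
    images = [img.get("name", "") for img in report.get("usedImages", [])]
    hits = []
    for index, thread in enumerate(report.get("threads", [])):
        for frame in thread.get("frames", []):
            i = frame.get("imageIndex", -1)
            if 0 <= i < len(images) and images[i] == image_name:
                hits.append(index)
                break  # one matching frame is enough for this thread
    return hits


def faulting_thread(report: dict):
    """Return the index of the thread flagged as triggered, or None."""
    for index, thread in enumerate(report.get("threads", [])):
        if thread.get("triggered"):
            return index
    return None
```

Calling `threads_with_image(report, "libmlx.dylib")` on a parsed report would give the list of threads with MLX frames, and `faulting_thread(report)` the crashed one.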
Additional Context
- We also see this in mlx-examples #866 where M2 Ultra fails while M3 Max succeeds on the same model.
- The crash only occurs during extended quantized matmul operations (long token generation). Short generations are stable.
- Full macOS crash reports (.ips files) are available if needed.