Skip to content

SIGSEGV in QuantizedMatmul::eval_gpu during long token generation on Mac Studio M2 Ultra #3216

@Thump604

Description

@Thump604

Description

Sporadic SIGSEGV crashes during quantized inference on Mac Studio M2 Ultra (Mac14,14). The crash occurs in Apple's Metal driver (AGXMetalG14X) when MLX's QuantizedMatmul::eval_gpu creates a new command encoder while the driver is coalescing a previous compute command buffer.

The crash is not caused by concurrent API requests — we serialize all inference requests through an asyncio lock (one request at a time). It appears to be an internal race condition between MLX's GPU dispatch threads and Metal's command buffer management.

Environment

  • Hardware: Mac Studio M2 Ultra (76-core GPU, 128GB unified memory, wired to 115GB via MLX_MAX_TBP_SIZE_GB)
  • macOS: 26.3 (25D125)
  • MLX: 0.31.0
  • Python: 3.12.12
  • Model: Qwen3.5-27B-8bit (MLX quantized format)
  • Serving: vllm-mlx 0.2.6

Reproduction

Crashes happen during long token generation (thousands of tokens, e.g. generating large JSON output). Short responses (< 100 tokens) work reliably. 37 crashes collected over 2 days (March 13-14) during normal single-request use. The server auto-restarts via LaunchAgent after each crash.

# Typical invocation (via vllm-mlx serve)
vllm-mlx serve ~/ai-models/mlx_models/Qwen3.5-27B-8bit \
    --host 127.0.0.1 --port 8080 \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice --mllm --timeout 2400

Crash Details

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0065c62f691dadd0 -> 0x0000462f691dadd0
                   (possible pointer authentication failure)

Faulting Thread: 21 (of 52 total)
Threads with MLX frames: [1, 2, 3, 4, 5, 6, 7, 8, 9, 21, 51] (11 total)

Faulting Thread Stack

0: libobjc.A.dylib :: objc_msgSend
1: libobjc.A.dylib :: -[NSObject isMemberOfClass:]
2: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]
3: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer computeCommandEncoderWithConfig:]
4: AGXMetalG14X :: -[AGXG14XFamilyCommandBuffer computeCommandEncoderWithDispatchType:]
5: libmlx.dylib :: mlx::core::metal::CommandEncoder::CommandEncoder(mlx::core::metal::DeviceStream&)
6: libmlx.dylib :: mlx::core::metal::Device::get_command_encoder(int)
7: libmlx.dylib :: mlx::core::qmv(...)
8: libmlx.dylib :: mlx::core::QuantizedMatmul::eval_gpu(...)
9: libmlx.dylib :: mlx::core::gpu::eval(mlx::core::array&)
10: libmlx.dylib :: mlx::core::eval_impl(...)
11: libmlx.dylib :: mlx::core::async_eval(...)

A second crash on the same day showed KERN_INVALID_ADDRESS at 0x0000000000000000 (null pointer dereference) on a different thread (28), also originating from MLX GPU evaluation.

Mitigations Attempted (all failed)

Mitigation Result
MLX_NUM_THREADS=1 Still crashes
TOKENIZERS_PARALLELISM=false Still crashes (eliminated 24 idle rayon threads)
External request serialization (asyncio Lock) Already in place — crash is internal

Thread Analysis

At crash time, 52 threads were active in the process:

Threads Role State
T1-T9 MLX ThreadPool workers Idle (condition_variable::wait)
T10-T20 Python eval frames Running
T21 MLX GPU dispatch (CRASHED) QuantizedMatmul::eval_gpu → Metal tryCoalescingPreviousComputeCommandEncoder
T22-T45 HuggingFace tokenizers rayon Idle (sleeping)
T46 IOGPU _submitCommandBuffers
T51 MLX eval_impl async_eval → waiting

The key observation: T21 and T46 were both active in GPU command buffer operations simultaneously — T21 creating a new command encoder while T46 submitted command buffers. The pointer authentication failure suggests a use-after-free or stale reference in the Metal command buffer being coalesced.

Additional Context

  • We also see this in mlx-examples #866 where M2 Ultra fails while M3 Max succeeds on the same model.
  • The crash only occurs during extended quantized matmul operations (long token generation). Short generations are stable.
  • Full macOS crash reports (.ips files) are available if needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions