Summary
check_error(MTL::CommandBuffer*) in mlx/backend/metal/eval.cpp can throw std::runtime_error inside a Metal completion handler callback dispatched by GCD. There is no try/catch in the handler, so the exception hits std::terminate -> abort() -> SIGABRT.
Environment
- M2 Ultra 128GB, macOS 26.3.1
- MLX 0.31.2-dev (branch feat/chunked-sdpa-logsumexp, includes PR #3281, "Make each thread have its own default stream")
- Qwen3.5-122B-A10B, 5-bit, 32K context inference
Crash Stack (from .ips report)
libc++abi: __cxa_throw -> failed_throw -> std::terminate -> abort
libmlx.dylib: check_error(MTL::CommandBuffer*)
libmlx.dylib: [completion handler block]
Metal: -[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:]
IOGPU: -[IOGPUMetalCommandBuffer fillCommandBufferArgs:]_block_invoke
[GCD dispatch chain]
Root Cause
In eval.cpp the completion handler is:
    command_buffer->addCompletedHandler(
        [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
          scheduler::notify_task_completion(s);
          check_error(cbuf); // throws std::runtime_error if GPU error
        });
When the GPU reports a command buffer execution error (OOM, timeout, or corrupted state), check_error throws. Metal invokes this callback on a GCD dispatch thread, so no MLX frame above it on the stack can catch the exception. It escapes the handler uncaught, and the C++ runtime calls std::terminate.
Reproduction
This occurred during sustained 32K-context inference (250 sequential requests, lm_eval RULER benchmark). The crash happens after ~3.5 hours of continuous operation. It's not easily reproducible on demand but occurs reliably under sustained load.
Suggested Fix
Catch the exception inside the completion handler and propagate the error through a different mechanism (e.g., an error flag checked at the next eval() synchronization point):
    command_buffer->addCompletedHandler(
        [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
          try {
            check_error(cbuf);
          } catch (const std::exception& e) {
            // Store error for propagation at next sync point
            scheduler::set_task_error(s, e.what());
          }
          // Notify after recording the error, so a waiter woken by this
          // notification already sees it.
          scheduler::notify_task_completion(s);
        });
Related
- #3216, "SIGSEGV in QuantizedMatmul::eval_gpu during long token generation on Mac Studio M2 Ultra": Metal driver SIGABRT (different bug: thread safety race). Fixed by PR #3281, "Make each thread have its own default stream" (merged).
- This is a separate bug: the completion handler throw path, not thread safety.
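Sketch of the deferred-error mechanism
The error flag proposed under Suggested Fix could look roughly like the following. This is a minimal, Metal-free sketch and not MLX code: StreamErrors, fake_check_error, and run_completion_handler are hypothetical names, a std::thread stands in for the GCD dispatch thread, and an int stands in for the stream. It only illustrates the pattern of catching inside the callback and rethrowing on the thread that synchronizes.

```cpp
#include <exception>
#include <mutex>
#include <optional>
#include <stdexcept>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical per-stream error store (an assumption, not an MLX API).
class StreamErrors {
 public:
  // Called from the completion handler thread. Keeps only the first error
  // recorded for a stream; later ones are dropped.
  void set(int stream, std::string msg) {
    std::lock_guard<std::mutex> lk(mu_);
    errors_.try_emplace(stream, std::move(msg));
  }

  // Called from the thread that synchronizes the stream. If an error was
  // recorded, clears it and throws it here, where a caller can catch it.
  void rethrow_if_set(int stream) {
    std::optional<std::string> msg;
    {
      std::lock_guard<std::mutex> lk(mu_);
      auto it = errors_.find(stream);
      if (it != errors_.end()) {
        msg = std::move(it->second);
        errors_.erase(it);
      }
    }
    if (msg) {
      throw std::runtime_error(*msg);
    }
  }

 private:
  std::mutex mu_;
  std::unordered_map<int, std::string> errors_;
};

// Stand-in for check_error: throws when the simulated command buffer failed.
inline void fake_check_error(bool gpu_failed) {
  if (gpu_failed) {
    throw std::runtime_error("simulated command buffer error");
  }
}

// Runs a simulated completion handler on a separate thread. The try/catch
// lives inside the callback, so no exception leaves that thread.
inline void run_completion_handler(
    StreamErrors& errors, int stream, bool gpu_failed) {
  std::thread callback([&errors, stream, gpu_failed] {
    try {
      fake_check_error(gpu_failed);
    } catch (const std::exception& e) {
      errors.set(stream, e.what());
    }
  });
  callback.join();  // stands in for waiting on the command buffer
}
```

In this sketch, the synchronizing thread would call rethrow_if_set(stream) right after it finishes waiting on the stream. The failure then surfaces as an ordinary std::runtime_error on a thread that can handle it, instead of terminating the process from the callback.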