check_error() throws inside Metal completion handler, hits std::terminate #3317

## Summary

`check_error(MTL::CommandBuffer*)` in `mlx/backend/metal/eval.cpp` can throw `std::runtime_error` inside a Metal completion handler callback dispatched by GCD. The handler has no try/catch, so the exception is uncaught and the process dies via `std::terminate` -> `abort()` -> SIGABRT.

## Environment

- M2 Ultra 128GB, macOS 26.3.1
- MLX 0.31.2-dev (branch feat/chunked-sdpa-logsumexp, includes PR #3281)
- Qwen3.5-122B-A10B, 5-bit, 32K context inference

## Crash Stack (from .ips report)

```
libc++abi: __cxa_throw -> failed_throw -> std::terminate -> abort
libmlx.dylib: check_error(MTL::CommandBuffer*)
libmlx.dylib: [completion handler block]
Metal: -[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:]
IOGPU: -[IOGPUMetalCommandBuffer fillCommandBufferArgs:]_block_invoke
[GCD dispatch chain]
```

## Root Cause

In `eval.cpp` the completion handler is:

```cpp
command_buffer->addCompletedHandler(
    [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
        scheduler::notify_task_completion(s);
        check_error(cbuf);  // throws std::runtime_error if GPU error
    });
```

When the GPU reports a command buffer execution error (OOM, timeout, or corrupted state), `check_error` throws. But this callback is invoked by Metal on a GCD dispatch thread, so its callers are Metal, IOGPU, and libdispatch frames (see the crash stack), and none of them has a C++ `catch` handler. The exception is therefore uncaught and goes straight to `std::terminate`.

## Reproduction

This occurred during sustained 32K-context inference (250 sequential requests, lm_eval RULER benchmark). The crash happens after ~3.5 hours of continuous operation. It's not easily reproducible on demand but occurs reliably under sustained load.

## Suggested Fix

Catch the exception inside the completion handler and propagate the error through a different mechanism (e.g., an error flag checked at the next `eval()` synchronization point):

```cpp
command_buffer->addCompletedHandler(
    [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
        scheduler::notify_task_completion(s);
        try {
            check_error(cbuf);
        } catch (const std::exception& e) {
            // Store error for propagation at next sync point.
            // set_task_error is a proposed helper, shown here for illustration.
            scheduler::set_task_error(s, e.what());
        }
    });
```
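To make the "error flag checked at the next synchronization point" idea concrete, here is a minimal standalone sketch that does not use Metal or MLX at all. Everything in it is hypothetical: `ErrorSlot`, `fake_check_error`, `completion_handler`, and `run_and_sync` are illustrative names, and a plain `std::thread` stands in for the GCD dispatch thread. It stores a `std::exception_ptr` rather than a string, so the original exception type survives until it is rethrown on the calling thread.

```cpp
#include <cassert>
#include <exception>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>

// Hypothetical per-stream error slot; not part of MLX.
class ErrorSlot {
 public:
  // Called from the callback thread: keep only the first failure.
  void store(std::exception_ptr e) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!error_) error_ = std::move(e);
  }

  // Called from the synchronizing thread: take any stored failure,
  // clear the slot, and rethrow outside the lock.
  void rethrow_if_set() {
    std::exception_ptr e;
    {
      std::lock_guard<std::mutex> lock(mu_);
      std::swap(e, error_);
    }
    if (e) std::rethrow_exception(e);
  }

 private:
  std::mutex mu_;
  std::exception_ptr error_;
};

// Stand-in for check_error(): throws when the simulated command buffer failed.
void fake_check_error(bool gpu_failed) {
  if (gpu_failed) throw std::runtime_error("simulated command buffer error");
}

// Stand-in for the completion handler body: catches whatever the check
// throws and parks it in the slot instead of letting it escape the callback.
void completion_handler(ErrorSlot& slot, bool gpu_failed) {
  try {
    fake_check_error(gpu_failed);
  } catch (...) {
    slot.store(std::current_exception());
  }
}

// Runs the handler on a second thread (standing in for the GCD thread),
// then "synchronizes" and returns the surfaced message, or "" if none.
std::string run_and_sync(bool gpu_failed) {
  ErrorSlot slot;
  std::thread callback_thread([&] { completion_handler(slot, gpu_failed); });
  callback_thread.join();
  try {
    slot.rethrow_if_set();
  } catch (const std::exception& e) {
    return e.what();
  }
  return "";
}

// Usage: the failure arrives as an ordinary exception on the calling thread.
void demo() {
  assert(run_and_sync(false).empty());
  assert(run_and_sync(true) == "simulated command buffer error");
}
```

Unlike the snippet above, the sketch uses `catch (...)`, so an exception that does not derive from `std::exception` would also be captured rather than reaching `std::terminate`. Whether MLX should key such a slot per stream or per command buffer is left open here.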

## Related

- #3216 -- Metal driver SIGABRT (different bug: thread safety race). Fixed by PR #3281 (merged).
- This is a separate bug: the completion handler throw path, not thread safety.
