Handle Metal OOM gracefully in mlx_lm.server with structured errors #1034

Open
Aristide021 wants to merge 4 commits into ml-explore:main from Aristide021:server-oom-hardening

Aristide021 commented Mar 21, 2026

Classify generation failures in mlx_lm.server and return structured errors instead of crashing or misreporting as 404.

  • Detect Metal/MLX OOM errors and map them to HTTP 503
  • Map other generation exceptions to HTTP 500
  • Return structured JSON error payloads for non-stream responses
  • Emit terminal SSE error event + [DONE] for stream responses
  • Keep server alive after OOM
  • Defer non-stream 200 headers until success response is ready
  • Add OOM regression tests (stream + non-stream) in tests/test_server.py
  • Document OOM behavior and mitigation knobs in mlx_lm/SERVER.md
  • Add extra OOM marker coverage (insufficient memory for buffer)
  • Log classified Metal OOM events for operator visibility

Closes #854
Refs #1015

Aware of #948 (broader memory controls); this PR is intentionally scoped to crash-to-response handling and can merge independently.
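
For readers skimming the thread, the classification described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the names `OOM_MARKERS`, `is_metal_oom`, and `classify_generation_error`, the payload shape, and the case-insensitive substring matching are all assumptions. The marker strings are taken from the messages quoted in this conversation.

```python
# Hypothetical marker list, assembled from substrings named in this thread.
OOM_MARKERS = (
    "failed to allocate",
    "out of memory",
    "insufficient memory for buffer",
)


def is_metal_oom(exc: BaseException) -> bool:
    """Return True if the exception text looks like a Metal/MLX OOM."""
    message = str(exc).lower()  # case-insensitive matching is an assumption
    return any(marker in message for marker in OOM_MARKERS)


def classify_generation_error(exc: BaseException):
    """Map a generation failure to (HTTP status, structured payload)."""
    if is_metal_oom(exc):
        status, kind = 503, "metal_oom"  # OOM: service temporarily unavailable
    else:
        status, kind = 500, "generation_error"  # anything else: server error
    # The payload layout below is illustrative, not the PR's actual schema.
    payload = {"error": {"type": kind, "message": str(exc)}}
    return status, payload


oom = RuntimeError(
    "Metal error: command buffer execution failed due to out of memory"
)
oom_status, oom_payload = classify_generation_error(oom)
other_status, other_payload = classify_generation_error(ValueError("bad prompt"))
```

Matching on message substrings is brittle by nature, which is why the marker list is worth covering with regression tests.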

Thump604 commented

The OOM detection markers look correct for Apple Silicon. The main paths MLX raises on unified memory exhaustion are:

  1. "failed to allocate" from allocator.cpp when MTLDevice allocateBuffer returns nil -- this is the most common path and you've got it covered.
  2. "Metal error: command buffer execution failed due to out of memory" from command buffer submission failure -- also covered.

One gap: when mx.metal.set_memory_limit() is active, MLX can throw "Attempting to allocate X bytes which is greater than the maximum allowed buffer size" (from the metal::malloc limit check). The "failed to allocate" marker wouldn't match that. Worth adding "attempting to allocate" or "maximum allowed buffer size" to the marker list.
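
The gap can be illustrated with a small check. This is a sketch under assumptions: the helper `matches` and the tuple names are hypothetical, matching is assumed to be a case-insensitive substring test, and `1024` stands in for the elided byte count in the message.

```python
BASE_MARKERS = ("failed to allocate",)
EXTRA_MARKERS = ("attempting to allocate", "maximum allowed buffer size")


def matches(message: str, markers) -> bool:
    """Case-insensitive substring match against a marker list."""
    lowered = message.lower()
    return any(marker in lowered for marker in markers)


# Placeholder size; the real message carries the requested byte count.
limit_message = (
    "Attempting to allocate 1024 bytes which is greater than "
    "the maximum allowed buffer size"
)

missed = matches(limit_message, BASE_MARKERS)                   # gap
covered = matches(limit_message, BASE_MARKERS + EXTRA_MARKERS)  # fixed
```

Either of the two extra markers is enough to catch this message on its own; keeping both guards against small wording changes in the error text.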

The deferred-200 pattern for non-streaming is a good fix. The streaming error path handling (pre-stream vs mid-stream headers) is also correct.
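
A rough sketch of both patterns, for reference. It assumes a handler object with the `send_response` / `send_header` / `end_headers` / `wfile` interface of Python's `http.server.BaseHTTPRequestHandler`. The function names, the payload shapes, and the choice to return a plain JSON error for pre-stream failures are assumptions, and a fixed 500 is used to keep the focus on header ordering (the PR selects 503 for OOM).

```python
import json


def send_json(handler, status, payload):
    """Write the status line, headers, and a JSON body in one step."""
    body = json.dumps(payload).encode("utf-8")
    handler.send_response(status)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)


def sse(data: str) -> bytes:
    """Frame one server-sent event."""
    return f"data: {data}\n\n".encode("utf-8")


def handle_non_stream(handler, generate):
    # Deferred 200: run generation before touching the socket, so a
    # failure can still choose its own status code.
    try:
        text = generate()
    except Exception as exc:
        send_json(handler, 500, {"error": {"message": str(exc)}})
        return
    send_json(handler, 200, {"text": text})


def handle_stream(handler, chunks):
    started = False

    def start():
        handler.send_response(200)
        handler.send_header("Content-Type", "text/event-stream")
        handler.end_headers()

    try:
        for chunk in chunks:
            if not started:
                start()  # headers go out only once the first chunk exists
                started = True
            handler.wfile.write(sse(json.dumps({"text": chunk})))
    except Exception as exc:
        error = {"error": {"message": str(exc)}}
        if not started:
            # Pre-stream failure: nothing sent yet, a real status is possible.
            send_json(handler, 500, error)
            return
        # Mid-stream failure: the 200 is already out, so report in-band.
        handler.wfile.write(sse(json.dumps(error)))
    if not started:
        start()  # empty but successful stream
    handler.wfile.write(sse("[DONE]"))
```

The key design point is that the status line is irrevocable once written, so the only place left to report a mid-stream failure is inside the event stream itself.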

Minor: the error response includes retry_after: 30 which is a reasonable default, but memory recovery on Apple Silicon (unified memory, no separate GPU eviction) really depends on whether other processes release memory. A shorter default (5-10s) might give a better user experience for transient spikes.

- Add marker coverage for 'attempting to allocate' and 'maximum allowed buffer size'

- Add regression test to ensure these variants map to HTTP 503
Development

Successfully merging this pull request may close these issues.

mlx_lm.server crashes on Metal OOM instead of returning an HTTP error
