feat: support Flash Weight Streaming via mlx-flash for models larger than RAM #293
matt-k-wong wants to merge 1 commit into lmstudio-ai:main
Conversation
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a86e1586f4
```python
        ram_budget_gb=flash_ram_gb,
        debug=flash_debug,
    )
    apply_flash_patch(flash_cfg)
```
Patch the loader actually used by mlx-engine
When `flash_mode=True`, this only calls `mlx_flash.integration.lmstudio.apply_flash_patch()`, but that integration monkey-patches `mlx_lm.load`, while every real model load in this repo goes through `mlx_lm.utils.load` instead: `ModelKit._full_model_init()` uses it in `mlx_engine/model_kit/model_kit.py:92`, `BatchedModelKit.__init__()` uses it in `mlx_engine/model_kit/batched_model_kit.py:78`, and the batchability probe uses the `mlx_lm.utils.load` alias in `mlx_engine/generate.py:226`. So the new flag never routes mlx-engine's text-model loads through `FlashManager`, and large models still follow the original eager load path/OOM behavior.
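The failure mode this review comment describes is a general Python name-binding pitfall: rebinding one module attribute does not change call sites that already hold a direct reference to the function object. A minimal self-contained sketch (the module and function names here are stand-ins, not the real `mlx_lm`/`mlx_flash` packages):

```python
import types

# Stand-in for mlx_lm.utils, which owns the real load()
utils = types.ModuleType("utils")
utils.load = lambda path: f"eager-load:{path}"

# Stand-in for the mlx_lm package, which re-exports load
mlx_lm = types.ModuleType("mlx_lm")
mlx_lm.load = utils.load

# A call site in the engine binds the function object directly,
# like `from mlx_lm.utils import load`
engine_load = utils.load

# The patch replaces only the package-level attribute
mlx_lm.load = lambda path: f"flash-load:{path}"

print(mlx_lm.load("model"))   # flash-load:model  -> patched
print(engine_load("model"))   # eager-load:model  -> unaffected
print(utils.load("model"))    # eager-load:model  -> unaffected
```

This is why patching `mlx_lm.load` alone leaves loads going through `mlx_lm.utils.load` (or an alias of it) on the original eager path.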
Description
This PR introduces an opt-in `flash_mode` flag to the inference configuration, integrating `mlx-flash` into the mlx-engine loading sequence.

Why this matters:
For Apple Silicon users with base M-series chips (16 GB / 24 GB Unified Memory), loading large models (e.g., Llama-3-70B, Nemotron-30B) currently causes massive swap usage, OS freezes, and OOM crashes.

`mlx-flash` intercepts the `mlx_lm` forward pass, applying synchronous layer evaluation with immediate `mx.metal.clear_cache()`. This lets users cleanly stream weights from the SSD page cache, significantly expanding the tier of models they can run sequentially without freezing their machines.

Changes
- Add `mlx-flash` to the optional dependencies.
- In the loading sequence (`mlx_engine/model_pool.py` or equivalent), if `flash_mode: true` is passed in the LM Studio config, we initialize the `FlashConfig` and call `apply_flash_patch()` before native `mlx_lm.load` is invoked.

Implementation / Diff Template
Here is a drop-in diff template showing exactly where and how to patch the engine payload:
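A minimal sketch of that integration point, consistent with the description and the quoted diff above. The `enable_flash_if_requested` wrapper is a hypothetical name, and the top-level import location of `FlashConfig` is an assumption; `apply_flash_patch`'s module path follows the review comment:

```python
def enable_flash_if_requested(config: dict) -> bool:
    """If flash_mode is set, patch the loader before mlx_lm.load runs."""
    if not config.get("flash_mode", False):
        return False  # default: untouched lazy mlx_lm load path, zero overhead

    # Optional dependency: only imported when the flag is on.
    from mlx_flash import FlashConfig  # import location assumed
    from mlx_flash.integration.lmstudio import apply_flash_patch

    flash_cfg = FlashConfig(
        ram_budget_gb=config.get("flash_ram_gb", 8),  # kwargs per the diff above
        debug=config.get("flash_debug", False),
    )
    apply_flash_patch(flash_cfg)
    return True

# With the flag absent or false, nothing is patched:
print(enable_flash_if_requested({}))                     # False
print(enable_flash_if_requested({"flash_mode": False}))  # False
```

Keeping the `mlx_flash` import inside the flagged branch means users who never enable `flash_mode` do not need the optional dependency installed.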
Testing
- `flash_mode=False` (or omitting it) preserves 100% of the original `mlx_lm` lazy graph execution path with zero overhead.
- Ran with `flash_mode=True` and successfully loaded and generated from a 30B-parameter model on a 16 GB Mac without swapping. Peak Metal RAM sits well below 1 GB.

Future UI Impact
Once this payload exists in mlx-engine, the upstream LM Studio Electron app simply needs to introduce a "⚡ Enable Flash Weight Streaming" checkbox in the inference settings pane that injects `"flash_mode": true` into the JSON config.
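As an illustration, the injected config fragment might look like the following; the `flash_ram_gb` and `flash_debug` keys are inferred from the quoted diff above, and the budget value is an arbitrary example:

```json
{
  "flash_mode": true,
  "flash_ram_gb": 8,
  "flash_debug": false
}
```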