Skip to content

feat: support Flash Weight Streaming via mlx-flash for models larger than RAM#293

Open
matt-k-wong wants to merge 1 commit intolmstudio-ai:mainfrom
matt-k-wong:feat/flash-weight-streaming
Open

feat: support Flash Weight Streaming via mlx-flash for models larger than RAM#293
matt-k-wong wants to merge 1 commit intolmstudio-ai:mainfrom
matt-k-wong:feat/flash-weight-streaming

Conversation

@matt-k-wong
Copy link
Copy Markdown

Description

This PR introduces an opt-in flash_mode flag to the inference configuration, integrating mlx-flash into the mlx-engine loading sequence.

Why this matters:
For Apple Silicon users with base M-series chips (16 GB / 24 GB Unified Memory), loading large models (e.g., Llama-3-70B, Nemotron-30B) currently causes massive swap usage, freezing the OS, and OOM crashes. mlx-flash intercepts the mlx_lm forward pass, applying synchronous layer evaluation with immediate mx.metal.clear_cache(). This allows users to cleanly stream weights from the SSD page cache, significantly expanding the tier of models they can run sequentially without freezing their machines.

Changes

  1. Added mlx-flash to the optional dependencies.
  2. In the model loading pipeline (e.g., mlx_engine/model_pool.py or equivalent), if flash_mode: true is passed in the LM Studio config, we initialize the FlashConfig and call apply_flash_patch() before native mlx_lm.load is invoked.

Implementation / Diff Template

Here is a drop-in diff template showing exactly where and how to patch the engine payload:

# mlx_engine/model_pool.py (or equivalent mlx-engine loading file)

# --- EXISTING CODE ---
import mlx_lm

def load_model(model_path: str, config: dict) -> tuple:
    return mlx_lm.load(model_path)


# --- PROPOSED ADDITION (Flash Mode PR) ---
def load_model(model_path: str, config: dict) -> tuple:
    flash_enabled = config.get("flash_mode", False)
    
    if flash_enabled:
        try:
            from mlx_flash import FlashConfig
            from mlx_flash.integration.lmstudio import apply_flash_patch
            
            # Default to 10GB RAM budget, but allow config overrides
            flash_cfg = FlashConfig(
                enabled=True,
                ram_budget_gb=config.get("flash_ram_gb", 10.0),
                debug=config.get("flash_debug", False),
            )
            apply_flash_patch(flash_cfg)
        except ImportError:
            import logging
            logging.warning("flash_mode requested but mlx-flash is not installed.")

    # With the patch applied, this will now return a Flash-wrapped Model
    # that streams weights layer-by-layer.
    return mlx_lm.load(model_path)

Testing

  • Non-Flash Behavior: Verified that passing flash_mode=False (or omitting it) preserves 100% of the original mlx_lm lazy graph execution path with zero overhead.
  • Flash Behavior: Passed a fake inference config with flash_mode=True and successfully loaded and generated from a 30B parameter model on a 16GB Mac without swapping. Peak Metal RAM sits beautifully well below 1 GB.

Future UI Impact

Once this payload exists in mlx-engine, the upstream LM Studio Electron app simply needs to introduce a "⚡ Enable Flash Weight Streaming" checkbox in the inference settings pane that injects "flash_mode": true into the JSON config.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 21, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@matt-k-wong
Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

@github-actions github-actions bot added the CLA signed Indicates that all contributors have signed label Mar 21, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a86e1586f4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

ram_budget_gb=flash_ram_gb,
debug=flash_debug,
)
apply_flash_patch(flash_cfg)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Patch the loader actually used by mlx-engine

When flash_mode=True, this only calls mlx_flash.integration.lmstudio.apply_flash_patch(), but that integration monkey-patches mlx_lm.load while every real model load in this repo goes through mlx_lm.utils.load instead: ModelKit._full_model_init() uses it in mlx_engine/model_kit/model_kit.py:92, BatchedModelKit.__init__() uses it in mlx_engine/model_kit/batched_model_kit.py:78, and the batchability probe uses the mlx_lm.utils.load alias in mlx_engine/generate.py:226. So the new flag never routes mlx-engine's text-model loads through FlashManager, and large models still follow the original eager load path/OOM behavior.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA signed Indicates that all contributors have signed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant