feat: support Flash Weight Streaming via mlx-flash for models larger than RAM #293
matt-k-wong wants to merge 1 commit into lmstudio-ai:main
Conversation
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a86e1586f4
```python
        ram_budget_gb=flash_ram_gb,
        debug=flash_debug,
    )
    apply_flash_patch(flash_cfg)
```
Patch the loader actually used by mlx-engine
When `flash_mode=True`, this only calls `mlx_flash.integration.lmstudio.apply_flash_patch()`, but that integration monkey-patches `mlx_lm.load`, while every real model load in this repo goes through `mlx_lm.utils.load` instead: `ModelKit._full_model_init()` uses it in `mlx_engine/model_kit/model_kit.py:92`, `BatchedModelKit.__init__()` uses it in `mlx_engine/model_kit/batched_model_kit.py:78`, and the batchability probe uses the `mlx_lm.utils.load` alias in `mlx_engine/generate.py:226`. So the new flag never routes mlx-engine's text-model loads through `FlashManager`, and large models still follow the original eager load path/OOM behavior.
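The failure mode this review comment describes is a general Python name-binding pitfall: rebinding one module attribute does not change call sites that already hold a direct reference to the function object. A minimal self-contained sketch (the module and function names here are stand-ins, not the real `mlx_lm`/`mlx_flash` packages):

```python
import types

# Stand-in for mlx_lm.utils, which owns the real load()
utils = types.ModuleType("utils")
utils.load = lambda path: f"eager-load:{path}"

# Stand-in for the mlx_lm package, which re-exports load
mlx_lm = types.ModuleType("mlx_lm")
mlx_lm.load = utils.load

# A call site in the engine binds the function object directly,
# like `from mlx_lm.utils import load`
engine_load = utils.load

# The patch replaces only the package-level attribute
mlx_lm.load = lambda path: f"flash-load:{path}"

print(mlx_lm.load("model"))   # flash-load:model  -> patched
print(engine_load("model"))   # eager-load:model  -> unaffected
print(utils.load("model"))    # eager-load:model  -> unaffected
```

This is why patching `mlx_lm.load` alone leaves loads going through `mlx_lm.utils.load` (or an alias of it) on the original eager path.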
Description
This PR introduces an opt-in `flash_mode` flag to the inference configuration, integrating `mlx-flash` into the mlx-engine loading sequence.

Why this matters:
For Apple Silicon users with base M-series chips (16 GB / 24 GB Unified Memory), loading large models (e.g., Llama-3-70B, Nemotron-30B) currently causes massive swap usage, OS freezes, and OOM crashes.

`mlx-flash` intercepts the `mlx_lm` forward pass, applying synchronous layer evaluation with immediate `mx.metal.clear_cache()`. This lets users cleanly stream weights from the SSD page cache, significantly expanding the tier of models they can run sequentially without freezing their machines.

Changes
- Add `mlx-flash` to the optional dependencies.
- In the loading sequence (`mlx_engine/model_pool.py` or equivalent), if `flash_mode: true` is passed in the LM Studio config, we initialize the `FlashConfig` and call `apply_flash_patch()` before native `mlx_lm.load` is invoked.

Implementation / Diff Template
Here is a drop-in diff template showing exactly where and how to patch the engine payload:
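A minimal sketch of that integration point, consistent with the description and the quoted diff above. The `enable_flash_if_requested` wrapper is a hypothetical name, and the top-level import location of `FlashConfig` is an assumption; `apply_flash_patch`'s module path follows the review comment:

```python
def enable_flash_if_requested(config: dict) -> bool:
    """If flash_mode is set, patch the loader before mlx_lm.load runs."""
    if not config.get("flash_mode", False):
        return False  # default: untouched lazy mlx_lm load path, zero overhead

    # Optional dependency: only imported when the flag is on.
    from mlx_flash import FlashConfig  # import location assumed
    from mlx_flash.integration.lmstudio import apply_flash_patch

    flash_cfg = FlashConfig(
        ram_budget_gb=config.get("flash_ram_gb", 8),  # kwargs per the diff above
        debug=config.get("flash_debug", False),
    )
    apply_flash_patch(flash_cfg)
    return True

# With the flag absent or false, nothing is patched:
print(enable_flash_if_requested({}))                     # False
print(enable_flash_if_requested({"flash_mode": False}))  # False
```

Keeping the `mlx_flash` import inside the flagged branch means users who never enable `flash_mode` do not need the optional dependency installed.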
Testing
- `flash_mode=False` (or omitting it) preserves 100% of the original `mlx_lm` lazy graph execution path with zero overhead.
- Ran with `flash_mode=True` and successfully loaded and generated from a 30B-parameter model on a 16 GB Mac without swapping. Peak Metal RAM sits well below 1 GB.

Future UI Impact
Once this payload exists in mlx-engine, the upstream LM Studio Electron app simply needs to introduce a "⚡ Enable Flash Weight Streaming" checkbox in the inference settings pane that injects `"flash_mode": true` into the JSON config.
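As an illustration, the injected config fragment might look like the following; the `flash_ram_gb` and `flash_debug` keys are inferred from the quoted diff above, and the budget value is an arbitrary example:

```json
{
  "flash_mode": true,
  "flash_ram_gb": 8,
  "flash_debug": false
}
```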