Flash Weight Streaming for MLX — run models larger than your RAM on Apple Silicon. 30B on 16 GB, 70B+ on 32 GB+. No additional quantization — uses the model's native precision.
Project Lineage: This implementation is inspired by Apple Research's paper LLM in a Flash (arXiv 2312.11514), which formalized the concept of using the OS page cache for efficient weight streaming. The original `flash-moe` project provided the first Objective-C + Metal proof of concept for this approach on Apple Silicon. This repository (mlx-flash) extends those principles to the Python-based MLX ecosystem, providing a robust, duck-typed integration layer for `mlx-lm`.
- Why Flash Mode?
- How It Works
- Architecture Diagrams
- Performance
- Output Quality
- Quick Start
- LM Studio Usage
- Modelfile Usage
- Technical Deep Dive
- Contributing
| Model | Hardware | Mode | Load Time | Peak Weight RSS | Result |
|---|---|---|---|---|---|
| Nemotron-30B (17.8 GB) | 16GB MacBook Air | Normal | 4.1s | 18+ GB (Swap) | ❌ Laggy |
| Nemotron-30B (17.8 GB) | 16GB MacBook Air | Flash | 0.8s | ~0.5 GB | ✅ Smooth |
Important
Flash Mode is strictly for models that are larger than your RAM.
It allows you to run models of any size (30B, 70B, even 1T+) on base-spec Macs by streaming weights directly from your SSD.
The secret: Synchronous Layer Evaluation. Standard MLX uses "lazy graph evaluation," which attempts to build a massive graph spanning all layers before execution. This causes Metal to attempt allocating all weights at once, leading to OOM.
mlx-flash bypasses this by:
- Loading weights as lazy mmap-backed arrays via `mlx_lm.load(path, lazy=True)`.
- Intercepting the forward pass to execute one layer at a time.
- Forcing materialization via `mx.eval()` + `mx.synchronize()` after each layer.
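The shape of that per-layer loop can be sketched in plain NumPy (a schematic only — the real engine operates on MLX lazy arrays, with the `mx.eval`/`mx.synchronize` calls noted in the comments):

```python
import numpy as np

def forward_streamed(x, weight_loaders):
    """Schematic of the per-layer loop: load one layer's weights,
    compute, materialize, release, then move on. Peak weight RAM
    stays at ~one layer no matter how deep the model is."""
    h = x
    for load in weight_loaders:
        w = load()           # mmap-backed lazy load in real MLX
        h = np.tanh(h @ w)   # placeholder for the layer's compute
        # mlx-flash here calls mx.eval(h) and mx.synchronize(),
        # then releases the layer's pages with madvise(MADV_FREE)
        del w
    return h

rng = np.random.default_rng(0)
loaders = [lambda: rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
out = forward_streamed(rng.standard_normal((1, 8)), loaders)
print(out.shape)  # (1, 8)
```

Because only one layer's graph exists at a time, Metal never sees an allocation request for the full model.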
mlx-flash is a no-compromise quality engine. Unlike other low-RAM solutions, we do not use lossy compression (like 4-bit or 2-bit quantization) to shrink the model into your RAM. Instead, we trade Time for Capacity.
- Bit-for-Bit Identical: Weights are streamed in their native precision (F16/BF16/F32). The tokens generated are identical to running the model on a $6,000 Mac Studio with 192GB of RAM.
- High-Precision KV Cache:
DiskKVCachestores context on your SSD at full precision, avoiding the logic degradation common in 4-bit KV cache implementations. - Deterministic Sampling: Supports the full
mlx-lmsampling suite with perfect reproducibility.
Flash Mode is a hybrid engine. You control exactly how much RAM is traded for speed.
- Safety Profile (1.0 - 2.0 GB): "Slow but Invincible." Forces strict weight streaming. Recommended for 30B+ models on limited RAM.
- Balanced Profile (4.0 - 8.0 GB): "Smart Mode." Caches weights in RAM when possible (10x speedup), but automatically triggers streaming if memory pressure spikes.
- Performance Profile (12.0+ GB): Keeps most of the model in RAM. Maximum speed, only uses Flash Mode for the "overflow."
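A hypothetical selector mirroring the tiers above (the thresholds and names here are illustrative, not mlx-flash's real API):

```python
def choose_profile(ram_budget_gb: float) -> str:
    """Map a RAM budget to the profile tiers described above.
    Thresholds and profile names are illustrative only."""
    if ram_budget_gb <= 2.0:
        return "safety"       # strict streaming: slow but invincible
    if ram_budget_gb <= 8.0:
        return "balanced"     # cache in RAM, stream under pressure
    return "performance"      # mostly resident, stream the overflow

print(choose_profile(1.5))  # safety
```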
Tip
Use the Experimental Matrix: Run `python scripts/run_matrix_experiments.py` to find the exact "Performance Cliff" for your specific model and hardware.
Tip
If your model is "Killed" (Exit code 137): This means your `ram_budget_gb` + OS overhead exceeded your physical RAM. Lower the budget by 1.0 GB and restart.
- Bottomless Window: Your prompt size is limited only by your SSD free space, not your RAM.
- Eviction Mode: To prevent SSD bloat, we use a Halving Eviction policy by default. When the `max_tokens` limit is reached, it keeps the most recent 50% of the context. (Set `max_tokens=None` for absolute perfect recall.)
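A minimal sketch of such a halving policy (a hypothetical helper, not the library's actual implementation):

```python
def evict_halving(tokens, max_tokens):
    """Once the cache reaches max_tokens, keep only the most recent
    half of the context; max_tokens=None disables eviction."""
    if max_tokens is None or len(tokens) < max_tokens:
        return tokens
    return tokens[len(tokens) // 2:]

print(evict_halving(list(range(10)), max_tokens=8))  # [5, 6, 7, 8, 9]
```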
```mermaid
graph TD
A[SSD: .safetensors] --"mmap(lazy=True)"--> B[MLX Lazy Arrays]
A --"Background Thread"--> P[os.pread]
P --"Force OS Page Cache"--> B
B --"FlashLLM Wrapper"--> C{Forward Pass}
subgraph "Per-Layer Loop"
C --"Layer i"--> D[mx.eval]
D --"Sync GPU"--> E[mx.synchronize]
E --"MADV_FREE"--> G[Release RAM]
G --"Next Layer"--> C
end
C --"Final Output"--> G2[Token]
```
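The "MADV_FREE → Release RAM" step in the diagram maps onto Python's `mmap.madvise`. A minimal sketch (the real code advises the mapping that backs the lazy arrays; `MADV_DONTNEED` is used as a fallback where `MADV_FREE` is unavailable):

```python
import mmap, os, tempfile

def release_pages(path, begin, end):
    """Advise the kernel that a byte range of a weight file can be
    reclaimed. The start offset must be page-aligned, so we align
    it down before calling madvise."""
    advice = getattr(mmap, "MADV_FREE", mmap.MADV_DONTNEED)
    page = mmap.PAGESIZE
    begin = (begin // page) * page          # align down to a page
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        mm.madvise(advice, begin, end - begin)
        mm.close()

# Demo on a scratch file standing in for a .safetensors shard
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * mmap.PAGESIZE)
    scratch = f.name
release_pages(scratch, 0, mmap.PAGESIZE)
os.unlink(scratch)
```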
- Lazy Loading: `mlx_lm.load(path, lazy=True)` maps the entire model into the unified address space using the macOS page cache. No Metal RAM is consumed at this point.
- Async Background Prefetch: We extract exact byte offsets from `.safetensors` headers and use a background Python thread to explicitly `os.pread` pages directly into RAM. Because `os.pread` releases the Python GIL during the read, this hides SSD latency from the GPU.
- Pipelined Execution: Instead of building a unified lazy graph for the whole model (which leads to OOM), we build and evaluate a graph for exactly one layer. CPU and GPU syncs are pipelined to maximize throughput.
- Immediate Eviction: After each `mx.eval()`, we verify completion, clear the Metal cache, and issue `madvise(MADV_FREE)`. The weights for the current layer are immediately flushed from unified RAM by macOS.
- Efficiency Features: Since `FlashLLM` is a drop-in proxy, you get all native `mlx-lm` features like quantized KV cache (`kv_bits`) and sliding windows (`max_kv_size`) for free.
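The offset-extraction and prefetch path can be sketched with the stdlib alone. The 8-byte length prefix and JSON header below follow the published safetensors layout; the helper names are ours, not the library's:

```python
import json, os, struct, tempfile

def safetensors_offsets(path):
    """Return {tensor_name: (abs_begin, abs_end)} byte ranges by
    reading only the JSON header of a .safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        meta = json.loads(f.read(header_len))
    data_start = 8 + header_len
    return {
        name: (data_start + t["data_offsets"][0],
               data_start + t["data_offsets"][1])
        for name, t in meta.items() if name != "__metadata__"
    }

def prefetch(path, ranges):
    """Warm the OS page cache with os.pread. Run on a background
    thread: pread releases the GIL while the kernel reads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for begin, end in ranges:
            os.pread(fd, end - begin, begin)
    finally:
        os.close(fd)

# Demo with a tiny hand-built file in the safetensors layout
header = json.dumps(
    {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
).encode()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<Q", len(header)) + header + b"\0" * 8)
    path = f.name
offs = safetensors_offsets(path)
prefetch(path, offs.values())
print(offs["w"])  # absolute (begin, end) of tensor "w"
os.unlink(path)
```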
```mermaid
graph TB
subgraph UI["Execution Interface"]
CL["Python Script / CLI"]
GS["FlashGenerationLoop"]
end
subgraph CORE["mlx-flash"]
FM["FlashManager"]
FLLM["FlashLLM Wrapper\n(Duck-Typed Layer Interceptor)"]
BW["BackgroundPrefetcher\n(Thread drops GIL)"]
PC["page_cache.py\nmadvise(MADV_FREE)"]
end
subgraph METAL["Metal Runtime"]
LA["Lazy Arrays\n(mmap-backed)"]
EV["mx.eval()"]
end
CL --> GS
GS --> FM
FM --"lazy=True"--> LA
GS --"forward"--> FLLM
FLLM --"enqueue chunk"--> BW
BW --"os.pread"--> LA
FLLM --> EV
EV --> CL_C
CL_C --> PC
```
```mermaid
flowchart LR
subgraph RT["Router Pass (always hot)"]
TOK["Token batch"] --> RW["Router weights"]
RW --> TK["Top-K Experts"]
end
subgraph IO["Parallel Expert I/O"]
TK --> P0["madvise\nExpert 0"]
TK --> P1["madvise\nExpert 1"]
end
subgraph GPU["GPU Compute"]
P0 & P1 --> COM["Sync Combine"]
end
COM --> OUT["Output"]
```
mlx-flash is engineered for unbreakable reliability on limited hardware. While standard MLX may crash with Insufficient Memory as the context grows, mlx-flash maintains a rigid, deterministic memory footprint.
- Deterministic RAM: By forcing a synchronization and cache clearing after every layer, we ensure that a 70B model uses no more peak RAM than a 7B model.
- Page-Cache Resilience: We leverage the macOS kernel's virtual memory system to "page" context from the SSD. If the SSD is slow, the model simply waits; it never crashes.
- Bit-for-Bit Parity: There is zero "Accuracy Tax." You are running the original high-precision weights with the original sampling logic.
Benchmarked on M4 MacBook Air 16 GB (Internal NVMe). Synthetic 1.5B Llama-style model with 0.1 GB RAM Budget (Extreme Stress Test).
| Context Length | Generation Speed | SSD KV Cache | Peak RAM Overhead |
|---|---|---|---|
| 512 Tokens | 64.1 T/s | 16 MB | ~24 MB |
| 8,192 Tokens | 26.1 T/s | 256 MB | ~256 MB |
| 32,768 Tokens | 9.8 T/s | 1.02 GB | ~816 MB |
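As a back-of-envelope check on the table, a full-precision KV cache grows linearly with context. The config below is a hypothetical 1.5B-class model, not the exact benchmark model:

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elt=2):
    """K and V tensors, per layer, per KV head, per position,
    at the given precision (2 bytes for FP16/BF16)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * ctx_len

# Hypothetical config: 28 layers, 2 KV heads (GQA), head_dim 128, FP16
size = kv_cache_bytes(32_768, n_layers=28, n_kv_heads=2, head_dim=128)
print(f"{size / 1e9:.2f} GB")  # 0.94 GB — same order as the table
```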
Tip
Scaling Limit: Decode speed is not constant — it falls as the context grows, since every new token attends over an ever-larger SSD-backed KV cache (see the table: ~64 T/s at 512 tokens down to ~9.8 T/s at 32K).
Benchmarked on M4 MacBook Air 16 GB with internal NVMe. With v0.2 Async I/O Prefetching enabled, the OS pulls data from the SSD in the background, keeping the GPU constantly saturated.
| Model | File Size | Flash Weight RAM | + KV Cache (2K ctx) | Mode | Tok/s (M4 Air) |
|---|---|---|---|---|---|
| Qwen2.5-3B | 1.9 GB | ~0.3 GB | ~0.2 GB | RAM | 60-80 |
| Nemotron-30B | 17.8 GB | ~0.5 GB | ~1.8 GB | Stream | ~0.7 |
| Nemotron-30B | 17.8 GB | ~15 GB | ~1.8 GB | RAM | 12.4 |
| Llama-3-70B | 40 GB | ~0.8 GB | ~3.2 GB | Stream | ~0.05 |
Note
Tokens per second benchmarks use max_kv_size=2048. Unlimited context lengths will consume more RAM as the KV cache grows.
- Limited Context RAM (v0.1–v0.3.1): In earlier versions, the KV cache grew in RAM. Use `max_kv_size` (sliding window) or `kv_bits` (quantization) to mitigate. Now fixed with Disk KV Cache offloading (v0.3.2+).
Flash Mode uses the model's weights as-is with no additional quantization. Output is numerically equivalent to standard `mlx-lm` inference when using the same model, sampling parameters, and random seed.
Caveat: Per-layer `mx.eval()` may occasionally produce microscopic differences in floating-point results compared to fused multi-layer evaluation, due to the specific order of floating-point operations. In practice, generated text is perceptually identical.
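The caveat comes down to non-associative floating-point addition; a two-line demonstration:

```python
# Reordering a sum changes the rounding, hence the low-order bits:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)  # False 0.6000000000000001 0.6
```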
```shell
pip install mlx-flash
```

```python
from mlx_flash import FlashConfig
from mlx_flash.integration.lmstudio import apply_flash_patch
import mlx_lm

# 1. Enable Flash Mode system-wide for mlx_lm
# ram_budget_gb is your "Master Dial":
#   - 1.5: Safe (works even if RAM is full)
#   - 8.0: Performance (requires free RAM)
apply_flash_patch(FlashConfig(enabled=True, ram_budget_gb=2.0))

# 2. Load any model (e.g., Llama-3-70B on 16 GB RAM)
model, tokenizer = mlx_lm.load("mlx-community/Meta-Llama-3-70B-Instruct-4bit")

# 3. Generate — weights will stream automatically
for response in mlx_lm.stream_generate(model, tokenizer, "Tell me a joke"):
    print(response.text, end="", flush=True)
```

You can use mlx-flash today to patch mlx-lm scripts or backends.
The ☑ Enable Flash Weight Streaming checkbox is a proposed feature for the official LM Studio MLX engine. See docs/lmstudio_integration.md for the technical blueprint. The checkbox is not yet available in the public release; PRs to lmstudio-ai/mlx-engine are welcome.
Performance varies significantly depending on your SSD speed and unified memory bandwidth:
- M1/M2/M3 Air (Internal NVMe): Expect 0.5 - 1.0 tok/s on 30B models in Streaming mode.
- M4 Pro/Max: High memory bandwidth significantly improves layer transition speeds.
- External Drives: Running models via Thunderbolt RAIDs is viable (~0.1 tok/s); standard USB-C Gen 2 (10Gbps) may be bottlenecked by I/O.
- The Matrix Tool: Always use `scripts/run_matrix_experiments.py` to verify your specific I/O floor.
Detailed milestones are available in ROADMAP.md.
- v0.2.x: Stability, bug fixes, and PyPI release polish.
- v0.3.0: Parallel Expert Streaming for MoE models (Mixtral/DeepSeek).
- v0.3.1: Async background I/O prefetcher.
- v0.3.2: ✅ Disk KV Cache Offloading — production-quality infinite context without OOM.
- v0.4.0: Advanced streaming optimizations and performance tuning.
mlx-flash includes a real-time terminal dashboard to visualize Metal RAM usage and layer-by-layer progress.
```shell
# In terminal 1: run your model (e.g., Llama-3-70B)
python examples/quick_start.py --model /path/to/model --flash

# In terminal 2: watch memory and progress in real-time
flash-monitor
```

Add to any Modelfile for Ollama-compatible frontends:
```
FROM /path/to/Llama-3.1-70B-Instruct-MLX

# Enable Flash Weight Streaming
FLASH true
FLASH_RAM_GB 10
```

- 🔬 Read our Experimental Findings: Why standard MLX struggles with models larger than RAM.
- 🏗️ Architecture Overview: Deep dive into synchronous evaluation and Metal cache clearing.
- Fork the repository.
- Implement your changes.
- Verify with `pytest tests/`.
- Open a Pull Request.
Brought to you by ⚡ Flash-Mode Contributors. MIT licensed.