Skip to content

matt-k-wong/mlx-flash

Repository files navigation

mlx-flash ⚡

Flash Weight Streaming for MLX — run models larger than your RAM on Apple Silicon. 30B on 16 GB, 70B+ on 32 GB+. No additional quantisation — uses the model's native precision.

Project Lineage: This implementation is inspired by Apple Research's paper LLM in a Flash (arXiv 2312.11514), which formalized the concept of using the OS page cache for efficient weight streaming. The original flash-moe project provided the first Objective-C + Metal proof of concept for this approach on Apple Silicon. This repository (mlx-flash) extends those principles to the Python-based MLX ecosystem, providing a robust, duck-typed integration layer for mlx-lm.

License: MIT Python 3.11+ MLX macOS 13+ Tests


Table of Contents

  1. Why Flash Mode?
  2. How It Works
  3. Architecture Diagrams
  4. Performance
  5. Output Quality
  6. Quick Start
  7. LM Studio Usage
  8. Modelfile Usage
  9. Technical Deep Dive
  10. Contributing

Why Flash Mode?

Model Hardware Mode Load Time Peak Weight RSS Result
Nemotron-30B (17.8 GB) 16GB MacBook Air Normal 4.1s 18+ GB (Swap) ❌ Laggy
Nemotron-30B (17.8 GB) 16GB MacBook Air Flash 0.8s ~0.5 GB ✅ Smooth

Important

Flash Mode is strictly for models that are larger than your RAM.
It allows you to run models of any size (30B, 70B, even 1T+) on base-spec Macs by streaming weights directly from your SSD.

The secret: Synchronous Layer Evaluation. Standard MLX uses "lazy graph evaluation," which attempts to build a massive graph spanning all layers before execution. This causes Metal to attempt allocating all weights at once, leading to OOM.

mlx-flash bypasses this by:

  1. Loading weights as lazy mmap-backed arrays via mlx_lm.load(path, lazy=True).
  2. Intercepting the forward pass to execute one layer at a time.
  3. Forcing materialization via mx.eval() + mx.synchronize() after each layer.

The Tradeoff: Quality vs. Speed

mlx-flash is a no-compromise quality engine. Unlike other low-RAM solutions, we do not use lossy compression (like 4-bit or 2-bit quantization) to shrink the model into your RAM. Instead, we trade Time for Capacity.

1. Zero Accuracy Loss 🏆

  • Bit-for-Bit Identical: Weights are streamed in their native precision (F16/BF16/F32). The tokens generated are identical to running the model on a $6,000 Mac Studio with 192GB of RAM.
  • High-Precision KV Cache: DiskKVCache stores context on your SSD at full precision, avoiding the logic degradation common in 4-bit KV cache implementations.
  • Deterministic Sampling: Supports the full mlx-lm sampling suite with perfect reproducibility.

2. The Master Dial: ram_budget_gb 📈

Flash Mode is a hybrid engine. You control exactly how much RAM is traded for speed.

  • Safety Profile (1.0 - 2.0 GB): "Slow but Invincible." Forces strict weight streaming. Recommended for 30B+ models on limited RAM.
  • Balanced Profile (4.0 - 8.0 GB): "Smart Mode." Caches weights in RAM when possible (10x speedup), but automatically triggers streaming if memory pressure spikes.
  • Performance Profile (12.0+ GB): Keeps most of the model in RAM. Maximum speed, only uses Flash Mode for the "overflow."

Tip

Use the Experimental Matrix: Run python scripts/run_matrix_experiments.py to find the exact "Performance Cliff" for your specific model and hardware.

Tip

If your model is "Killed" (Exit code 137): This means your ram_budget_gb + OS overhead exceeded your physical RAM. Lower the budget by 1.0 GB and restart.

3. Infinite Context (Disk KV Cache) ♾️

  • Bottomless Window: Your prompt size is limited only by your SSD free space, not your RAM.
  • Eviction Mode: To prevent SSD bloat, we use a Halving Eviction policy by default. When the max_tokens limit is reached, it keeps the most recent 50% of the context. (Set max_tokens=None for absolute perfect recall).

How It Works

graph TD
    A[SSD: .safetensors] --"mmap(lazy=True)"--> B[MLX Lazy Arrays]
    A --"Background Thread"--> P[os.pread]
    P --"Force OS Page Cache"--> B
    B --"FlashLLM Wrapper"--> C{Forward Pass}
    subgraph "Per-Layer Loop"
        C --"Layer i"--> D[mx.eval]
        D --"Sync GPU"--> E[mx.synchronize]
        E --"MADV_FREE"--> G[Release RAM]
        G --"Next Layer"--> C
    end
    C --"Final Output"--> G2[Token]
Loading

The Mechanism

  1. Lazy Loading: mlx_lm.load(path, lazy=True) maps the entire model into the unified address space using the macOS page cache. No Metal RAM is consumed at this point.
  2. Async Background Prefetch: We extract exact byte offsets from .safetensors headers and use a background Python thread to explicitly os.pread pages directly into RAM. This bypasses the Python GIL and hides SSD latency from the GPU.
  3. Pipelined Execution: Instead of building a unified lazy graph for the whole model (which leads to OOM), we build and evaluate a graph for exactly one layer. CPU and GPU syncs are pipelined to maximize throughput.
  4. Immediate Eviction: After each mx.eval(), we verify completion, clear the Metal cache, and issue madvise(MADV_FREE). The weights for the current layer are immediately flushed from unified RAM by macOS.
  5. Efficiency Features: Since FlashLLM is a drop-in proxy, you get all native mlx-lm features like quantized KV cache (kv_bits) and sliding windows (max_kv_size) for free.

Architecture Diagrams

1 · System Architecture (Current)

graph TB
    subgraph UI["Execution Interface"]
        CL["Python Script / CLI"]
        GS["FlashGenerationLoop"]
    end

    subgraph CORE["mlx-flash"]
        FM["FlashManager"]
        FLLM["FlashLLM Wrapper\n(Duck-Typed Layer Interceptor)"]
        BW["BackgroundPrefetcher\n(Thread drops GIL)"]
        PC["page_cache.py\nmadvise(MADV_FREE)"]
    end

    subgraph METAL["Metal Runtime"]
        LA["Lazy Arrays\n(mmap-backed)"]
        EV["mx.eval()"]
    end

    CL --> GS
    GS --> FM
    FM --"lazy=True"--> LA
    GS --"forward"--> FLLM
    FLLM --"enqueue chunk"--> BW
    BW --"os.pread"--> LA
    FLLM --> EV
    EV --> CL_C
    CL_C --> PC
Loading

2 · Future Roadmap (MoE Expert Streaming)

flowchart LR
    subgraph RT["Router Pass (always hot)"]
        TOK["Token batch"] --> RW["Router weights"]
        RW --> TK["Top-K Experts"]
    end

    subgraph IO["Parallel Expert I/O"]
        TK --> P0["madvise\nExpert 0"]
        TK --> P1["madvise\nExpert 1"]
    end

    subgraph GPU["GPU Compute"]
        P0 & P1 --> COM["Sync Combine"]
    end

    COM --> OUT["Output"]
Loading

Robustness & Stability: "Slow but Invincible"

mlx-flash is engineered for unbreakable reliability on limited hardware. While standard MLX may crash with Insufficient Memory as the context grows, mlx-flash maintains a rigid, deterministic memory footprint.

  • Deterministic RAM: By forcing a synchronization and cache clearing after every layer, we ensure that a 70B model uses no more peak RAM than a 7B model.
  • Page-Cache Resilience: We leverage the macOS kernel's virtual memory system to "page" context from the SSD. If the SSD is slow, the model simply waits; it never crashes.
  • Bit-for-Bit Parity: There is zero "Accuracy Tax." You are running the original high-precision weights with the original sampling logic.

Benchmarking (v0.3.1)

Benchmarked on M4 MacBook Air 16 GB (Internal NVMe). Synthetic 1.5B Llama-style model with 0.1 GB RAM Budget (Extreme Stress Test).

Context Length Generation Speed SSD KV Cache Peak RAM Overhead
512 Tokens 64.1 T/s 16 MB ~24 MB
8,192 Tokens 26.1 T/s 256 MB ~256 MB
32,768 Tokens 9.8 T/s 1.02 GB ~816 MB

Tip

Scaling Limit: Decode speed is $O(N^2)$ due to attention math. At 32k tokens, the CPU/GPU must track ~1GB of context data per layer. Use a Thunderbolt RAID 0 array to minimize the I/O latency floor.


Performance

Benchmarked on M4 MacBook Air 16 GB with internal NVMe. With v0.2 Async I/O Prefetching enabled, the OS pulls data from the SSD in the background, keeping the GPU constantly saturated.

Model File Size Flash Weight RAM + KV Cache (2K ctx) Mode Tok/s (M4 Air)
Qwen2.5-3B 1.9 GB ~0.3 GB ~0.2 GB RAM 60-80
Nemotron-30B 17.8 GB ~0.5 GB ~1.8 GB Stream ~0.7
Nemotron-30B 17.8 GB ~15 GB ~1.8 GB RAM 12.4
Llama-3-70B 40 GB ~0.8 GB ~3.2 GB Stream ~0.05

Note

Tokens per second benchmarks use max_kv_size=2048. Unlimited context lengths will consume more RAM as the KV cache grows.


Known Issues (v0.1.1+)

  • Limited Context RAM (v0.1–v0.3.1): In earlier versions, the KV cache grew in RAM. Use max_kv_size (sliding window) or kv_bits (quantization) to mitigate. Now fixed with Disk KV Cache offloading (v0.3.2+).

Output Quality

No additional quantisation loss

Flash Mode uses the model's weights as-is with no additional quantisation. Output is numerically equivalent to standard mlx-lm inference when using the same model, sampling parameters, and random seed.

Caveat: Per-layer mx.eval() may occasionally produce microscopic differences in floating-point results compared to fused multi-layer evaluation due to the specific order of floating-point operations. In practice, generated text is perceptually identical.

Quick Start

1. Install via pip

pip install mlx-flash

2. Using via Python

from mlx_flash import FlashConfig
from mlx_flash.integration.lmstudio import apply_flash_patch
import mlx_lm

# 1. Enable Flash Mode system-wide for mlx_lm
# ram_budget_gb is your "Master Dial":
# - 1.5: Safe (Works even if RAM is full)
# - 8.0: Performance (Requires free RAM)
apply_flash_patch(FlashConfig(enabled=True, ram_budget_gb=2.0))

# 2. Load any model (e.g., Llama-3-70B on 16GB RAM)
model, tokenizer = mlx_lm.load("mlx-community/Meta-Llama-3-70B-Instruct-4bit")

# 3. Generate — weights will stream automatically
for response in mlx_lm.stream_generate(model, tokenizer, "Tell me a joke"):
    print(response.text, end="", flush=True)

LM Studio Usage

Python Integration (Current)

You can use mlx-flash today to patch mlx-lm scripts or backends.

LM Studio UI (Roadmap)

The ☑ Enable Flash Weight Streaming checkbox is a proposed feature for the official LM Studio MLX engine. See docs/lmstudio_integration.md for the technical blueprint. The checkbox is not yet available in the public release; PRs to lmstudio-ai/mlx-engine are welcome.


Benchmark Your Hardware

Performance varies significantly depending on your SSD speed and unified memory bandwidth:

  • M1/M2/M3 Air (Internal NVMe): Expect 0.5 - 1.0 tok/s on 30B models in Streaming mode.
  • M4 Pro/Max: High memory bandwidth significantly improves layer transition speeds.
  • External Drives: Running models via Thunderbolt RAIDs is viable (~0.1 tok/s); standard USB-C Gen 2 (10Gbps) may be bottlenecked by I/O.
  • The Matrix Tool: Always use scripts/run_matrix_experiments.py to verify your specific I/O floor.

Roadmap

Detailed milestones are available in ROADMAP.md.

  • v0.2.x: Stability, bug fixes, and PyPI release polish.
  • v0.3.0: Parallel Expert Streaming for MoE models (Mixtral/DeepSeek).
  • v0.3.1: Async background I/O prefetcher.
  • v0.3.2: ✅ Disk KV Cache Offloading — production-quality infinite context without OOM.
  • v0.4.0: Advanced streaming optimizations and performance tuning.

Live Memory Monitor

mlx-flash includes a real-time terminal dashboard to visualize Metal RAM usage and layer-by-layer progress.

# In terminal 1: run your model (e.g., Llama-3-70B)
python examples/quick_start.py --model /path/to/model --flash

# In terminal 2: watch memory and progress in real-time
flash-monitor

Modelfile Usage

Add to any Modelfile for Ollama-compatible frontends:

FROM /path/to/Llama-3.1-70B-Instruct-MLX

# Enable Flash Weight Streaming
FLASH true
FLASH_RAM_GB 10

Technical Deep Dive


Contributing

  1. Fork the repository.
  2. Implement your changes.
  3. Verify with pytest tests/.
  4. Open a Pull Request.

Brought to you by ⚡ Flash-Mode Contributors. MIT licensed.