Fork of ml-explore/mlx-lm focused on making Qwen 3.6 models (35B MoE, 27B) work reliably on M-series Macs for long-running agent workloads like Hermes Agent.
Everything operates on the key-value cache — no model weight changes, no new architectures. The problem: a 35B MoE model on 64GB M1 Max uses ~25-30GB for weights, leaving ~34GB for KV cache. A single 45K-token conversation fills 6-8GB. Long agent sessions run out of RAM, hit swap, and everything slows down.
This fork solves that from four angles.
```bash
pip install -e .
```

```bash
mlx_lm.server \
  --model ~/.cache/huggingface/hub/Qwen3.6-35B-A3B-UD-MLX-4bit \
  --host 127.0.0.1 --port 8000 \
  --chat-template-args '{"enable_thinking": false}' \
  --kv-bits "(8, 4)" --kv-group-size "(64, 32)" \
  --block-ssd-cache-dir ~/.cache/mlx-lm/block_ssd_cache \
  --block-ssd-cache-max-size 50 \
  --prompt-cache-size 10
```

At FP16, each KV cache layer costs ~384MB at 128K context. With 10 KV layers on Qwen3.6-35B, that's ~3.8GB just for the cache. Add model weights and you're at 28-30GB — leaving little headroom before swap kicks in.
Asymmetric quantization: keys at higher precision (K8), values lower (V4). Keys participate in the attention dot product — noise there directly shifts which tokens the model attends to. Values are softmax-weighted sums — a low-pass filter that tolerates more compression.
| | FP16 | K8+V4 | K8+V4 + boundary |
|---|---|---|---|
| KV cache (128K) | 3.8 GB | 2.7 GB | 2.76 GB |
| Model weights | 20 GB | 20 GB | 20 GB |
| Peak RAM | ~30 GB | ~28 GB | ~28 GB |
| Headroom (64GB) | ~34 GB | ~36 GB | ~36 GB |
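To see what the asymmetric scheme means at the tensor level, here is a minimal sketch using raw `mlx.core` quantization calls. The shapes are illustrative (not Qwen3.6's real KV dimensions) and it bypasses the cache classes this fork actually patches:

```python
import mlx.core as mx

tokens, kv_dim = 4096, 768          # assumed: n_kv_heads * head_dim = 768
keys = mx.random.normal((tokens, kv_dim))
values = mx.random.normal((tokens, kv_dim))

# Keys: 8-bit, group size 64 -- they feed the attention dot product directly.
k_q, k_scales, k_biases = mx.quantize(keys, group_size=64, bits=8)
# Values: 4-bit, group size 32 -- softmax-weighted sums tolerate more noise.
v_q, v_scales, v_biases = mx.quantize(values, group_size=32, bits=4)

k_hat = mx.dequantize(k_q, k_scales, k_biases, group_size=64, bits=8)
v_hat = mx.dequantize(v_q, v_scales, v_biases, group_size=32, bits=4)

print("mean |key error|:  ", mx.abs(keys - k_hat).mean().item())
print("mean |value error|:", mx.abs(values - v_hat).mean().item())
```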
--kv-bits "(8, 4)" --kv-group-size "(64, 32)"Boundary layers (first 2 + last 2 KV layers) stay at K8+V8 because they handle input embedding projection and final logit transformation where quantization noise compounds. Middle layers operate on already-abstracted representations where V4 is fine.
--kv-boundary-layers 2 --kv-boundary-bits "(8,8)" # default
--kv-boundary-layers 0 # disableReference: KIVI (NeurIPS 2024).
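In code, the boundary rule is just an index check. A hypothetical sketch (`kv_bits_for_layer` is an illustrative name; the fork wires this logic through the flags above):

```python
def kv_bits_for_layer(layer_idx: int, n_kv_layers: int = 10,
                      boundary: int = 2) -> tuple[int, int]:
    """Return (key_bits, value_bits) for one KV layer."""
    if layer_idx < boundary or layer_idx >= n_kv_layers - boundary:
        return (8, 8)   # boundary layers: K8 + V8
    return (8, 4)       # middle layers:   K8 + V4

print([kv_bits_for_layer(i) for i in range(10)])
# [(8, 8), (8, 8), (8, 4), (8, 4), (8, 4), (8, 4), (8, 4), (8, 4), (8, 8), (8, 8)]
```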
TurboQuant compresses KV cache to ~3 bits using Walsh-Hadamard transforms. For Qwen 3.6 hybrid architectures (full_attention_interval=4), only 10 of 40 layers have KV caches. At 128K context the total KV cache is 1-3GB. Moving from K8+V4 to TurboQuant 3-bit saves ~80MB (35B-A3B) or ~256MB (27B) — not enough to justify the WHT compute overhead and implementation complexity. K8+V4 with boundary protection is the sweet spot for these models.
The RAM prompt cache (LRUPromptCache) holds KV entries in memory. Restart
the server, and the entire cache is gone. Every conversation starts from
scratch — full 85-second prefill for a 29K-token system prompt.
Two-tier cache with disk persistence:
```
fetch_nearest_cache(model, tokens)
  +-- RAM tier (PromptTrie + hot cache)          → hit: deepcopy, 30,000 tok/s
  +-- SSD tier (BlockSSDCache, 256-token blocks) → hit: deserialize + promote
```
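A hypothetical sketch of that lookup order; the `longest_prefix`, `load_chain`, `blocks_to_cache`, and `insert` methods stand in for the real PromptTrie/BlockSSDCache internals:

```python
import copy

def fetch_nearest_cache(ram_tier, ssd_tier, model_key, tokens):
    """Two-tier lookup: RAM first, then SSD, then miss."""
    # 1. RAM tier: longest-prefix match in the PromptTrie / hot cache.
    hit = ram_tier.longest_prefix(model_key, tokens)
    if hit is not None:
        # Deep copy so this request can't mutate the cached KV entry.
        return copy.deepcopy(hit)

    # 2. SSD tier: walk the 256-token block chain, deserialize what matches,
    #    and promote the result into the RAM tier for the next request.
    blocks = ssd_tier.load_chain(model_key, tokens)
    if blocks:
        cache = ssd_tier.blocks_to_cache(blocks)
        ram_tier.insert(model_key, tokens[: 256 * len(blocks)], cache)
        return cache

    # 3. Miss on both tiers: the caller prefills from scratch.
    return None
```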
Each 256-token block is a safetensors file on disk, chained via hash:
`SHA256(parent_hash || model_key || block_tokens)`. On startup, the 64
most-recent blocks are pre-loaded into an in-memory LRU hot cache (~52ms).
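The chaining itself is plain `hashlib`. A sketch (the exact byte encoding of the hash input is an assumption; only full 256-token blocks are chained):

```python
import hashlib

BLOCK_SIZE = 256  # tokens per block, as used by BlockSSDCache

def chain_hashes(model_key: str, tokens: list[int], parent: str = "root") -> list[str]:
    """SHA256(parent_hash || model_key || block_tokens) for each full block."""
    hashes = []
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start : start + BLOCK_SIZE]
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(model_key.encode())
        h.update(b",".join(str(t).encode() for t in block))
        parent = h.hexdigest()          # this block's hash is the next parent
        hashes.append(parent)
    return hashes
```

Because each hash folds in its parent, a change in an early block changes every hash downstream, so a drifted prompt shows up as a break at the first divergent block rather than a silent partial hit.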
```bash
--block-ssd-cache-dir ~/.cache/mlx-lm/block_ssd_cache
--block-ssd-cache-max-size 50   # GB
```

The initial implementation had 4 performance bugs that caused prefill to regress from ~650 tok/s to 4.5 tok/s. All fixed:
| Bug | Root Cause | Fix |
|---|---|---|
| No hot cache | Every SSD hit read from disk | LRU `_hot_cache` in RAM |
| Per-block `mx.eval` | 120 GPU sync points during prefill | Batch eval after all blocks loaded |
| Deep copy overhead | `copy.deepcopy()` per block per layer | Direct `mx.load()` without intermediary |
| Eviction thrashing | `_maybe_evict()` called per block save | Deferred to post-request flush |
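The batch-eval fix is the one worth internalizing if you touch this code. A sketch, assuming each block is a safetensors file whose path was already resolved from the hash chain:

```python
import mlx.core as mx

def load_blocks(paths: list[str]) -> list[dict[str, mx.array]]:
    """Load KV blocks from safetensors files with one GPU sync, not one per block."""
    blocks = [mx.load(p) for p in paths]   # builds lazy arrays, no sync yet

    # Slow version (the bug): mx.eval(block) inside the loop -> one GPU sync
    # point per block, ~120 syncs for a long prompt.

    # Fast version (the fix): a single eval over everything after loading.
    mx.eval([arr for block in blocks for arr in block.values()])
    return blocks
```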
Caching with asymmetric KV quantization is tricky. Several upstream issues caused silent cache corruption or missed reuse:
- Prompt-tag bug — `max_tokens` truncation injected wrong tags into cached prefixes, poisoning subsequent requests (5-part fix across `_serve_single`, `_tokenize`, and segment handling)
- ArraysCache trim — multimodal models use `ArraysCache` for non-KV layers, but `trim()` wasn't implemented, causing crashes on prefix trim
- pop_prefixes checkpoints — intermediate system/user checkpoints were lost during prefix operations, breaking the trie invariant
- max_tokens default — missing `max_tokens` in the API request caused silent response truncation
These fixes ensure that cache hits return correct results, not just fast ones.
Every request logs a single PERF line with full cache health:
```
PERF: prompt_tps=2330.4 prompt_tok=29624 pref_tok=3256 prefill=12.71s block_hit=103 block_write=0 chain_break=0
```
| Field | Meaning |
|---|---|
| `pref_tok` | Tokens actually recomputed (not served from cache) |
| `block_hit` | Blocks loaded from SSD this request |
| `block_write` | New blocks written to SSD |
| `chain_break` | 1 = block-0 prefix mismatch (prompt drifted) |
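If you'd rather post-process these lines than grep them, a small parser is enough. A sketch:

```python
import re

PERF_RE = re.compile(r"(\w+)=([0-9.]+)")

def parse_perf(line: str) -> dict[str, float]:
    """Turn one PERF log line into a dict, e.g. for plotting cache health."""
    return {k: float(v) for k, v in PERF_RE.findall(line.split("PERF:", 1)[1])}

line = ("PERF: prompt_tps=2330.4 prompt_tok=29624 pref_tok=3256 prefill=12.71s "
        "block_hit=103 block_write=0 chain_break=0")
stats = parse_perf(line)
assert stats["chain_break"] == 0 and stats["pref_tok"] == 3256
```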
```bash
# Find cache regressions
grep "PERF:" server.log | grep "chain_break=1"

# Find requests with no cache reuse
grep "PERF:" server.log | grep "pref_tok=[0-9]\{4,\}"
```

Hardware: M1 Max 64GB | Model: Qwen3.6-35B-A3B-UD-MLX-4bit
| | prefill | tok/s | cached |
|---|---|---|---|
| Cold (no SSD) | 85.01s | 348 | 0% |
| SSD hit (Run 2) | 12.71s | 2,330 | 89% |
| Within-session | 0.90s | 30,101 | 99.9% |
6.7x cross-session prefill speedup. 538x within-session.
| prompt_tok | pref_tok | cached | chain_break | prefill |
|---|---|---|---|---|
| 29,629 | 3,261 | 89% | 0 | 12.78s |
| 33,019 | 3,360 | 90% | 0 | 13.22s |
| 39,943 | 3,318 | 92% | 0 | 15.01s |
| 47,165 | 3,672 | 92% | 0 | 17.83s |
| 63,297 | 302 | 99.5% | 0 | 2.20s |
| 75,641 | 1,143 | 98.5% | 0 | 7.37s |
Prefill stays flat (~13-18s) despite prompt growing to 47K tokens.
chain_break=0 on every request — prompt byte-stable.
Full data: docs/SSD_CACHE_BENCHMARK_FIX_ABC.md
These are the flags that make this fork worth using over upstream. Copy-paste for a production Hermes agent setup on M1 Max 64GB:
```bash
mlx_lm.server \
  --model ~/.cache/huggingface/hub/Qwen3.6-35B-A3B-UD-MLX-4bit \
  --host 127.0.0.1 --port 8000 \
  --chat-template-args '{"enable_thinking": false}' \
  --kv-bits "(8, 4)" --kv-group-size "(64, 32)" \
  --kv-boundary-layers 2 --kv-boundary-bits "(8, 8)" \
  --block-ssd-cache-dir ~/.cache/mlx-lm/block_ssd_cache \
  --block-ssd-cache-max-size 50 \
  --prompt-cache-size 10
```

| Flag | Value | What it does | Why you want it |
|---|---|---|---|
| `--kv-bits` | `"(8, 4)"` | Keys at 8-bit, values at 4-bit | ~30% KV memory savings. Keys need precision (attention dot product), values tolerate compression (softmax-weighted sums). |
| `--kv-group-size` | `"(64, 32)"` | Separate group sizes for K and V | Smaller groups for V (32) compensate for the lower bit-width. Keys use 64 for speed. |
| `--kv-boundary-layers` | `2` | Protect first 2 + last 2 KV layers at K8+V8 | Boundary layers handle input projection and logit output — most sensitive to quantization noise. |
| `--block-ssd-cache-dir` | — | Persist KV blocks to SSD | Survives server restarts. Without this, every restart recomputes the full system prompt from scratch. |
| `--block-ssd-cache-max-size` | `50` | SSD cache limit in GB | A 29K-token system prompt uses ~40GB in blocks. 50GB gives headroom for a few sessions. |
| `--prompt-cache-size` | `10` | Keep N most-recent KV caches in RAM | Higher values use more RAM but reduce SSD round-trips. 10 is enough for typical agent sessions. |
DFlash models are distilled/optimized variants of Qwen 3.5 and 3.6, available in 27B and 35B-A3B sizes. The SSD caching infrastructure in this fork makes iterative testing on DFlash variants practical — long benchmark sessions survive server restarts without losing prefix cache.
Models in cache:

- `Qwen3.5-27B-DFlash`
- `Qwen3.5-35B-A3B-DFlash`
- `Qwen3.6-35B-A3B-DFlash`
- ml-explore/mlx-lm — upstream
- KIVI (Liu et al., NeurIPS 2024) — asymmetric quantization
- TurboQuant — boundary-aware compression
- jundot/omlx — `paged_ssd_cache.py` reference
- Goose #4610 — timestamps preventing LLM caching
- EPIC (arXiv:2410.15332) — position-independent caching
- Prompt Cache (arXiv:2311.04934) — modular attention reuse
- CacheBlend (arXiv:2405.16444) — selective KV recomputation
