
Enable prefix cache reuse for hybrid models (Qwen3.5, Nemotron-H, Mamba)#1006

Closed
skcadri wants to merge 3 commits into ml-explore:main from skcadri:fix/prefix-cache-non-trimmable-hybrid

Conversation


@skcadri commented Mar 15, 2026

Summary

Fixes prefix cache reuse for non-sliding-window hybrid models — the remaining case from #980 that #999 did not address.

  • Hybrid models (Qwen3.5, Nemotron-H, Jamba, etc.) use both KVCache (attention) and ArraysCache (SSM/recurrent) per layer
  • ArraysCache stores compressed recurrent state that cannot be trimmed — you can't remove N tokens from a hidden state
  • can_trim_prompt_cache() returns False → the "longer cache trim" path in fetch_nearest_cache() is blocked → 0% cache reuse on divergent-suffix requests (same system prompt, different user messages)

This PR extends the existing prompt checkpoint mechanism to save the cache at the last-message boundary during prefill. Subsequent requests with the same prefix find it via the "shorter cache" path — no trimming needed.
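The all-or-nothing nature of the trim check can be sketched as follows. This is an illustrative toy (the stub classes and `is_trimmable` method are stand-ins, not mlx_lm's actual cache classes), showing why a single SSM layer blocks the trim path for the entire model:

```python
class KVCacheStub:
    """Attention KV cache: per-token entries, so tokens can be dropped."""
    def is_trimmable(self):
        return True

class ArraysCacheStub:
    """SSM/recurrent state: one compressed hidden state with no
    per-token structure, so removing N tokens is impossible."""
    def is_trimmable(self):
        return False

def can_trim_prompt_cache(cache_layers):
    # Mirrors the behavior described above: one non-trimmable layer
    # makes the whole prompt cache non-trimmable.
    return all(layer.is_trimmable() for layer in cache_layers)

# A hybrid model interleaves both cache types across layers.
hybrid = [KVCacheStub(), ArraysCacheStub(), KVCacheStub(), ArraysCacheStub()]
pure_attention = [KVCacheStub(), KVCacheStub(), KVCacheStub(), KVCacheStub()]

print(can_trim_prompt_cache(hybrid))          # False -> trim path blocked
print(can_trim_prompt_cache(pure_attention))  # True  -> trim path available
```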

Approach

For non-trimmable models, _compute_prompt_checkpoint now computes where the last user message begins using a sentinel substitution technique:

  1. Replace the last user message with a short dummy ("x")
  2. Tokenize both the original and sentinel-substituted message lists
  3. Compare token-by-token to find the common prefix — this is the exact boundary before the user message content

This avoids a subtle issue where apply_chat_template(messages[:-1]) produces tokens that don't form a prefix of apply_chat_template(messages) due to template-specific formatting (discovered during testing with Qwen3.5's template which requires a user message and inserts role tokens differently).
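The three steps above can be sketched like this. Note `last_message_boundary` and `toy_tokenize` are illustrative stand-ins for this comment, not the PR's actual helpers; the real code tokenizes via the model's `apply_chat_template`:

```python
def last_message_boundary(messages, tokenize):
    # 1. Replace the last message's content with a short dummy ("x").
    sentinel_messages = messages[:-1] + [
        {"role": messages[-1]["role"], "content": "x"}
    ]
    # 2. Tokenize both the original and the sentinel-substituted list.
    original = tokenize(messages)
    substituted = tokenize(sentinel_messages)
    # 3. The length of the common token prefix is the exact boundary
    #    just before the last message's content.
    boundary = 0
    for a, b in zip(original, substituted):
        if a != b:
            break
        boundary += 1
    return boundary

# Toy "chat template": wraps each message's words in role markers.
def toy_tokenize(messages):
    toks = []
    for m in messages:
        toks += [f"<{m['role']}>"] + m["content"].split() + [f"</{m['role']}>"]
    return toks

msgs = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "translate this page"},
]
# 7 system tokens + "<user>" are shared; "translate" vs "x" diverge.
print(last_message_boundary(msgs, toy_tokenize))  # 8
```

Because both tokenizations go through the full template with a user message present, template quirks (required user messages, role-token insertion) affect both sides identically and cancel out of the comparison.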

For trimmable models (pure attention), the existing <think> token checkpoint logic is completely unchanged.

Changes

mlx_lm/server.py — 1 file, +60/-10 lines:

  1. ModelProvider.load(): Cache _cache_is_trimmable flag at model load time
  2. _compute_prompt_checkpoint(): Branch on trimmability:
    • Non-trimmable → sentinel-based last-message boundary checkpoint
    • Trimmable → existing <think> token checkpoint (unchanged)
  3. Call site: Pass args to _compute_prompt_checkpoint for chat_template_kwargs access

Test Results

Unit tests

All 20 existing tests pass (python -m pytest tests/test_prompt_cache.py -v).

Qwen3.5-9B (attention + linear recurrence hybrid)

Cache composition: KVCache + ArraysCache, can_trim=False

Agentic workload (~2.7K token system prompt, 5 divergent user messages):

| Request  | Prompt Tokens | Cached | Hit % | Time   | Saved  |
|----------|---------------|--------|-------|--------|--------|
| 1 (cold) | 2764          | 0      | 0%    | 10.30s |        |
| 2 (warm) | 2755          | 2714   | 98%   | 3.07s  | +7.22s |
| 3 (warm) | 2754          | 2714   | 98%   | 3.59s  | +6.71s |
| 4 (warm) | 2766          | 2714   | 98%   | 4.27s  | +6.03s |
| 5 (warm) | 2751          | 2714   | 98%   | 3.47s  | +6.82s |

Summary: 98% cache hit, 2.86x speedup, 26.78s saved across 4 warm requests.

Multi-turn test (shared [sys, user, assistant] prefix, different follow-up):

| Request  | Prompt | Cached | Hit % | Notes                      |
|----------|--------|--------|-------|----------------------------|
| 1 (cold) | 63     | 25     | 39%   | system prefix from earlier |
| 2 (warm) | 62     | 47     | 75%   | full multi-turn prefix     |
| 3 (warm) | 62     | 47     | 75%   |                            |

NVIDIA Nemotron-H 30B-A3B (attention + Mamba SSM hybrid)

Architecture: nemotron_h — 52 layers, MEMEM*EMEMEM*... pattern. Only 6/29 cache layers trimmable (KVCache), 23/29 are ArraysCache (Mamba SSM state). can_trim=False.

Agentic workload (~2.7K token system prompt):

| Request  | Prompt Tokens | Cached | Hit % | Time  |
|----------|---------------|--------|-------|-------|
| 1 (cold) | 2769          | 0      | 0%    | 2.38s |
| 2 (warm) | 2758          | 2716   | 98%   | 0.77s |
| 3 (warm) | 2755          | 2716   | 98%   | 0.76s |

Summary: 98% cache hit, 3.1x speedup. Stock LM Studio (unpatched) shows 0% cache hit on the same workload.

Regression: Qwen1.5-0.5B (pure attention, trimmable)

Verified the trimmable model path is unaffected — cache trim still works correctly (69% hit via existing trim path).

Edge cases verified

  • Non-chat requests (completions API): No crash, checkpoint skipped
  • Single message (no prefix): Guard catches it, normal flow
  • Assistant-last message: Returns early, no checkpoint
  • Concurrent batch requests: Checkpoints saved and reused correctly
  • Templates requiring user message (Qwen3.5): Sentinel fallback handles gracefully

How It Works

First request [sys, user_1]:

  1. _compute_prompt_checkpoint finds the message boundary via sentinel comparison
  2. Prefill processes system prompt tokens, checkpoint callback saves cache in the trie
  3. Prefill continues with user message tokens, generation completes

Second request [sys, user_2]:

  1. fetch_nearest_cache finds cached [sys] via the shorter path (existing code, line ~303)
  2. Returns cached prefix + rest = user_2_tokens
  3. Only processes the new user message — cache reuse works!
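The lookup in step 1 of the second request can be sketched with a dict in place of the real trie; `fetch_nearest_cache` and `save_checkpoint` here are simplified stand-ins for illustration, not the server's actual implementation:

```python
checkpoints = {}  # token-tuple prefix -> opaque cached state

def save_checkpoint(tokens, state):
    # Called by the prefill checkpoint callback at the message boundary.
    checkpoints[tuple(tokens)] = state

def fetch_nearest_cache(prompt_tokens):
    # "Shorter cache" path: find the longest stored prefix that the
    # new prompt starts with. No trimming is ever required.
    best = ()
    for prefix in checkpoints:
        if len(prefix) > len(best) and tuple(prompt_tokens[: len(prefix)]) == prefix:
            best = prefix
    if not best:
        return None, prompt_tokens  # cold: process everything
    return checkpoints[best], prompt_tokens[len(best):]

sys_prompt = list(range(100))          # tokens 0..99: shared system prompt
save_checkpoint(sys_prompt, "cache@100")  # saved during the first prefill

# Second request: same system prompt, different user message.
state, rest = fetch_nearest_cache(sys_prompt + [500, 501, 502])
print(state, len(rest))  # cache@100 3 -> only 3 new tokens to prefill
```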

Relationship to #999

PR #999 by @Giustino98 fixed cache reuse for sliding-window models (RotatingKVCache) by enabling trim on wrapped ring buffers. This PR complements it by handling the non-sliding-window hybrid models (ArraysCache) where trimming is fundamentally impossible, using the checkpoint approach instead.

Together they resolve #980 completely.

🤖 Generated with Claude Code

@mm65x (Contributor) left a comment

the sentinel approach for finding the message boundary is clever. benchmarks look solid too.

a couple of things i noticed while reading through

```python
if not self.model_provider._cache_is_trimmable:
    # For hybrid models with non-trimmable caches (ArraysCache),
    # checkpoint at the last-message boundary so the prefix
    # can be reused via the shorter-cache path.
```
@mm65x (Contributor) commented Mar 15, 2026

_serve_single doesn't call _compute_prompt_checkpoint at all - it only runs in the batch path. is that intentional? if a hybrid model is served without batching (_is_batchable returns False), the checkpoint logic won't kick in and you'd still get 0% cache reuse on the single-serve path.

@skcadri (Author)

Good catch. This is a pre-existing limitation — _serve_single doesn't support checkpoints for trimmable models either (the think-token checkpoint only runs in the batch path). Adding checkpoint support there would require splitting the prefill in stream_generate, which is a larger change.

That said, _serve_single still benefits from checkpoints saved by prior batched requests — if a batched request already checkpointed the prefix, a subsequent single-serve request will find it via fetch_nearest_cache.

```python
if len(prefix_messages) > 0:
    chat_template_args = self.model_provider.cli_args.chat_template_args
    if args is not None and hasattr(args, 'chat_template_kwargs') and args.chat_template_kwargs:
        chat_template_args = chat_template_args.copy()
```
@mm65x (Contributor) commented Mar 15, 2026

this guard chain is a bit defensive - args is always passed from the call site now (line 905), and chat_template_kwargs is set by _parse_args. could simplify to just getattr(args, 'chat_template_kwargs', None).

@skcadri (Author)

Simplified in 82574b4 — now uses getattr(args, 'chat_template_kwargs', None).


dae commented Mar 16, 2026

I tried this with parallel requests, but it seems to trigger a bug:

```
  File "/Users/dae/Local/python/mlx/lib/python3.12/site-packages/mlx_lm/generate.py", line 1257, in _next
    batch = self._process_prompts(prompts)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dae/Local/python/mlx/lib/python3.12/site-packages/mlx_lm/generate.py", line 1124, in _process_prompts
    last_inputs = mx.array([p[-prompt_checkpoint:] for p in inputs])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Initialization encountered non-uniform length.
```

Sohaib Qadri and others added 2 commits March 16, 2026 16:30
Hybrid models like Qwen3.5 and Nemotron-H use both KVCache (attention)
and ArraysCache (SSM/recurrent) layers. Since ArraysCache stores
compressed recurrent state that cannot be trimmed, can_trim_prompt_cache()
returns False, blocking the "longer cache trim" path in
fetch_nearest_cache() and resulting in 0% cache reuse on the
divergent-suffix pattern (same system prompt, different user messages).

PR ml-explore#999 fixed this for sliding-window models (RotatingKVCache). This
commit addresses the remaining non-sliding-window hybrid models by
extending the existing prompt checkpoint mechanism to save the cache at
the last-message boundary during prefill, enabling subsequent requests
to find it via the "shorter cache" path — no trimming needed.

The boundary is found using a sentinel substitution technique: the last
user message is replaced with a short dummy, both versions are
tokenized, and the common prefix identifies the exact split point.
This avoids issues where apply_chat_template(messages[:-1]) produces
tokens that don't form a prefix of apply_chat_template(messages).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Left-pad last_inputs in _process_prompts to handle variable-length
  checkpoint offsets across batched requests (fixes ValueError with
  non-uniform length when parallel requests have different user
  message lengths)
- Simplify chat_template_kwargs guard to use getattr

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@skcadri force-pushed the fix/prefix-cache-non-trimmable-hybrid branch from ec0434b to 82574b4 on March 16, 2026 23:30

skcadri commented Mar 16, 2026

@dae Thanks for the repro — this was a real bug. When parallel requests with different user message lengths get batched, prompt_checkpoint = max(all_checkpoints) could exceed the shorter input's length, causing mx.array() to fail on ragged lists.

Fixed in 82574b4 by left-padding last_inputs in _process_prompts (same pattern as _left_pad_prompts uses for the main inputs). This is technically a pre-existing issue in generate.py but only became reachable with variable-length checkpoints.
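The failing slice and the left-pad fix can be sketched with plain lists (names and shapes here are illustrative, not the actual `_process_prompts` code, which operates on `mx.array`):

```python
PAD = 0  # assumed pad token id for illustration

def left_pad_tail(inputs, checkpoint_offset, pad_id=PAD):
    # The original code did mx.array([p[-checkpoint_offset:] for p in inputs]),
    # which produces ragged rows when the batch-global offset exceeds a
    # shorter prompt's length. Left-padding restores a rectangular batch
    # and keeps the real tokens right-aligned, matching how the main
    # prompt batch is padded.
    tails = [p[-checkpoint_offset:] for p in inputs]
    width = max(len(t) for t in tails)
    return [[pad_id] * (width - len(t)) + t for t in tails]

inputs = [
    [1, 2, 3, 4, 5, 6],  # longer prompt
    [7, 8, 9],           # shorter prompt: an offset of 5 over-reaches
]
print(left_pad_tail(inputs, checkpoint_offset=5))
# [[2, 3, 4, 5, 6], [0, 0, 7, 8, 9]]
```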

Could you try again with the latest push?


dae commented Mar 17, 2026

I don't get an error any more, but I still get strange results with concurrency. My use case is a browser ext sending a 'translate this' system prompt and then about 300 tokens. The first 'translate page' fires off about 6 parallel requests, and things perform fine as there's no cache at that point. But when I repeat it, I get output that implies some of the system prompt has been lost or corrupted. Adding --prompt-concurrency 1 to startup params makes the odd behaviour go away.

Edit: I've observed (less common) strange behaviour on subsequent runs even with --prompt-concurrency 1.

@celestial-rose

Hey! Thanks for the PR; pretty interested in these fixes!
Will this enable MLX parallel queries in LM Studio by any chance?

The checkpoint_callback used the batch-global prompt_end (max checkpoint
offset across all batched requests) to compute the checkpoint cache key.
For requests with shorter user messages, this cut into the shared prefix,
storing a key that didn't match the cache state — causing cache corruption
on subsequent requests.

Fix: store per-request checkpoint_position and only save checkpoints for
requests whose individual offset matches the batch-global prompt_end.
Requests with shorter offsets safely skip the checkpoint (they'll get
their own checkpoint when they're the longest in a future batch, or via
sequential processing).

Tested with parallel requests of varying user message lengths on
Qwen3.5-9B: all requests return correct answers with cache hits,
no corruption after repeated bursts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

skcadri commented Mar 18, 2026

@dae Great catch — found the root cause. The checkpoint_callback was using the batch-global prompt_end (the max checkpoint offset across all batched requests) to compute the checkpoint cache key. For requests with shorter user messages, this cut into the shared system prompt prefix, storing a truncated key paired with cache state that was too short. Subsequent requests matching that corrupted trie entry would resume from a wrong state — explaining the "system prompt lost or corrupted" behavior.

Fixed in c2e01bc: each request now tracks its own checkpoint_position, and the callback only saves a checkpoint when the request's individual offset matches the batch-global prompt_end. Requests with shorter offsets safely skip (they'll checkpoint in a future batch or via sequential processing).

Stress-tested with parallel bursts of varying user message lengths on Qwen3.5-9B — all requests return correct answers with cache hits across repeated bursts, no corruption.

The --prompt-concurrency 1 workaround would have masked this because with a single request per batch, the global offset always equals the per-request offset.
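The guard can be sketched as follows; `save_batch_checkpoints` and the tuple shapes are illustrative stand-ins for this comment, not the server's actual callback:

```python
def save_batch_checkpoints(requests, save):
    # requests: list of (prompt_tokens, checkpoint_position) pairs,
    # where checkpoint_position is each request's own message boundary.
    prompt_end = max(pos for _, pos in requests)  # batch-global offset
    for tokens, pos in requests:
        if pos == prompt_end:
            # Key length matches the cache state actually accumulated
            # during the shared prefill: safe to store.
            save(tuple(tokens[:pos]))
        # else: skip; a shorter request would store a key cut from
        # mid-prefix, mismatching the cache state. It checkpoints in a
        # later batch or via sequential processing instead.

saved = []
batch = [
    (list(range(120)), 100),  # boundary at 100 (long user message)
    (list(range(110)), 100),  # same system prompt, boundary at 100
    (list(range(95)), 80),    # shorter: saving at 100 would corrupt
]
save_batch_checkpoints(batch, saved.append)
print(len(saved))  # 2 -> the short request safely skipped
```

With `--prompt-concurrency 1` every batch has one request, so `prompt_end` always equals the per-request position and the buggy path was never taken, which is why the workaround masked the corruption.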

Could you try the latest push and see if the corruption is gone for your translation use case?


skcadri commented Mar 18, 2026

@celestial-rose Thanks for the interest! This PR specifically targets the mlx-lm server's prefix caching for hybrid models (Qwen3.5, Nemotron-H, Jamba, etc.). LM Studio uses its own server implementation, so it wouldn't directly benefit from this change — but if LM Studio's backend builds on mlx-lm, the cache improvements here would carry over.

For parallel queries in mlx-lm's own server: yes, this PR + the corruption fix in c2e01bc make concurrent requests with shared prefixes work correctly on hybrid models.

@Thump604

This solves a real problem. We run Nemotron-3-Super-120B-A12B (40 Mamba + 8 Attention + 40 MoE layers) on M2 Ultra and get 0% cache reuse on the current main because can_trim_prompt_cache() returns False for ArraysCache layers. The sentinel-based message boundary detection is the right approach — we hit the same issue with Qwen 3.5 templates that require a user message (naive messages[:-1] produces different tokens than the prefix of apply_chat_template(messages)).

The batch concurrency fix for checkpoint_callback using per-request checkpoint_position is important — without it, variable-length prompts in a batch produce corrupted trie entries.

Note: _compute_prompt_checkpoint signature change (args=None) will conflict with PR #1030 which modifies the same function. Whoever lands first, the other needs a rebase.


skcadri commented Mar 22, 2026

@Thump604 Great to hear this helps Nemotron-3-Super on M2 Ultra — that's a perfect test case with the mixed Mamba/Attention/MoE layer composition.

Good catch on the #1030 conflict. Will keep an eye on it and rebase if #1030 lands first (the overlap is just the _compute_prompt_checkpoint signature, so it should be a clean rebase either way).

@angeloskath (Member)

#911 and #1072 should be fixing this fully. Please provide a snippet on top of #1072 that shows the cache not being used as it should be and we can open an issue.


skcadri commented Mar 30, 2026

@angeloskath Thanks for the pointer — tested #1072 with Qwen3.5-9B (hybrid: 75% ArraysCache + 25% KVCache) and it fully handles prefix caching for non-trimmable models. Both sequential cache reuse and parallel requests with varying user message lengths work correctly, including the translation scenario that @dae reported corruption on.

Results on step-by-step-gen branch:

  • Sequential: cached=19 on system prompt reuse ✓
  • 3 parallel requests: all get cached=19
  • 6 parallel translation requests (dae's scenario): Round 2 shows cached=150-169 across all requests, no corruption ✓

Our PR is redundant with #1072. Closing this — the refactored batch generation with arbitrary segment checkpointing is the cleaner solution. Thanks for the great work on #1072!

@skcadri closed this Mar 30, 2026


Development

Successfully merging this pull request may close these issues.

Prefix cache reuse is broken for all hybrid-architecture models (sliding window, SSM/Mamba)

6 participants