Fix zero prompt-cache reuse for thinking models in multi-turn chat #1042

lyonsno wants to merge 8 commits into ml-explore:main
Conversation
Pin behavioral contracts for review findings: checkpoint persistence through repeated extraction, partial rewind safety on longer hits, refcount lifecycle, deepcopy failure resilience, single-token shorter match threshold, prefix non-eviction on longer insert, and checkpoint localization suppression at prompt boundaries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Non-thinking models get no benefit from checkpoint caching (their cache keys don't diverge between turns), so storing checkpoint entries is pure memory overhead. Gate checkpoint creation on tokenizer.has_thinking to eliminate unnecessary cache growth for standard models. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There is partial overlap with #1006, which addresses one slice of this problem space (0% cache reuse for hybrid models with non-trimmable cache types). Both PRs were developed independently in response to #980. #1006 handles the non-trimmable checkpoint case. This PR also handles thinking-token divergence (where chat templates strip thinking content between turns). Benchmark methodology and results are in the PR description. Harness source is at lyonsno/mlx-lm-benchmarks.
Non-thinking models with non-trimmable caches (ArraysCache) need the checkpoint entry to enable cache reuse via the shorter-cache path. The early return for non-thinking models was a regression from upstream behavior where _compute_prompt_checkpoint always returns (True, -1) for user-terminal chat requests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nice work. We run Qwen3.5-122B with thinking enabled on M2 Ultra 128GB and the 0% cache reuse on multi-turn is one of our top pain points. The checkpoint-at-think-boundary approach is the right call. Code review notes: Strengths:
Minor concerns:
MTP compatibility: the changes to the cache rewind path are worth checking against MTP.

We can benchmark on the 122B (M2 Ultra 128GB) once this is ready to test. Happy to report TTFT at turn 5/10/20 with thinking mode on Qwen3.5.
Production TTFT data from M2 Ultra 128GB, Qwen3.5-122B-A10B 5-bit, thinking mode enabled. 10-turn coding conversation (progressive refinement of a Python module):
TTFT growth factor: 0.95x (flat, not growing). We're running vllm-mlx's prefix cache, not mlx-lm's LRUPromptCache, so the comparison isn't direct. But it confirms the problem: turns with heavy thinking content (1-2, 6-10) show 25-28s TTFT on 122B. The checkpoint approach in this PR would directly address the thinking-token divergence causing cache misses. We'll do a proper A/B test with the full PR applied when it's closer to merge. For now the cache.py changes (BatchRotatingKVCache rewind/trim fix) are applied to our editable mlx-lm without issues.
Metal allocator non-determinism causes the prompt-path subtest to flake at 1.35x. A real memory leak over 120 steps would be 10x+, so 2.0x still catches the failure mode without false positives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address reviewer feedback from PR ml-explore#1042: - CacheEntry.count → ref_count: the field is decremented on extraction, so it's a reference count, not an insertion counter. - Add default rewind() on _BaseCache and a _has_rewind_impl() helper that uses method identity to detect real overrides. This replaces the inline introspection in _can_rewind_layer_cache with a cleaner helper while preserving the same behavior: third-party _BaseCache subclasses that implement rewind() participate automatically without needing an explicit opt-in flag. - Add targeted tests for the _has_rewind_impl contract covering base class, no-override subclass, custom override, and BatchRotatingKVCache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for the thoughtful review. I incorporated the two code-level suggestions that seemed worth tightening. For the rewind-path check, I refactored the inline introspection into a `_has_rewind_impl()` helper on `_BaseCache`, which detects real `rewind()` overrides by method identity.

I also renamed `CacheEntry.count` to `ref_count`: the field is decremented on extraction, so it's a reference count, not an insertion counter.

On shorter-prefix retention, I left the behavior unchanged. That was a deliberate tradeoff rather than an oversight: checkpoint entries need shorter prefixes to coexist, and LRU eviction already bounds the pressure. If someone later shows a workload where coexisting shorter prefixes measurably hurt hit rate or memory behavior, I'd be happy to revisit that separately.

Good to hear that the fallback path through the legacy `is_trimmable`/`trim` contract is holding up on your hardware.
Merged this PR into our editable mlx-lm (feat/mtp-native branch, Qwen3.5-122B). The merge was clean -- one trivial conflict in qwen3_5.py (our MTP changes).

15-turn benchmark on M2 Ultra 128GB, Qwen3.5-122B-A10B 5-bit, thinking enabled, vllm-mlx BatchedEngine:
TTFT growth factor: 0.64x (TTFT not growing across turns). Note: TTFTs are elevated vs the earlier run due to concurrent load testing sharing the server.

For a proper isolated A/B (with vs without the PR, same workload, no concurrent load), we can run that when things quiet down. The data above confirms correctness on 122B with thinking mode.
Clean 15-turn benchmark (solo, no concurrent load). M2 Ultra 128GB, Qwen3.5-122B-A10B 5-bit, thinking enabled, vllm-mlx BatchedEngine:
TTFT growth factor: 0.68x. TTFT plateaus around 26s for heavy turns (1500+ chars content) and drops to ~5s for short responses. No degradation across turns. This is through vllm-mlx's prefix cache (not mlx-lm's LRUPromptCache), so the checkpoint mechanism in this PR isn't directly exercised. For a proper A/B with mlx-lm's server, I'll need to temporarily swap servers. Can do that in a follow-up session. The cache.py changes (rewind/trim) are confirmed working on 122B MTP with thinking enabled. No issues.
PR #1042 A/B Test: mlx-lm server on 122B (as requested)

Here's the proper A/B using mlx-lm's server (not vllm-mlx), exercising the LRUPromptCache checkpoint logic directly. Setup: M2 Ultra 128GB, Qwen3.5-122B-A10B 5-bit.
TTFT growth factor: 0.28x (sublinear -- the prefix cache is working across turns). The key signal: Turn 1 cold-starts at 36.6s, Turn 5 drops to 3.7s (a 10x improvement from a cache hit), and by Turn 15 TTFT is only 10.4s despite ~60K chars of accumulated conversation context. Without PR #1042, thinking models would get 0% cache reuse and TTFT would grow linearly with conversation length.

Bug found and fixed

During testing, the legacy fallback in `_can_rewind_layer_cache` raised a `NameError`:

```python
# Line 347 - `rewind` was never defined in scope
if not callable(is_trimmable) or (not callable(trim) and not callable(rewind)):
```

Fix: add the missing `getattr` for `rewind` before this check.
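For reference, a sketch of what the corrected precheck could look like. The function name and surrounding structure are assumptions on my part; only the flagged `if` line comes from the PR.

```python
def can_rewind_layer_cache(cache):
    """Fail-closed precheck: does this cache expose a usable trim/rewind path?"""
    is_trimmable = getattr(cache, "is_trimmable", None)
    trim = getattr(cache, "trim", None)
    rewind = getattr(cache, "rewind", None)  # the lookup that was missing
    if not callable(is_trimmable) or (not callable(trim) and not callable(rewind)):
        return False
    return True

class TrimOnly:  # toy cache exposing the legacy is_trimmable/trim contract
    def is_trimmable(self):
        return True
    def trim(self, n):
        pass

print(can_rewind_layer_cache(TrimOnly()))  # -> True
print(can_rewind_layer_cache(object()))    # -> False
```

With the `getattr` in place, an object lacking any of the three attributes fails the check cleanly instead of raising `NameError`.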
Code audit: additional findings

While testing on 122B, I audited the prompt cache and batch paths more carefully. Beyond the NameError fix reported above, here are findings worth addressing:
_rewind_layer_cache now checks _has_rewind_impl() before calling rewind(), matching _can_rewind_layer_cache's logic. Previously, _BaseCache subclasses with only trim() (KVCache, RotatingKVCache) would have _can_rewind say yes but _rewind fail via the stub, wasting a deepcopy. Also adds missing getattr for rewind in the legacy fallback of _can_rewind_layer_cache. Six agreement tests added covering legacy non-BaseCache caches, BaseCache-with-trim-no-rewind, and BatchRotatingKVCache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for both the isolated A/B run and the code audit. On the audit findings: 1 & 4, 2, and 3 are each addressed in the latest commits. And thanks again for digging deeply enough to catch the legacy fallback `getattr`.
@angeloskath @awni — this PR has been open since March 18 with no maintainer review. The author (lyonsno) has been very responsive to feedback, addressing all code review findings promptly, including bug fixes with new test coverage. Is there a concern with the approach or scope? Happy to help if anything is needed to move this forward. Context: thinking models currently get 0% prompt cache reuse on multi-turn conversations because thinking tokens diverge between turns. This PR fixes that with checkpoint-based rewind. We validated on 122B production (0.28x TTFT growth factor across 15 turns) and found + reported 5 bugs during testing, all addressed by the author.
The problem

On `main`, thinking models get zero prompt-cache reuse in multi-turn chat. This affects any model whose cache layers include non-trimmable types. In practice, this means Qwen 3.5, Step 3.5 Flash, GPT-OSS, and similar models are stuck at 0% cache reuse in multi-turn chat.

The root cause: when a thinking model responds, the server caches `prompt + <think>...</think> + response` as the key. On the next turn, the chat template strips the thinking content, so the new prompt diverges from the cached key at the first thinking token. The server finds the longer cache and tries `can_trim_prompt_cache` — but that requires all layers to be trimmable, and `ArraysCache` returns `is_trimmable=False`. Result: the longer cache can't be trimmed, there's no shorter match, and the entire prompt is re-prefilled from scratch.
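The divergence described above can be pictured with a toy prefix match. Token strings here are illustrative stand-ins; the server operates on token ids and real chat-template output.

```python
# Toy illustration of the cache-key divergence (illustrative tokens,
# not mlx-lm's real tokenization or cache keys).

def common_prefix_len(cached, new):
    """Length of the shared token prefix between two sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Turn 1 is cached with the thinking block included in the key:
cached_key = ["<user>", "hi", "<think>", "plan", "</think>", "hello"]
# Turn 2 re-renders history with thinking stripped:
new_prompt = ["<user>", "hi", "hello", "<user>", "more"]

# The match ends at the first thinking token, so almost nothing is reusable.
print(common_prefix_len(cached_key, new_prompt))  # -> 2
```

Everything after the divergence point must be re-prefilled, which is why reuse collapses to ~0% for thinking models.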
Measured impact (Qwen 3.5 35B-A3B BF16, 40-turn conversation, 64 tok/turn):
What this PR does
This PR restores prompt-cache reuse in cases that currently degenerate to full prefill for thinking and hybrid-attention models.
1. Checkpoint caching — After processing a chat turn whose last message is from the user, the server saves a KV cache checkpoint just before the thinking-start token. On the next turn, the checkpoint matches the prompt prefix (before thinking was stripped), enabling reuse regardless of cache layer types. Checkpoints are stored in the existing `LRUPromptCache` and evicted under normal LRU pressure — no additional unbounded state is introduced.
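A minimal sketch of where that checkpoint lands, assuming a `<think>` marker token; the real implementation snapshots KV cache state at this position rather than slicing token lists.

```python
# Sketch of checkpoint placement just before the thinking-start token
# (assumed marker name; real code works on token ids and KV state).

THINK_START = "<think>"

def checkpoint_position(tokens):
    """Index of the first thinking-start token, or None if absent."""
    try:
        return tokens.index(THINK_START)
    except ValueError:
        return None

turn = ["<user>", "hi", "<think>", "plan", "</think>", "hello"]
# Everything before this index is exactly the prefix the next turn will
# share after thinking content is stripped, so a checkpoint saved here
# stays valid across turns.
print(checkpoint_position(turn))  # -> 2
```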
2. Rewind-based longer-cache reuse — `fetch_nearest_cache` can now rewind a longer cached prompt to match a shorter request using per-layer `can_rewind`/`rewind` instead of the all-or-nothing `can_trim_prompt_cache`. This enables cache reuse for `BatchRotatingKVCache` (sliding-window attention) and falls back gracefully through the legacy `is_trimmable`/`trim` contract for other cache types. Rewind is only applied when per-layer invariants guarantee equivalence with prefix truncation; otherwise we fall back to a cache miss. A fail-closed precheck avoids deepcopy on guaranteed misses.
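The per-layer capability check can be sketched with override detection by method identity, in the spirit of the `_has_rewind_impl()` helper discussed in this thread. The classes below are toy stand-ins, not the real mlx-lm caches.

```python
# Override detection by method identity: a subclass counts as rewindable
# only if it binds a different function than the base-class stub.

class _BaseCache:
    def rewind(self, n):
        raise NotImplementedError  # default stub, no real rewind support

class TrimOnlyCache(_BaseCache):
    def trim(self, n):
        pass  # has trim() but inherits the rewind stub

class RewindableCache(_BaseCache):
    def rewind(self, n):
        return n  # real override

def has_rewind_impl(cache):
    # In Python 3 unbound methods are plain functions, so identity
    # comparison distinguishes a real override from the inherited stub.
    return type(cache).rewind is not _BaseCache.rewind

print(has_rewind_impl(TrimOnlyCache()))    # -> False
print(has_rewind_impl(RewindableCache()))  # -> True
```

This lets third-party `_BaseCache` subclasses that implement `rewind()` participate automatically, without an explicit opt-in flag.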
3. Refcounted extraction — Cache entries track a reference count. Checkpoint entries are persistent (extraction deepcopies, the original stays), so a checkpoint survives across turns until evicted by LRU pressure. Regular entries are consumed on the last reference. This prevents handing out the same mutable cache to concurrent requests.
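A sketch of that extraction contract with a toy entry type (not mlx-lm's actual `CacheEntry`; the `ref_count` name follows the rename discussed in this thread):

```python
import copy

class CacheEntry:
    def __init__(self, cache, checkpoint=False):
        self.cache = cache
        self.checkpoint = checkpoint
        self.ref_count = 1

def extract(entry, store):
    """Checkpoint entries persist (callers get a deepcopy); regular
    entries are consumed on the last reference, so the same mutable
    cache is never handed to two concurrent requests."""
    if entry.checkpoint:
        return copy.deepcopy(entry.cache)  # original stays in the LRU
    entry.ref_count -= 1
    if entry.ref_count == 0:
        store.remove(entry)                # consumed: drop from the LRU
        return entry.cache                 # sole owner, no copy needed
    return copy.deepcopy(entry.cache)

store = []
ckpt = CacheEntry({"k": [1, 2]}, checkpoint=True)
store.append(ckpt)
extract(ckpt, store)["k"].append(3)  # mutate the handed-out copy
print(ckpt.cache)                    # -> {'k': [1, 2]}  original untouched

reg = CacheEntry({"k": [7]})
store.append(reg)
extract(reg, store)
print(reg in store)                  # -> False  consumed on last reference
```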
4. Batch checkpoint compatibility — `BatchGenerator.insert()` accepts per-prompt checkpoint positions. Prompts whose checkpoint tails are incompatible are truncated to the largest compatible prefix rather than silently generating invalid checkpoints.
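One way to picture the truncation rule; `is_compatible` is a hypothetical predicate standing in for the real per-layer cache-state check.

```python
# Sketch: rather than emitting an invalid checkpoint, walk the requested
# position back to the largest prefix that passes the compatibility check.

def usable_checkpoint(prompt_tokens, checkpoint_pos, is_compatible):
    pos = min(checkpoint_pos, len(prompt_tokens))
    while pos > 0 and not is_compatible(prompt_tokens[:pos]):
        pos -= 1
    return pos  # 0 means "no checkpoint" rather than a bad one

# Toy rule: a prefix may not end on the thinking-start token.
ok = lambda prefix: prefix[-1] != "<think>"
tokens = ["a", "b", "<think>", "c"]
print(usable_checkpoint(tokens, 3, ok))  # -> 2
```

Failing closed toward a shorter (or empty) checkpoint trades a little reuse for never resuming from an invalid cache state.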
Benchmark results
All benchmarks: MacBook Pro 16" M4 Max (40-core GPU), 128 GB unified memory.
40-turn multi-turn chat (Qwen 3.5 35B-A3B BF16, 64 tokens generated per turn, short user messages, `LRUPromptCache` default `max_size=10`):

Overall: this PR skips 61,116 of 64,434 prompt tokens (94.9%). Upstream skips 0.
Wall-clock timing (same 40-turn run, measured prefill and generation
separately):
4x prefill speedup, 34% reduction in total time. Generation speed is
unaffected by checkpoint caching.
Multi-model comparison (6 turns, 256 tokens generated per turn):
Branching conversation (shared prefix, diverging follow-ups):
For non-thinking full-attention models, cache reuse is unchanged — upstream
already handles this well. The improvement targets thinking models and
hybrid attention architectures, which currently get no reuse at all.
Memory overhead
Both branches hold 10 LRU entries at steady state — upstream already pays
the cost of caching completions it can never reuse. The additional memory
from checkpoint entries is ~35 MB (< 3% overhead at turn 40). Cache
entry count stays flat at 10 from turn 5 onward; memory grows sub-linearly
as older entries are evicted and replaced by slightly larger ones reflecting
the growing conversation context.
Checkpoint creation is gated on `tokenizer.has_thinking`. Non-thinking models create no checkpoint entries — verified on Qwen 0.5B over 30 turns with identical cache entry counts, memory usage, and hit rates vs upstream.
Behavioral changes from upstream

- `LRUPromptCache.insert_cache` no longer evicts shorter prefix entries when a longer entry is inserted; both coexist under normal LRU pressure. This is required for checkpoint entries, which must not be displaced by their longer completions.
- Regular (non-checkpoint) entries are consumed on their last extraction, where previously they were deepcopied and the entry left in place.
- `_search` returns shorter matches at index 0 (single-token prefix). Previously an index > 0 was required.
- `_lazy_extract_cache` (a generator for lazy cache extraction) is replaced with an eager list comprehension to avoid late-binding issues in checkpoint callbacks.
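The late-binding hazard behind that last change can be reproduced in isolation (toy data, not mlx-lm code): a generator evaluates its body only when consumed, so state mutated between definition and consumption leaks into the extracted values.

```python
# Lazy extraction: nothing is copied until the generator is consumed.
caches = [{"tokens": [1, 2]}, {"tokens": [3]}]
lazy = (c["tokens"][:] for c in caches)
caches[0]["tokens"].append(99)       # e.g. a callback mutates state first
print(list(lazy))                    # -> [[1, 2, 99], [3]]  mutation leaked

# Eager extraction: copies are taken immediately, later mutation is invisible.
caches = [{"tokens": [1, 2]}, {"tokens": [3]}]
eager = [c["tokens"][:] for c in caches]
caches[0]["tokens"].append(99)
print(eager)                         # -> [[1, 2], [3]]
```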
Size

~330 lines of production changes across 3 files, ~1,800 lines of test coverage across 6 test files. The test surface includes:

- `LRUPromptCache` behavior tests (rewind ranking, recency refresh, extraction lifecycle, partial-rewind safety)
- `BatchRotatingKVCache` rewind + mask correctness after rotation
- `BatchGenerator` checkpoint callback contract tests with real model inference (resume-from-checkpoint reproduces full-prompt logprobs)
- `ResponseGenerator` integration tests (checkpoint forwarding/suppression, warm-cache localization, malformed-request worker survival)
Tests
pytest tests/test_prompt_cache.py
pytest tests/test_prompt_cache_server_behavior.py
pytest tests/test_prompt_cache_server_rewind_internal.py
pytest tests/test_generate.py
pytest tests/test_server.py

Full suite: 207 passed, 1 skipped, 3 warnings