
cache-research: weekly report for 2026-05-10 #5

Draft
thinkingfish wants to merge 1 commit into main from claude/intelligent-cannon-KwGUJ

Conversation

@thinkingfish
Owner

Summary

Adds cache-research/weekly-cache-report-2026-05-10.md covering distributed caching, KV cache, caching for inference, and storage-system caching for the window 2026-05-04 → 2026-05-10. Nine primary entries (target was ~5; expanded to nine because the post-DeepSeek-V4 wave, a pair of vLLM patch releases, the LightSeek TokenSpeed launch, and a wave of arXiv 2605.xxxxx KV-cache submissions all landed in the same week), plus six "outside the window / outside scope" companion entries.

Headline entries

Production / empirical:

  • vLLM × Mooncake — Serving Agentic Workloads at Scale (May 7) — 80K+ token agent traces with 94%+ reusable prefixes; 3.8× throughput vs. local-prefix-cache vLLM. Mooncake Store becomes the vLLM-blessed cross-instance KV substrate for agentic workloads. (A back-of-envelope model of the speedup follows this list.)
  • LightSeek Foundation — TokenSpeed launch (May 7) — MIT-licensed open-source MLA-first inference engine; within ~9% of TensorRT-LLM on minimum latency and ~11% on throughput at 100 TPS/user on Kimi K2.5; near-halved decode latency under speculative decoding (batch 4/8/16, long prefix KV); MLA kernel upstreamed into vLLM day-0.
  • LMCache — Deepseek V4 explained, why it matters to your wallet (May 4) — operator-facing economics post tying V4's CSA + HCA compression directly to 2–3× cheaper token prices.
  • vLLM v0.20.1 (May 4) and v0.20.2 (May 10) — DeepSeek V4 stabilization patches, including a V1 KV-cache-manager block-allocation fix.
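
For intuition on the 3.8× number above, a minimal Amdahl-style sketch, assuming cache hits fully eliminate prefill compute and ignoring KV-transfer costs and the baseline's own local prefix-cache hits (the prefill_share value below is a hypothetical fit, not a figure from the post):

```python
def idealized_speedup(prefill_share: float, hit_rate: float) -> float:
    """Idealized speedup when a `hit_rate` fraction of prefix tokens is served
    from a shared KV store instead of being recomputed, and prefill accounts
    for `prefill_share` of baseline request time (Amdahl-style model)."""
    decode_share = 1.0 - prefill_share
    return 1.0 / (decode_share + prefill_share * (1.0 - hit_rate))

# With 94% reusable prefixes, ~3.8x is consistent with prefill being roughly
# 78% of baseline time on these long agent traces -- a hypothetical fit:
print(idealized_speedup(prefill_share=0.78, hit_rate=0.94))  # ~3.75
```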

Academic / idea-forward:

  • eOptShrinkQ (arXiv:2605.02905) — spectral-shrinkage low-rank decomposition + TurboQuant on the residual; ~1 bit/entry savings over TurboQuant with end-to-end LongBench evaluation. (A toy sketch of the low-rank-plus-quantized-residual recipe follows this list.)
  • One Pool, Two Caches / HELM (arXiv:2605.04450) — first formal treatment of EMB-vs-KV HBM partitioning for generative recommenders; the measured optimal split point swings by up to 0.35 of the pool across workloads.
  • RetentiveKV (arXiv:2605.04075) — state-space-memory eviction for multimodal KV cache; 5× compression and 1.5× decode speedup with 2× retrieval-score win at 5% budget.
  • ZeRO-Prefill (arXiv:2605.02960) — async weight-AllGather replaces activation-AllToAll for MoE prefill, eliminating dispatch redundancy inherited from the decoding era.
  • Lighthouse Attention (arXiv:2605.06554) — training-only hierarchical SDPA wrapper; matches dense-SDPA quality at the same token budget while keeping inference-time attention dense.
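
To make the eOptShrinkQ entry concrete, a minimal NumPy sketch of the generic low-rank-plus-quantized-residual recipe; the paper's spectral-shrinkage weights and TurboQuant residual quantizer are not reproduced here, and the rank-8 truncation and uniform 4-bit quantizer are illustrative stand-ins:

```python
import numpy as np

def compress_kv(kv: np.ndarray, rank: int = 8, bits: int = 4):
    """Split a (tokens, head_dim) KV slab into low-rank factors plus a
    uniformly quantized residual (stand-in for TurboQuant)."""
    u, s, vt = np.linalg.svd(kv, full_matrices=False)
    factors = (u[:, :rank] * s[:rank], vt[:rank])  # keep top-`rank` spectrum
    residual = kv - factors[0] @ factors[1]
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(residual).max()) / qmax, 1e-8)
    # int8 container for clarity; real 4-bit storage would pack two values/byte
    q = np.clip(np.round(residual / scale), -qmax - 1, qmax).astype(np.int8)
    return factors, q, scale

def decompress_kv(factors, q, scale):
    return factors[0] @ factors[1] + q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((256, 64)).astype(np.float32)
factors, q, scale = compress_kv(kv)
print(float(np.abs(kv - decompress_kv(factors, q, scale)).max()))  # small error
```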

Cross-cutting observations

  • Distributed KV cache is now vLLM-blessed for agentic traffic.
  • MLA kernels are an open-ecosystem competition surface, with ~2-month catch-up to TensorRT-LLM.
  • Theory (eOptShrinkQ, the position paper, last week's CapKV) continues to catch up with mechanisms.
  • KV is no longer alone in HBM — generative recommenders force joint EMB-vs-KV management. (A toy split-point sweep follows this list.)
  • MoE prefill needs its own dispatch primitive (ZeRO-Prefill).
  • Erasure-coded KV (GhostServe) and continuous-memory eviction (RetentiveKV) reframe KV cache as a real storage system.
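
As a hedged illustration of the EMB-vs-KV point above: a brute-force sweep over split points for two made-up concave hit-rate curves. This is not HELM's formulation (arXiv:2605.04450); the curves, pool size, and solver are invented to show why the optimum moves with the workload:

```python
import numpy as np

def best_split(pool_gb: float, emb_hit, kv_hit, steps: int = 100) -> float:
    """Return the EMB fraction of the HBM pool maximizing combined hit rate."""
    fracs = np.linspace(0.0, 1.0, steps + 1)
    scores = [emb_hit(f * pool_gb) + kv_hit((1.0 - f) * pool_gb) for f in fracs]
    return float(fracs[int(np.argmax(scores))])

# Toy concave hit-rate curves; shifting either one moves the optimal split,
# which is why a single static partition leaves hit rate on the table.
emb = lambda gb: 1.0 - np.exp(-gb / 8.0)
kv = lambda gb: 1.0 - np.exp(-gb / 16.0)
print(best_split(64.0, emb, kv))  # ~0.39 EMB share for this made-up workload
```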

Coverage self-check

  • Frontier AI labs: DeepSeek (V4 wallet/V4-Plus indirectly), Anthropic / OpenAI / Google DeepMind / Meta / xAI / Mistral / Cohere (no in-window KV-cache content), NVIDIA (Dynamo 1.0, CMX outside window), Microsoft (none), Moonshot/Kimi (Mooncake/vLLM ✓), Alibaba/Qwen / Apple / Amazon Science (HyperPod outside) / Tencent (Hy3 model release, no KV-cache angle in-window) / Baidu (none).
  • Vendors: vLLM ✓ (×2), SGLang (no in-window post; April 25 V4 day-0 outside), LMCache ✓, TensorRT-LLM (indirect via TokenSpeed), NVIDIA Dynamo (1.0 outside window), llm-d (April 21 v0.5 outside), Red Hat AI / Together / Fireworks / Anyscale / Modal / Replicate / RunPod / CoreWeave (none in-window), VAST Data / Pure Storage / WEKA (none in-window; CMX-related news cluster outside window).
  • Storage: CacheLib / Cachelib-FDP / RocksDB / Ceph / MinIO (none new); AWS storage (HyperPod November 2025); Google Cloud Managed Lustre (April 22, just outside window — flagged in additional context); Azure / Snowflake / Databricks engineering blogs (none in-window).
  • Conferences: USENIX FAST '26 (Feb 24–26 ✓ — papers in record), OSDI / ATC / Security '26 (none in-window), SOSP / EuroSys '26 (April 27–30 outside window), ASPLOS '26 (March 22–26 outside window), HotStorage / SoCC / MLSys / NeurIPS / ICLR / ICML (none in-window).
  • arXiv: cs.DC, cs.OS, cs.AR, cs.CL, cs.LG with KV/cache/attention/inference keywords swept through 2604.xxxxx, 2605.xxxxx, 2606.xxxxx prefixes.
  • Curated trackers: Awesome-KV-Cache-Management, Awesome-KV-Cache-Optimization, Awesome-KV-Cache-Compression all checked.

Test plan

https://claude.ai/code/session_018DNoJMRCgb7pwFZCneFo9r


Generated by Claude Code

Covers 2026-05-04 → 2026-05-10. Headlines: vLLM × Mooncake distributed
KV for agentic workloads (3.8× throughput, 94% reusable prefixes),
LightSeek TokenSpeed inference engine launch with day-0 vLLM MLA
adoption, LMCache's V4 wallet-economics post, eOptShrinkQ
spectral-shrinkage KV compression (arXiv:2605.02905), HELM adaptive
HBM partitioning for generative recommenders (arXiv:2605.04450),
RetentiveKV state-space eviction, ZeRO-Prefill MoE serving, and
Lighthouse Attention long-context pretraining.

