
cache-research: weekly report for 2026-05-10 #5

Draft
thinkingfish wants to merge 1 commit into main from claude/intelligent-cannon-KwGUJ

Conversation

@thinkingfish
Owner

Summary

Adds cache-research/weekly-cache-report-2026-05-10.md covering distributed caching, KV cache, caching for inference, and storage-system caching for the window 2026-05-04 → 2026-05-10. Nine primary entries (target was ~5; expanded to nine because the post-DeepSeek-V4 wave, a pair of vLLM patch releases, the LightSeek TokenSpeed launch, and a wave of arXiv 2605.xxxxx KV-cache submissions all landed in the same week), plus six "outside the window / outside scope" companion entries.

Headline entries

Production / empirical:

  • vLLM × Mooncake — Serving Agentic Workloads at Scale (May 7) — 80K+ token agent traces with 94%+ reusable prefixes; 3.8× throughput vs. local-prefix-cache vLLM. Mooncake Store becomes the vLLM-blessed cross-instance KV substrate for agentic workloads. (A back-of-envelope model of the speedup follows this list.)
  • LightSeek Foundation — TokenSpeed launch (May 7) — MIT-licensed open-source MLA-first inference engine; within ~9% of TensorRT-LLM on minimum latency and ~11% on throughput at 100 TPS/user on Kimi K2.5; near-halved decode latency under speculative decoding (batch 4/8/16, long prefix KV); MLA kernel upstreamed into vLLM day-0.
  • LMCache — Deepseek V4 explained, why it matters to your wallet (May 4) — operator-facing economics post tying V4's CSA + HCA compression directly to 2–3× cheaper token prices.
  • vLLM v0.20.1 (May 4) and v0.20.2 (May 10) — DeepSeek V4 stabilization patches, including a V1 KV-cache-manager block-allocation fix.
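
For intuition on the 3.8× number above, a minimal Amdahl-style sketch, assuming cache hits fully eliminate prefill compute and ignoring KV-transfer costs and the baseline's own local prefix-cache hits (the prefill_share value below is a hypothetical fit, not a figure from the post):

```python
def idealized_speedup(prefill_share: float, hit_rate: float) -> float:
    """Idealized speedup when a `hit_rate` fraction of prefix tokens is served
    from a shared KV store instead of being recomputed, and prefill accounts
    for `prefill_share` of baseline request time (Amdahl-style model)."""
    decode_share = 1.0 - prefill_share
    return 1.0 / (decode_share + prefill_share * (1.0 - hit_rate))

# With 94% reusable prefixes, ~3.8x is consistent with prefill being roughly
# 78% of baseline time on these long agent traces -- a hypothetical fit:
print(idealized_speedup(prefill_share=0.78, hit_rate=0.94))  # ~3.75
```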

Academic / idea-forward:

  • eOptShrinkQ (arXiv:2605.02905) — spectral-shrinkage low-rank decomposition + TurboQuant on the residual; ~1 bit/entry savings over TurboQuant with end-to-end LongBench evaluation. (A toy sketch of the low-rank-plus-quantized-residual recipe follows this list.)
  • One Pool, Two Caches / HELM (arXiv:2605.04450) — first formal treatment of EMB-vs-KV HBM partitioning for generative recommenders; the measured optimal split point swings by up to 0.35 of the pool across workloads.
  • RetentiveKV (arXiv:2605.04075) — state-space-memory eviction for multimodal KV cache; 5× compression and 1.5× decode speedup with 2× retrieval-score win at 5% budget.
  • ZeRO-Prefill (arXiv:2605.02960) — async weight-AllGather replaces activation-AllToAll for MoE prefill, eliminating dispatch redundancy inherited from the decoding era.
  • Lighthouse Attention (arXiv:2605.06554) — training-only hierarchical SDPA wrapper; matches dense-SDPA quality at the same token budget while keeping inference-time attention dense.
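
To make the eOptShrinkQ entry concrete, a minimal NumPy sketch of the generic low-rank-plus-quantized-residual recipe; the paper's spectral-shrinkage weights and TurboQuant residual quantizer are not reproduced here, and the rank-8 truncation and uniform 4-bit quantizer are illustrative stand-ins:

```python
import numpy as np

def compress_kv(kv: np.ndarray, rank: int = 8, bits: int = 4):
    """Split a (tokens, head_dim) KV slab into low-rank factors plus a
    uniformly quantized residual (stand-in for TurboQuant)."""
    u, s, vt = np.linalg.svd(kv, full_matrices=False)
    factors = (u[:, :rank] * s[:rank], vt[:rank])  # keep top-`rank` spectrum
    residual = kv - factors[0] @ factors[1]
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(residual).max()) / qmax, 1e-8)
    # int8 container for clarity; real 4-bit storage would pack two values/byte
    q = np.clip(np.round(residual / scale), -qmax - 1, qmax).astype(np.int8)
    return factors, q, scale

def decompress_kv(factors, q, scale):
    return factors[0] @ factors[1] + q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((256, 64)).astype(np.float32)
factors, q, scale = compress_kv(kv)
print(float(np.abs(kv - decompress_kv(factors, q, scale)).max()))  # small error
```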

Cross-cutting observations

  • Distributed KV cache is now vLLM-blessed for agentic traffic.
  • MLA kernels are an open-ecosystem competition surface, with ~2-month catch-up to TensorRT-LLM.
  • Theory (eOptShrinkQ, the position paper, last week's CapKV) continues to catch up with mechanisms.
  • KV is no longer alone in HBM — generative recommenders force joint EMB-vs-KV management. (A toy split-point sweep follows this list.)
  • MoE prefill needs its own dispatch primitive (ZeRO-Prefill).
  • Erasure-coded KV (GhostServe) and continuous-memory eviction (RetentiveKV) reframe KV cache as a real storage system.
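
As a hedged illustration of the EMB-vs-KV point above: a brute-force sweep over split points for two made-up concave hit-rate curves. This is not HELM's formulation (arXiv:2605.04450); the curves, pool size, and solver are invented to show why the optimum moves with the workload:

```python
import numpy as np

def best_split(pool_gb: float, emb_hit, kv_hit, steps: int = 100) -> float:
    """Return the EMB fraction of the HBM pool maximizing combined hit rate."""
    fracs = np.linspace(0.0, 1.0, steps + 1)
    scores = [emb_hit(f * pool_gb) + kv_hit((1.0 - f) * pool_gb) for f in fracs]
    return float(fracs[int(np.argmax(scores))])

# Toy concave hit-rate curves; shifting either one moves the optimal split,
# which is why a single static partition leaves hit rate on the table.
emb = lambda gb: 1.0 - np.exp(-gb / 8.0)
kv = lambda gb: 1.0 - np.exp(-gb / 16.0)
print(best_split(64.0, emb, kv))  # ~0.39 EMB share for this made-up workload
```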

Coverage self-check

  • Frontier AI labs: DeepSeek (V4 wallet/V4-Plus indirectly), Anthropic / OpenAI / Google DeepMind / Meta / xAI / Mistral / Cohere (no in-window KV-cache content), NVIDIA (Dynamo 1.0, CMX outside window), Microsoft (none), Moonshot/Kimi (Mooncake/vLLM ✓), Alibaba/Qwen / Apple / Amazon Science (HyperPod outside) / Tencent (Hy3 model release, no KV-cache angle in-window) / Baidu (none).
  • Vendors: vLLM ✓ (×2), SGLang (no in-window post; April 25 V4 day-0 outside), LMCache ✓, TensorRT-LLM (indirect via TokenSpeed), NVIDIA Dynamo (1.0 outside window), llm-d (April 21 v0.5 outside), Red Hat AI / Together / Fireworks / Anyscale / Modal / Replicate / RunPod / CoreWeave (none in-window), VAST Data / Pure Storage / WEKA (none in-window; CMX-related news cluster outside window).
  • Storage: CacheLib / Cachelib-FDP / RocksDB / Ceph / MinIO (none new); AWS storage (HyperPod November 2025); Google Cloud Managed Lustre (April 22, just outside window — flagged in additional context); Azure / Snowflake / Databricks engineering blogs (none in-window).
  • Conferences: USENIX FAST '26 (Feb 24–26 ✓ — papers in record), OSDI / ATC / Security '26 (none in-window), SOSP / EuroSys '26 (April 27–30 outside window), ASPLOS '26 (March 22–26 outside window), HotStorage / SoCC / MLSys / NeurIPS / ICLR / ICML (none in-window).
  • arXiv: cs.DC, cs.OS, cs.AR, cs.CL, cs.LG with KV/cache/attention/inference keywords swept through 2604.xxxxx, 2605.xxxxx, 2606.xxxxx prefixes.
  • Curated trackers: Awesome-KV-Cache-Management, Awesome-KV-Cache-Optimization, Awesome-KV-Cache-Compression all checked.

Test plan

https://claude.ai/code/session_018DNoJMRCgb7pwFZCneFo9r


Generated by Claude Code

Covers 2026-05-04 → 2026-05-10. Headlines: vLLM × Mooncake distributed
KV for agentic workloads (3.8× throughput, 94% reusable prefixes),
LightSeek TokenSpeed inference engine launch with day-0 vLLM MLA
adoption, LMCache's V4 wallet-economics post, eOptShrinkQ
spectral-shrinkage KV compression (arXiv:2605.02905), HELM adaptive
HBM partitioning for generative recommenders (arXiv:2605.04450),
RetentiveKV state-space eviction, ZeRO-Prefill MoE serving, and
Lighthouse Attention long-context pretraining.

