A simulated GPU-aware LLM inference system that models:
- KV cache reuse (exact + prefix)
- GPU memory constraints and admission control
- Block-based KV cache allocation
- Eviction under memory pressure
- End-to-end latency impact via benchmarking
Includes comparative benchmarks across cache strategies and GPU-aware admission policies.
Workload → Router → Cache Reuse → GPU Memory → Block Allocation → Eviction → Inference → Latency

This project models a real AI inference control plane, focusing on:
• reducing redundant inference computation
• maximizing cache reuse across nodes
• ensuring correctness under failure
• maintaining consistent distributed state

The results below demonstrate how cache reuse and GPU memory constraints shape LLM inference performance.
We compare baseline inference, prefix reuse, exact cache hits, and GPU-aware admission with and without eviction.
| Scenario | Avg Latency (ms) | P95 Latency (ms) | Hit Rate | Rejection Rate |
|---|---|---|---|---|
| No Cache | 1405 | 1405 | 0% | 0% |
| Prefix Reuse | 985 | 1405 | 50% | 0% |
| Exact Cache | 205 | 205 | 100% | 0% |
| GPU-Aware | 843 | 1405 | 25% | 25% |
| GPU-Aware + Eviction | 1895 | 4205 | 25% | 0% |
Observation:
- Exact cache reuse eliminates most prefill cost, resulting in the lowest latency.
- Prefix reuse provides partial improvement proportional to reused tokens.
- No cache represents the baseline with full prefill cost.
- Prefix reuse improves average latency, but tail latency remains high when misses are still present in the workload.
- GPU-aware admission reduces average latency for admitted requests by rejecting oversized requests under memory pressure, at the cost of a higher rejection rate.
- Adding eviction lowers the rejection rate, but increases average and tail latency by admitting previously rejected, expensive requests.
- Exact cache reuse reduces latency by ~85% vs no cache
- Prefix reuse improves average latency but not tail latency (p95 remains high)
- GPU-aware admission reduces latency by rejecting oversized requests
- Eviction reduces rejection but increases latency by admitting expensive requests
To validate the serving model, I integrated Ollama as a real inference backend and ran controlled experiments.
- Model: llama3.1:8b (Ollama)
- Controlled prompts (fixed output length)
- Scenarios:
  - Cold request
  - Warm request (same prompt)
  - Prefix-related prompt
  - Unrelated prompt
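A minimal sketch of how these per-request timings can be collected (assuming libcurl is available; the duration fields follow Ollama's `/api/generate` response and are reported in nanoseconds — this is not part of this repository):

```cpp
#include <curl/curl.h>
#include <iostream>
#include <string>

// Append the HTTP response body into a std::string.
static size_t WriteCb(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

// Crude extraction of a numeric field ("name":123456789) from the JSON response.
static long long ExtractNs(const std::string& body, const std::string& field) {
    size_t pos = body.find("\"" + field + "\":");
    if (pos == std::string::npos) return 0;
    return std::stoll(body.substr(pos + field.size() + 3));
}

int main() {
    // Fixed output length (num_predict) so decode time is comparable across runs.
    const std::string request_body =
        R"({"model":"llama3.1:8b","prompt":"Explain KV caching in one paragraph.",)"
        R"("stream":false,"options":{"num_predict":128}})";

    std::string response;
    CURL* curl = curl_easy_init();
    if (!curl) return 1;
    struct curl_slist* headers =
        curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:11434/api/generate");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, request_body.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    CURLcode rc = curl_easy_perform(curl);
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    if (rc != CURLE_OK) { std::cerr << "request failed\n"; return 1; }

    // Ollama reports durations in nanoseconds; convert to milliseconds.
    std::cout << "total_ms="       << ExtractNs(response, "total_duration") / 1e6 << "\n"
              << "prompt_eval_ms=" << ExtractNs(response, "prompt_eval_duration") / 1e6 << "\n"
              << "decode_ms="      << ExtractNs(response, "eval_duration") / 1e6 << "\n";
}
```

Measured results: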
| Scenario | Total Latency (ms) | Prompt Eval (ms) | Decode (ms) |
|---|---|---|---|
| cold_1 | ~8488 | 177 | 5238 |
| warm_1 | ~5520 | 47 | 5372 |
| prefix_1 | ~5891 | 47 | 5747 |
| prefix_2 | ~5825 | 160 | 5570 |
| diff_1 | ~1891 | 163 | 1653 |
- Warm requests significantly reduce prompt evaluation latency
- End-to-end latency is dominated by decode (generation) time
- Prefix similarity alone does not reduce total latency without controlling output size
- Output length is a primary driver of latency in real inference systems
While prefix-aware caching reduces prefill cost, real-world LLM serving performance is largely governed by generation cost. This highlights the importance of combining caching with efficient decoding strategies.
Repeated experiments produced consistent latency ranges:
- prefix_1: ~5.1–5.3s
- prefix_2: ~5.8–6.0s
- diff_1: ~1.7s
This confirms that observed behavior is stable and not due to random variance.
While prefix-aware caching reduces prompt evaluation cost, real-world LLM latency is dominated by token generation. This highlights that effective serving optimization must address decode efficiency, not just prompt reuse.
To evaluate serving behavior under load, I added a concurrent Ollama inference benchmark.
- Model: Llama 3.1 8B via Ollama
- Prompt: fixed output length
- Workload: same prompt across concurrent requests
- Metric: average latency, p95 latency, wall-clock time, throughput
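A sketch of the concurrent driver and the statistics it reports (the inference call is stubbed out here; in the actual benchmark each call blocks on an Ollama request):

```cpp
#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

using Clock = std::chrono::steady_clock;

// Stand-in for one blocking inference request; the real driver calls the Ollama API.
double RunInferenceMs() {
    auto t0 = Clock::now();
    // ... send the request and wait for the complete response ...
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    const int concurrency = 10;
    auto wall_start = Clock::now();

    // Launch all requests at once; each future resolves to that request's latency.
    std::vector<std::future<double>> inflight;
    for (int i = 0; i < concurrency; ++i)
        inflight.push_back(std::async(std::launch::async, RunInferenceMs));

    std::vector<double> latencies;
    for (auto& f : inflight) latencies.push_back(f.get());

    double wall_ms =
        std::chrono::duration<double, std::milli>(Clock::now() - wall_start).count();

    std::sort(latencies.begin(), latencies.end());
    double avg = std::accumulate(latencies.begin(), latencies.end(), 0.0) / latencies.size();
    size_t p95_idx = std::min(latencies.size() - 1,
                              static_cast<size_t>(latencies.size() * 0.95));
    double throughput = concurrency / (wall_ms / 1000.0);

    std::cout << "avg=" << avg << " ms, p95=" << latencies[p95_idx]
              << " ms, wall=" << wall_ms << " ms, throughput=" << throughput << " req/s\n";
}
```

Results: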
| Concurrency | Avg Latency | P95 Latency | Wall Clock | Throughput |
|---|---|---|---|---|
| 1 | 5,771 ms | 5,771 ms | 5,794 ms | 0.17 req/s |
| 3 | 10,963 ms | 16,299 ms | 16,310 ms | 0.18 req/s |
| 5 | 16,560 ms | 27,744 ms | 27,758 ms | 0.18 req/s |
| 10 | 29,040 ms | 53,525 ms | 53,543 ms | 0.19 req/s |
- Average latency increases as concurrency grows.
- P95 latency grows much faster than average latency.
- Throughput remains almost flat even with 10x more concurrent requests.
- Wall-clock time closely follows p95 latency, suggesting requests are contending for the same inference runtime resource.
Single-request latency is misleading.
Real LLM serving behavior is shaped by:
- queueing
- runtime contention
- scheduling
- GPU utilization
- tail latency
This experiment shows why production LLM serving systems need batching, scheduling, and admission control rather than only cache reuse.
| Entries | Write (ms) | Recovery Lookup (ms) |
|---|---|---|
| 100 | 2 | ~0 |
| 1000 | ~10 | ~2–4 |
| 5000 | 37 | 10 |
Note: Recovery time here measures lookup after WAL load, not full process restart recovery.
This project uses a WAL-backed key-value store (VeriStore) to persist LLM cache metadata (prefix → KV blocks, session routing).
- Local WAL-backed KV store
- Metadata operations:
  - Cache entry registration
  - Prefix lookup
- Recovery measured by:
  - WAL open / replay time
  - Metadata lookup after restart
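To make the measurement concrete, here is a toy append-only log with the same write / replay / lookup shape (illustrative only; VeriStore's actual engine and API are more involved and are not shown here):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

// Toy append-only WAL for cache metadata (prefix_hash -> node_id).
// Illustrates the write / replay / lookup pattern being measured; not VeriStore's API.
class MetadataWal {
public:
    explicit MetadataWal(std::string path) : path_(std::move(path)) {}

    // Register a cache entry: append to the log, then update the in-memory index.
    void Put(const std::string& prefix_hash, const std::string& node_id) {
        std::ofstream out(path_, std::ios::app);
        out << prefix_hash << ' ' << node_id << '\n';  // fsync omitted for brevity
        index_[prefix_hash] = node_id;
    }

    // Recovery: replay the log from the beginning to rebuild the in-memory index.
    void Replay() {
        index_.clear();
        std::ifstream in(path_);
        std::string prefix_hash, node_id;
        while (in >> prefix_hash >> node_id) index_[prefix_hash] = node_id;
    }

    // Metadata lookup after restart.
    const std::string* Get(const std::string& prefix_hash) const {
        auto it = index_.find(prefix_hash);
        return it == index_.end() ? nullptr : &it->second;
    }

private:
    std::string path_;
    std::unordered_map<std::string, std::string> index_;
};
```

Measured results: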
| Entries | Write (ms) | Startup Recovery (ms) | Lookup (ms) | Total Recovery (ms) |
|---|---|---|---|---|
| 100 | 2 | 0 | 0 | ~7 |
| 1,000 | 12 | 3 | 2 | ~9 |
| 5,000 | 35 | 9 | 8 | ~20 |
- WAL replay is fast even with thousands of entries.
- Metadata recovery scales efficiently with number of entries.
- Total recovery time remains in the millisecond range.
Real LLM inference latency (Llama 3.1 8B via Ollama):
- ~5,000–8,000 ms per request
Storage (WAL-backed metadata) is not the bottleneck in LLM serving.
- During serving → inference dominates latency
- During recovery → storage determines system availability
Cache persistence enables fast recovery without significantly impacting serving latency.
This demonstrates that:
- KV cache metadata can be safely persisted
- Systems can recover quickly after failure
- Inference remains the dominant cost in LLM serving
To simulate real system pressure:
`--concurrency=10 --max-active=3`
| Metric | No Control | With Control |
|---|---|---|
| Accepted | 10 | 3 |
| Rejected | 0 | 7 |
| P95 Latency (s) | ~53.5 | ~20.7 |
| Throughput (req/s) | ~0.18 | ~0.15 |
Insight:
- Accepting all requests → queue grows → latency explodes.
- Limiting active requests → queue stays small → latency stays controlled.

Good systems don’t try to serve everyone; they protect latency by rejecting excess load.
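A sketch of the admission gate this corresponds to: a counting limit on in-flight requests, where `--max-active` maps to `max_active` (names are illustrative, not the benchmark's actual code):

```cpp
#include <atomic>

// Admit at most max_active in-flight requests; reject the rest instead of queueing.
class AdmissionGate {
public:
    explicit AdmissionGate(int max_active) : max_active_(max_active), active_(0) {}

    // Returns true if the request is admitted; callers must call Release() when done.
    bool TryAdmit() {
        int cur = active_.load();
        while (cur < max_active_) {
            if (active_.compare_exchange_weak(cur, cur + 1)) return true;
        }
        return false;  // shed load to protect latency
    }

    void Release() { active_.fetch_sub(1); }

private:
    const int max_active_;
    std::atomic<int> active_;
};
```

Rejecting at admission time keeps the queue short, which is why the p95 latency drops even though throughput barely changes.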
Large Language Model inference can be significantly accelerated through KV-cache reuse.
In distributed serving systems, this introduces key challenges:
• Where are cached prefixes stored?
• Which nodes hold reusable KV blocks?
• How should requests be routed for maximum reuse?
• How do we avoid redundant recomputation?

This project implements a centralized metadata-driven control plane to solve these problems.
llm-serving-cache is the inference serving layer of a layered AI infrastructure stack:
| Layer | Project | Role |
|---|---|---|
| Storage | VeriStore | WAL durability, crash recovery, Raft replication, object storage |
| Inference Serving | llm-serving-cache (this project) | KV-cache placement and routing control plane, backed by VeriStore |
| Workload Orchestration | Veriflow | GPU-aware scheduler for training, inference, and evaluation jobs |
This project depends on VeriStore (github.com/NasitSony/VeriStore) as its durable metadata backend — all cache entries, session routes, and node registry state are persisted using VeriStore's WAL-backed KV engine, inheriting its crash-consistency and deterministic recovery guarantees.
This phase introduces a fixed-size block abstraction to model GPU memory for KV cache allocation.
Instead of tracking VRAM as a single value, memory is represented as discrete blocks, enabling realistic simulation of allocation, placement, and reclamation behavior in LLM serving systems.
GPU memory is divided into fixed-size blocks:
block_size = 16 MB

Each node maintains a pool of blocks derived from its VRAM:

total_blocks = total_vram_mb / block_size

Each pool tracks:
- free blocks
- allocated blocks
For each request:
required_blocks = ceil(kv_size_mb / block_size)

Blocks are allocated from a selected node’s pool.
A **best-fit strategy** is used:
- consider nodes with sufficient free blocks
- compute leftover blocks after allocation
- select the node with the minimum leftover
Allocated blocks are tracked per request:
request_id → list of block_ids

This enables precise memory management and cleanup.
On request completion:
- allocated blocks are released
- block pool state is restored
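A condensed sketch of this allocation path (simplified types for illustration; not the project's actual headers):

```cpp
#include <cmath>
#include <limits>
#include <map>
#include <string>

// Simplified block-pool allocator mirroring the flow above.
constexpr int kBlockSizeMb = 16;

struct BlockPool { int free_blocks = 0; };
struct Allocation { std::string node_id; int blocks = 0; };

class BlockAllocator {
public:
    void AddNode(const std::string& node_id, int vram_mb) {
        pools_[node_id].free_blocks = vram_mb / kBlockSizeMb;  // total_blocks
    }

    // Best-fit: among nodes with enough free blocks, pick the one with minimal leftover.
    // Returns the chosen node id, or an empty string if no node can fit the request.
    std::string Allocate(const std::string& request_id, int kv_size_mb) {
        int required = static_cast<int>(std::ceil(double(kv_size_mb) / kBlockSizeMb));
        std::string best;
        int best_leftover = std::numeric_limits<int>::max();
        for (const auto& [node, pool] : pools_) {
            if (pool.free_blocks < required) continue;        // insufficient_blocks
            int leftover = pool.free_blocks - required;
            if (leftover < best_leftover) { best_leftover = leftover; best = node; }
        }
        if (best.empty()) return "";                           // allocation failed
        pools_[best].free_blocks -= required;
        allocations_[request_id] = {best, required};           // request_id -> blocks
        return best;
    }

    // On request completion, return its blocks to the owning node's pool.
    void Free(const std::string& request_id) {
        auto it = allocations_.find(request_id);
        if (it == allocations_.end()) return;
        pools_[it->second.node_id].free_blocks += it->second.blocks;
        allocations_.erase(it);
    }

private:
    std::map<std::string, BlockPool> pools_;
    std::map<std::string, Allocation> allocations_;
};
```

Example allocation trace: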
Allocating request req-4
required_kv_mb=180 required_blocks=12
Candidate node-a: free_blocks=20 leftover_blocks=8
Candidate node-b: free_blocks=31 leftover_blocks=19
Selected pool=node-a reason=best_fit_blocks
allocated_blocks=[node-a-block-0,...,node-a-block-11]
node-a free_blocks=8 allocated_blocks=54

Freed request req-4 blocks=[node-a-block-0,...,node-a-block-11]
node-a free_blocks=20 allocated_blocks=42

Allocating request req-5
required_kv_mb=600 required_blocks=38
Candidate node-a: free_blocks=20 rejected=insufficient_blocks
Candidate node-b: free_blocks=31 rejected=insufficient_blocks
Allocation failed: no node has enough free blocks

- GPU memory is modeled as discrete allocation units
- Best-fit placement improves memory utilization
- Request-level tracking enables accurate allocation and freeing
- Failure handling ensures correctness under resource constraints
This phase transforms the system from:
VRAM tracking → memory management system

and demonstrates how:

Request → KV size → Block allocation → Placement → Free

drive realistic GPU memory behavior in LLM serving systems.
This phase adds pressure handling to the KV cache block system.
When a request cannot be allocated because no node has enough free blocks, the system:
- triggers eviction
- skips active requests
- evicts the oldest inactive request
- retries allocation once
- rejects the request if eviction is still insufficient
This models realistic overload behavior in LLM serving systems.
Under block pressure, the allocator follows this sequence:
Allocate request
→ insufficient free blocks
→ trigger eviction
→ skip active requests
→ evict oldest inactive request
→ retry allocation
→ reject if still insufficient

A simple oldest-request policy is used as the initial eviction strategy.
Behavior
- requests are tracked in allocation order
- active requests are protected from eviction
- inactive requests are eligible for eviction
- stale eviction entries are skipped safely
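A sketch of this demand-driven eviction loop, using a toy single-pool allocator so the flow is concrete (illustrative only; not the project's real interfaces):

```cpp
#include <deque>
#include <iostream>
#include <map>
#include <set>
#include <string>

// Toy single-pool allocator, only to make the eviction flow concrete.
struct Allocator {
    int free_blocks = 0;
    std::map<std::string, int> held;  // request_id -> blocks currently held

    bool TryAllocate(const std::string& id, int blocks) {
        if (blocks > free_blocks) return false;  // insufficient_blocks
        free_blocks -= blocks;
        held[id] = blocks;
        return true;
    }
    void Free(const std::string& id) {
        auto it = held.find(id);
        if (it == held.end()) return;
        free_blocks += it->second;
        held.erase(it);
    }
};

struct EvictionState {
    std::deque<std::string> allocation_order;  // oldest request first
    std::set<std::string> active;              // active requests are never evicted
};

// Allocate; under pressure evict the oldest inactive request, then retry once.
bool AllocateWithEviction(Allocator& a, EvictionState& st,
                          const std::string& id, int blocks) {
    if (a.TryAllocate(id, blocks)) { st.allocation_order.push_back(id); return true; }

    for (auto it = st.allocation_order.begin(); it != st.allocation_order.end(); ++it) {
        if (st.active.count(*it)) continue;      // skip active requests
        if (!a.held.count(*it)) continue;        // skip stale entries safely
        a.Free(*it);                             // evict oldest inactive request
        st.allocation_order.erase(it);
        break;                                   // retry allocation once
    }
    if (a.TryAllocate(id, blocks)) { st.allocation_order.push_back(id); return true; }
    return false;                                // reject: eviction insufficient
}

int main() {
    Allocator a; a.free_blocks = 20;
    EvictionState st;
    AllocateWithEviction(a, st, "req-1", 12);
    st.active.insert("req-1");                   // req-1 is in flight: protected
    AllocateWithEviction(a, st, "req-2", 4);
    // req-3 needs 8 blocks but only 4 are free: req-2 is evicted, req-3 is admitted.
    std::cout << AllocateWithEviction(a, st, "req-3", 8) << "\n";  // prints 1
}
```

Example eviction trace: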
Triggering eviction...
Skipping active request req-1
Evicted oldest inactive request req-4 blocks=[node-a-block-0,node-a-block-1,node-a-block-2,node-a-block-3,node-a-block-4,node-a-block-5,node-a-block-6,node-a-block-7,node-a-block-8,node-a-block-9,node-a-block-10,node-a-block-11]
node-a free_blocks=20 allocated_blocks=42
Retrying allocation for req-5
Candidate node-a: free_blocks=20 required_blocks=38 rejected=insufficient_blocks
Candidate node-b: free_blocks=31 required_blocks=38 rejected=insufficient_blocks
Rejected request req-5: eviction insufficient

- Eviction is demand-driven
  - Triggered only when allocation fails under pressure
- Active requests are protected
  - Prevents reclaiming memory from in-use allocations
- Eviction is safe
  - Stale queue entries are skipped without corrupting allocator state
- Rejection is explicit
  - If eviction cannot free enough space, the request is rejected cleanly
This phase makes the system behave more realistically under overload:
Pressure → Eviction → Retry → Reject

It demonstrates how a cache-aware memory system can preserve correctness while responding to GPU memory pressure.
Tracks:
• cache entries (prefix → node mapping)
• session routing decisions
• cache access patterns (for eviction)

Maintains:
• available serving nodes
• capacity and utilization
• node liveness

Responsible for request routing:
exact cache hit → reuse cached result
prefix match → reuse partial computation
cache miss → select node + trigger cache fill

Orchestrates control-plane operations:
• cache registration
• node capacity updates
• routing delegation
• lifecycle management

Determines where new cache blocks are placed:
• least-loaded node selection
• capacity-aware routing

• KV cache size estimated per request
• routing constrained by GPU memory
• cache allocation consumes VRAM
• eviction reclaims VRAM
• requests rejected when memory insufficient
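A condensed view of that routing chain, with in-memory stand-ins for the components (illustrative only; not the project's actual router.h / placement_policy.h interfaces):

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

// In-memory stand-ins for the control-plane components.
struct MetadataStore {
    std::map<std::string, std::string> exact;     // prompt_hash -> node_id
    std::map<std::string, std::string> prefixes;  // cached prefix -> node_id

    std::optional<std::string> ExactMatch(const std::string& prompt_hash) const {
        auto it = exact.find(prompt_hash);
        if (it == exact.end()) return std::nullopt;
        return it->second;
    }
    // Longest cached prefix of the prompt, if any: (node_id, reused prefix length).
    std::optional<std::pair<std::string, int>> LongestPrefix(const std::string& prompt) const {
        std::optional<std::pair<std::string, int>> best;
        for (const auto& [prefix, node] : prefixes)
            if (prompt.rfind(prefix, 0) == 0 &&
                (!best || static_cast<int>(prefix.size()) > best->second))
                best.emplace(node, static_cast<int>(prefix.size()));
        return best;
    }
};

struct NodeRegistry {
    std::map<std::string, int> free_vram_mb;  // node_id -> free VRAM

    // Least-loaded node with enough free VRAM for the request's KV cache; "" if none.
    std::string PickNode(int required_kv_mb) const {
        std::string best; int best_free = -1;
        for (const auto& [node, free] : free_vram_mb)
            if (free >= required_kv_mb && free > best_free) { best = node; best_free = free; }
        return best;
    }
};

struct RouteDecision { std::string node_id; std::string kind; int reused_prefix_len = 0; };

RouteDecision Route(const MetadataStore& meta, const NodeRegistry& nodes,
                    const std::string& prompt, const std::string& prompt_hash,
                    int required_kv_mb) {
    if (auto n = meta.ExactMatch(prompt_hash)) return {*n, "exact_hit", 0};
    if (auto p = meta.LongestPrefix(prompt))   return {p->first, "prefix_hit", p->second};
    std::string n = nodes.PickNode(required_kv_mb);
    if (n.empty()) return {"", "rejected_insufficient_vram", 0};
    return {n, "miss_fill", 0};  // after inference, the new cache entry is registered
}
```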
# 🔁 Cache Lifecycle

```
Client Request
↓
Router
↓
Session Affinity (if exists)
↓
Exact Cache Hit?
↓
Prefix Match?
↓
Cache Miss
↓
Node Selection (least-loaded)
↓
[If full] Evict old cache entry
↓
Route request to node
↓
Inference completes
↓
Register new cache entry
↓
Update node capacity
↓
KV-backed Metadata Store (WAL)
```

This section describes the simplified latency model used to evaluate cache reuse and GPU-aware inference behavior.
The simulation models:
- routing overhead
- prefill (prompt processing)
- decode (generation)
- cache reuse via prefix matching
Each request is defined by:
- prompt_tokens
- output_tokens
- prefix_hit_tokens
Derived:
uncached_tokens = prompt_tokens - prefix_hit_tokens
Latency is computed as:
prefill_latency = uncached_tokens × prefill_cost
decode_latency = output_tokens × decode_cost
total_latency = routing_overhead + prefill_latency + decode_latency
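A minimal sketch of this model (the per-token costs below are illustrative placeholders, not the simulator's calibrated constants):

```cpp
#include <algorithm>
#include <iostream>

// Illustrative per-token costs in ms; the simulator's actual constants may differ.
constexpr double kRoutingOverheadMs   = 5.0;
constexpr double kPrefillCostPerToken = 0.7;
constexpr double kDecodeCostPerToken  = 2.0;

struct Request {
    int prompt_tokens;
    int output_tokens;
    int prefix_hit_tokens;  // prompt tokens whose KV blocks are already cached
};

// total = routing + uncached_tokens * prefill_cost + output_tokens * decode_cost
double SimulatedLatencyMs(const Request& r) {
    int uncached_tokens = std::max(0, r.prompt_tokens - r.prefix_hit_tokens);
    return kRoutingOverheadMs +
           uncached_tokens * kPrefillCostPerToken +
           r.output_tokens * kDecodeCostPerToken;
}

int main() {
    Request miss{2000, 100, 0};     // cache miss: full prefill
    Request hit{2000, 100, 2000};   // exact hit: prefill skipped for cached tokens
    std::cout << "miss: " << SimulatedLatencyMs(miss) << " ms\n"
              << "hit:  " << SimulatedLatencyMs(hit)  << " ms\n";
}
```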
A mixed workload includes:
- cache misses (no reuse)
- prefix hits (partial reuse)
Routing decisions determine cache reuse, directly impacting latency.
Eligible nodes: none
Rejected request: insufficient VRAM across all nodes required_kv_mb=1000
Average latency: 980 ms
Hit rate: 50%
Average hit latency: 555 ms
Average miss latency: 1405 ms
- Cache hits reduce latency by ~2.5× compared to misses
- Prefill dominates latency for cache misses
- Prefix reuse improves average latency but not tail latency
- GPU memory constraints influence both admission and performance
Routing → Cache Reuse → GPU Memory → Latency
- ✅ Exact cache lookup
- ✅ Longest-prefix cache reuse
- ✅ Session-affinity routing
- ✅ Cache fill after miss
- ✅ Node capacity tracking
- ✅ Capacity-aware placement
- ✅ Eviction on full node (LRU-style via access time)
- ✅ WAL-backed persistent metadata
- ✅ Crash recovery via deterministic WAL replay

model_id: llama-70b
prefix_hash: 9fa21ab
block_id: block-134
node_id: node-a

session_id: sess-101
model_id: llama-70b
node_id: node-a

llm-serving-cache/
├── CMakeLists.txt
├── README.md
│
├── include/
│ └── cache/
│ ├── cache_types.h
│ ├── metadata_store.h
│ ├── placement_policy.h
│ ├── router.h
│ ├── node_registry.h
│ └── coordinator.h
│
├── src/
│ ├── metadata_store.cpp
│ ├── placement_policy.cpp
│ ├── router.cpp
│ ├── node_registry.cpp
│ ├── coordinator.cpp
│ └── main.cpp
│
├── demos/
│ ├── cache_register_demo.cpp
│ ├── routing_demo.cpp
│ └── node_failure_demo.cpp
│
├── tests/
│ ├── metadata_store_test.cpp
│ ├── routing_test.cpp
│ └── placement_policy_test.cpp
│
└── docs/
├── architecture.md
├── api.md
    └── roadmap.md

Exact request routed to: node-a
Cache hit: yes
Prefix-match request routed to: node-a
Cache hit: yes
Miss request routed to: node-b
Cache hit: no
Registered new cache entry on: node-b
node-b used_capacity: 1/1
After fill routed to: node-b
Cache hit: yes
Evicted cache block from node: node-b

• In-memory metadata store
• Cache registration
• Session routing
• Node registry

• Exact cache hits
• Prefix-aware routing
• Session-affinity routing
• Cache fill after miss

• Node capacity tracking
• Capacity-aware placement
• Eviction on full nodes
• Complete cache lifecycle

• KV-backed metadata store using KV Shuttle
• WAL-based durability for cache and session metadata
• Crash recovery via WAL replay
• Deterministic reconstruction after restart

• Raft-backed metadata replication
• Leader-based coordination
• Fault-tolerant cache placement decisions

v0.4 complete — persistent, crash-consistent control plane prototype

This project builds on:
- VeriStore — WAL-based, crash-consistent KV storage engine with Raft replication
https://github.com/NasitSony/VeriStore