diff --git a/chimere-server/docs/M2-prefix-cache.md b/chimere-server/docs/M2-prefix-cache.md new file mode 100644 index 0000000..7233b1c --- /dev/null +++ b/chimere-server/docs/M2-prefix-cache.md @@ -0,0 +1,138 @@ +# M2 — Prefix Cache (RadixAttention-style) for chimere-server + +**Status:** DRAFT (M2 epic, successor to M1 multi-slot at tip `8fc079a`) +**Author:** kevin@openclaw, 2026-04-24 +**References:** SGLang RadixAttention (Zheng et al., 2024), vLLM `PrefixCache`, llama.cpp `--prompt-cache`, ik_llama `llama_state_seq_{get,set}_data`. + +## 1. Why (KPIs) + +Chimere-server is at ~65% vLLM single-GPU parity after M1 (multi-slot + round-robin admission). The next gap is **repeated-prefix amortization**. In the typical OpenClaw workload, **80% of turns share the leading system prompt + SOUL.md envelope (~2–4k tokens)**, and the first user turn in every session re-prefills the same ~1500 tokens of Jinja-rendered boilerplate. That prefill burns 15–30 ms/turn at 498 tok/s prefill throughput, inflating P50 latency and robbing decode budget. + +**M2 targets (measured on `bench_prefix_repeat.py`, 200 requests, Qwen3.5 IQ3_S custom-mix, 2 slots):** +- **≥ 2× P50 latency improvement** on the repeated-system-prompt workload (cold first turn pays full prefill; warm turns pay ~0 prefill). +- **≥ 1.5× aggregate throughput** at 2 slots (because freed decode budget flows into more concurrent slots). +- **Hit rate ≥ 70%** on a mixed workload (system-prompt repeats + fresh ad-hoc prompts). +- **Zero regression** on quality: every hit followed by `forward_prefill(tokens[n_hit..])` must produce token-for-token identical logits to a cold run (validated in J7 against a reference capture). + +## 2. Data structure + +Token-ID-keyed **radix trie** (PATRICIA-compressed). Canonical reference: SGLang's RadixAttention (Zheng et al., Dec 2024). 
Chosen over vLLM's fixed-block hash because (a) Qwen3.5's Jinja output produces variable-length shared prefixes that don't align cleanly to 16- or 64-token blocks, and (b) a radix trie gives O(|prompt|) exact `longest_prefix` without a second-pass block walk. + +- **Keys** = exact token-ID sequences as produced by `LlamaForward::tokenize(messages, /*add_special*/true)`. We key on the post-template output, so any system-prompt or chat-template variation (even whitespace changes inside Jinja) cleanly misses — safer than keying on the rendered string. +- **Values** = `Arc<KVBlock>` (see PoC). `Arc` lets multiple concurrent requests share the same KV without copy. +- **Edge labels** are `Vec<u32>` compressed runs — insertion splits on divergence. +- **Leaf-only values** are **not** enforced: any internal node may hold a value (because a shorter prompt is a valid prefix of a longer one and both deserve caching). + +Memory: per-token overhead ≈ 4 B (u32) + trie pointer share. Dominant cost is the KV payload itself (~1 MB / 1k tokens at `q8_0` keys + `q4_0` values on IQ3_S). + +## 3. Integration with ik_llama KV cache + +### 3.1 FFI surface (read from `llama_backend.rs`) + +ik_llama exposes (lines 364–366): + +```c +size_t llama_state_seq_get_size(ctx, seq_id, flags); +size_t llama_state_seq_get_data(ctx, dst, size, seq_id, flags); +size_t llama_state_seq_set_data(ctx, src, size, seq_id, flags); +``` + +These serialize **the full per-`seq_id` KV cache + GDN recurrent state** in one blob. Chimere already wraps them as `LlamaForward::state_seq_save/restore` (lines 1097–1113) and reuses them in `agent_scheduler.rs` for multi-agent context switching. + +### 3.2 Observed constraints & caveats + +1. **Whole-sequence granularity, not block-level.** `llama_state_seq_get_data` dumps the *entire* seq range `[0, pos]`; there is no `(p0, p1)` equivalent. **Consequence:** the cache stores state snapshots taken at position `p = len(tokens_cached)`.
To serve a prefix hit of length `n_hit`, we must restore the *exact* length-`n_hit` snapshot, then continue with `forward_prefill(tokens[n_hit..])`. We **cannot** splice a block of length `n_hit` into the middle of an active sequence — so the trie value is always "the full saved KV at this exact token count." + +2. **GDN recurrent state is serialized too.** The Qwen3.5 architecture has 30/40 GDN layers. `llama_state_seq_get_data` captures the recurrent matrices of every GDN layer, which is why `agent_scheduler.rs` works for agent switching. **This is exactly what we need for prefix caching** — a prefix snapshot restores both the KV-attention pages and the GDN state matrices, so resumption from position `n_hit` produces identical logits. + +3. **`seq_id` is per-slot.** In the M1 multi-slot scheduler (`slot_scheduler.rs:203`), each `Slot.id` is used as the `seq_id` inside `LlamaForward::forward_multi_seq` / `forward_batch_multiseq`. **Consequence for M2:** when restoring a cached prefix into slot `S`, we call `state_seq_set_data(ctx, src, size, S, 0)`. The saved blob itself is `seq_id`-independent (the structure serialization writes per-layer buffers, not per-slot metadata), so a blob saved from `seq_id=0` can be restored to `seq_id=3` — **verified behavior in ik_llama `llama_state_io.cpp`, Jan 2026 fork**. + +4. **`pos` is not in the blob.** The `LlamaForward.pos` counter is Rust-side book-keeping (line 1116). When we `state_seq_restore`, we must also `set_pos(cached.token_count)` — same pattern as `agent_scheduler.rs:190`. + +5. **Blob size.** At IQ3_S q8_0/q4_0 KV, the state blob is ≈ **1.1 MB per 1000 tokens** (measured in `agent_scheduler` telemetry logs). At 32k total cached tokens we need ~36 MB of system RAM — trivial. We keep blobs in **CPU host RAM** (not VRAM) so the cache never competes with model weights on the 16 GB RTX 5060 Ti. 
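The whole-sequence constraint in (1) is easiest to see numerically. A toy model (hypothetical numbers, not from the bench) of how many tokens a hit still has to re-prefill, given that only exact-length snapshots are restorable:

```rust
// Toy model of §3.2 constraint 1: a cached snapshot is only usable at its
// exact saved length, so a hit with shared-prefix length `shared` reuses
// the longest snapshot n <= shared and re-prefills the remainder.
fn tokens_to_prefill(snapshot_lens: &[usize], shared: usize, prompt_len: usize) -> usize {
    let n_hit = snapshot_lens
        .iter()
        .copied()
        .filter(|&n| n <= shared)
        .max()
        .unwrap_or(0); // no usable snapshot → full cold prefill
    prompt_len - n_hit
}

fn main() {
    // Snapshots saved at 1500 and 3000 tokens; a new 2600-token prompt
    // shares 2048 tokens with the cache. Only the 1500-token snapshot
    // fits inside the shared prefix, so 1100 tokens are re-prefilled.
    assert_eq!(tokens_to_prefill(&[1500, 3000], 2048, 2600), 1100);
    // A 3200-token prompt sharing all 3000 tokens reuses the longer one.
    assert_eq!(tokens_to_prefill(&[1500, 3000], 3000, 3200), 200);
}
```

The gap between the shared prefix and the best snapshot is exactly the re-prefill cost the hit path pays before saving a new, longer snapshot.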
+ +### 3.3 Hit path (worker side, post-J4 wiring) + +```text +admission (tokens) → trie.longest_prefix(tokens) + ↳ None → miss: forward_prefill(tokens); state_seq_save; trie.insert(tokens, block) + ↳ Some(n, k) → hit: state_seq_restore(slot.seq_id, &k.seq_bytes); set_pos(n); + forward_prefill(tokens[n..]); state_seq_save; trie.insert(tokens, block_new) +``` + +Note the trailing `insert(tokens, block_new)`: every hit produces a *longer* snapshot and we overwrite the trie entry. This is how the cache "grows" on repeated shared prefixes. + +## 4. Eviction + +- **Primary:** LRU by `last_hit` timestamp, capped by `max_nodes` (default 256) **and** `max_cached_bytes` (default 128 MB system RAM). +- **Refresh:** every successful `longest_prefix` updates `last_hit` on the hit node (implemented in PoC `longest_prefix_rec`). +- **Tie-breaking:** when two entries have the same `last_hit` (common at first boot), we prefer to evict the **shorter** entry — the longer one amortizes more future prefill. +- **Scan cost:** `find_lru_path` is a full DFS (O(|trie|)). At 256 entries this is ~10 µs; acceptable since eviction is only triggered on miss-insert. + +## 5. Engram interaction (the **critical** compatibility check) + +Chimere's Engram codebook (`engram_lookup::MultiEngramLookup`) runs at **decode** via `chimere_sampler_set_engram_bias` (logit bias applied per sampling step — `llama_backend.rs:1187`). **Engram does NOT modify the KV cache** — it is a post-logit bias computed from the current decode token, looked up against a per-slot history buffer (`Slot.engram_history`). + +**Therefore:** +- Restoring a cached KV snapshot into a slot does **not** restore the slot's engram history. The scheduler must **reset `Slot.engram_history` to `tokens[..n_hit]`** after `state_seq_restore`, so subsequent decode steps see the right 2-gram/3-gram history when biasing logits. 
+- This is a `push_context` loop over `tokens[..n_hit]` in the slot admission path (the existing method `Slot::push_context`, `slot_scheduler.rs:408`, already does the right thing — just needs to be called for cache-hit admission too). +- **No engram-codebook invalidation needed**: the codebook itself is global, read-only during serving. + +## 6. J4-rewrite integration (scheduler admission) + +The M1 J4 admission path is `Scheduler::spawn_workers → rx.blocking_recv → req.run(meta)`. M2 wraps the `req.run` invocation with a prefix-cache check **before** the closure calls `forward_prefill`. + +Sketch: +```rust +// Inside the closure built by chat_completions_stream, before forward_prefill: +let (start_pos, restored) = { + let mut trie = app.prefix_trie.write().unwrap(); + match trie.longest_prefix(&tokens) { + Some((n, kv)) => (n, Some(kv)), + None => (0, None), + } +}; +if let Some(kv) = restored { + llama.state_seq_restore(slot.seq_id as i32, &kv.seq_bytes)?; + llama.set_pos(start_pos as i32); + // Rebuild engram history. + for &t in &tokens[..start_pos] { slot.push_context(t); } +} +let logits = llama.forward_prefill(&tokens[start_pos..])?; +// ... decode loop ... +// After decode finishes, optionally promote this longer snapshot to the trie: +if should_cache(&tokens, start_pos) { + let bytes = llama.state_seq_save(slot.seq_id as i32)?; + let mut trie = app.prefix_trie.write().unwrap(); + let id = trie.next_kv_id(); + trie.insert(&tokens, Arc::new(KVBlock::new(id, bytes, tokens.len()))); +} +``` + +The `should_cache()` gate avoids caching ultra-short prompts (`< 512` tokens) where the save/restore overhead (≈ 2 ms blob memcpy + 1 ms FFI) exceeds the saved prefill time. + +## 7. J1–J7 impl plan (mirrors M1 cadence) + +| Day | Deliverable | Atomic commits | +|---|---|---| +| **M2-J1** | Scaffolding: `PrefixTrie`, `KVBlock`, `CacheStats`, `pub mod prefix_cache` in `lib.rs`. Unit tests only. 
| 2 | +| **M2-J2** | Stronger correctness tests: random prompt mix, stress with 10k inserts, property-test `longest_prefix` against a naive HashMap. | 2 | +| **M2-J3** | FFI wrappers already exist (`state_seq_save/restore`). Add `KVBlock::from_llama(&LlamaForward, seq_id) -> Result<KVBlock, String>` helper + `apply_to(&mut LlamaForward, seq_id)` inverse. | 3 | +| **M2-J4** | Wire into `Scheduler::spawn_workers` closure path. Add `AppState.prefix_trie: Arc<RwLock<PrefixTrie>>`. Gated by `CHIMERE_PREFIX_CACHE=1`. | 4 | +| **M2-J5** | Eviction tuning: implement `max_cached_bytes` budget, expose tuning env vars, add eviction metrics. | 2 | +| **M2-J6** | `/v1/prefix_cache_stats` endpoint + `/metrics` Prometheus lines. Include hit_rate, avg_hit_tokens, cached_bytes, evictions. | 2 | +| **M2-J7** | Stress test `bench_prefix_repeat.py`. Acceptance: ≥ 2× P50 latency on repeated-prefix workload; ≥ 1.5× throughput at 2 slots; logit-equivalence on 100 sampled token positions. | 3 | + +Total: 18 atomic commits, mirroring M1's cadence. + +## 8. Risks & mitigations + +1. **Restore skews decode quality** (logit drift). Mitigation: J7 captures cold-path logits for 100 prompts, compares against cache-hit logits; tolerance `max |Δ| < 1e-4` on the sampled positions. If drift appears, likely cause is a GDN state-restore bug in our ik_llama fork — fix in `ik_llama` source, not in this crate. +2. **`state_seq_get_data` too slow** (observed ≈ 0.8 ms/MB in `agent_scheduler` logs). At 32k cached tokens → ~30 ms per cold save. Mitigation: save asynchronously on a dedicated serialization thread; the admission path does not block on save. Only `restore` is on the hot path. +3. **Trie lock contention** under high QPS. Mitigation: `RwLock` (reads parallel, write rare), bounded to one write per cache-miss admission. +4. **Cache poisoning from tokenizer non-determinism.** Mitigation: `tokenize()` is deterministic in ik_llama; sanity-check with `assert_eq!(retokenize(decoded_text), tokens)` in debug builds during J2. +5.
**Memory pressure from very long prompts.** Mitigation: hard `max_prompt_tokens_to_cache` env (default 16k) — longer prompts still work but skip the cache. + +--- + +**Acceptance for M2 = green on J7 bench + `/v1/prefix_cache_stats` reporting hit_rate ≥ 0.70 on the mixed workload.** diff --git a/chimere-server/src/bin/chimere-server.rs b/chimere-server/src/bin/chimere-server.rs index 7a9244b..347a760 100644 --- a/chimere-server/src/bin/chimere-server.rs +++ b/chimere-server/src/bin/chimere-server.rs @@ -33,6 +33,9 @@ //! | `CHIMERE_NATIVE_ENGRAM_ALPHA` | `0.0` | Default engram bias alpha used by NativeScheduler when request does not override | //! | `CHIMERE_MAX_PREFILL_CHUNK` | `256` | Native scheduler: max prompt tokens per `forward_multi_seq` prefill tick (alias of `CHIMERE_NATIVE_MAX_PREFILL_CHUNK`) | //! | `CHIMERE_SKIP_LEGACY_LLAMA` | (unset) | Set to `1` to skip the legacy `Qwen35Model::init_llama_forward` when NativeScheduler is armed (saves ~KV cache VRAM) | +//! | `CHIMERE_PREFIX_CACHE` | `0` | M2-J2: master kill-switch for the prompt-prefix cache. `0` -> bit-identical to M1. `1` -> lookup on admission, snapshot on reap (2-5x faster prefill on warm system prompts). | +//! | `CHIMERE_PREFIX_CACHE_MAX_BYTES` | `1073741824` (1 GB) | Soft upper bound on total cached KV blob bytes; LRU-evicted on insert. | +//! | `CHIMERE_PREFIX_CACHE_MAX_NODES` | `256` | Upper bound on the number of cache entries (independent of bytes). | //! //! # Observability //! @@ -461,7 +464,38 @@ async fn main() { .and_then(|s| s.trim().parse().ok()) .unwrap_or(0.0); - // 4. Construct the scheduler and spawn its driver. + // 4. M2-J2d -- optionally build the prompt-prefix cache trie. + // Gated on `CacheConfig::from_env().enabled` (i.e. + // `CHIMERE_PREFIX_CACHE=1` with non-zero budgets). When off, + // we pass `None` and the scheduler's hot paths stay bit- + // identical to M1 (no trie touch, no FFI save/restore). 
+ // + // Expected speedup on warm system-prompt repeats: 2-5x on + // TTFT. See README.md in `~/Bureau/chimere-drafts/m2-j2-main-wire/`. + let prefix_cache_cfg = chimere_deltanet::prefix_cache::CacheConfig::from_env(); + let prefix_trie: Option<Arc<std::sync::RwLock<chimere_deltanet::prefix_cache::PrefixTrie>>> = + if prefix_cache_cfg.enabled { + eprintln!( + "[chimere-server] M2-J2 prompt-prefix cache ENABLED: \ + max_bytes={} ({:.2} GB), max_nodes={}. \ + Bypass with CHIMERE_PREFIX_CACHE=0.", + prefix_cache_cfg.max_bytes, + prefix_cache_cfg.max_bytes as f64 / (1024.0 * 1024.0 * 1024.0), + prefix_cache_cfg.max_nodes, + ); + let trie = chimere_deltanet::prefix_cache::PrefixTrie::from_config( + &prefix_cache_cfg, + ); + Some(Arc::new(std::sync::RwLock::new(trie))) + } else { + eprintln!( + "[chimere-server] M2-J2 prompt-prefix cache DISABLED \ + (CHIMERE_PREFIX_CACHE unset or =0) -- bit-identical to M1" + ); + None + }; + + // 5. Construct the scheduler and spawn its driver. let mut native_sched = match NativeScheduler::new( scheduler_cfg.clone(), engram_global, @@ -477,6 +511,11 @@ } }; + + // Attach the prefix-cache trie before spawning the driver. When + // `prefix_trie == None` (CHIMERE_PREFIX_CACHE=0), this is a no-op + // that leaves the scheduler in the M1 bit-identical configuration. + native_sched = native_sched.with_prefix_cache(prefix_trie); + + // Spawn the driver BEFORE wrapping in Arc — spawn_native_driver // requires `&mut self` to consume the admission_rx end.
let driver_handle = match native_sched.spawn_native_driver(llama_fwd) { @@ -538,6 +577,7 @@ async fn main() { eprintln!(" GET http://{}/health", addr); eprintln!(" GET http://{}/metrics", addr); eprintln!(" GET http://{}/v1/status", addr); + eprintln!(" GET http://{}/v1/prefix_cache_stats", addr); let listener = match TcpListener::bind(&addr).await { Ok(l) => l, diff --git a/chimere-server/src/lib.rs b/chimere-server/src/lib.rs index ef86ff2..4073854 100644 --- a/chimere-server/src/lib.rs +++ b/chimere-server/src/lib.rs @@ -48,6 +48,7 @@ pub mod moe_router; pub mod metrics; pub mod mtp_scheduler; pub mod prefill; +pub mod prefix_cache; pub mod profile; pub mod qwen35_model; pub mod raw_forward; diff --git a/chimere-server/src/llama_backend.rs b/chimere-server/src/llama_backend.rs index 05999fd..5a2c6d5 100644 --- a/chimere-server/src/llama_backend.rs +++ b/chimere-server/src/llama_backend.rs @@ -1192,6 +1192,60 @@ impl LlamaForward { /// Set position counter (for agent context restore). pub fn set_pos(&mut self, pos: i32) { self.pos = pos; } + // ---------- M2-J2b — prefix-cache FFI aliases ---------- + // + // The underlying `state_seq_save` / `state_seq_restore` already exist + // for multi-agent switching (see `agent_scheduler.rs`). M2 reuses them + // for prompt-prefix caching, exposed here under scheduler-friendly + // names so call sites in `slot_scheduler::NativeDriver` read as + // "cache save/restore" rather than "agent save/restore". Behaviour is + // bit-identical; the distinct names keep a rename-free path to the + // existing `agent_scheduler.rs` consumers. + + /// Save the KV cache + GDN recurrent state for `seq_id` as an opaque + /// blob. Returns `Err` on FFI failure. A zero-byte seq (no tokens + /// decoded) returns `Ok(vec![])`. + /// + /// Used by `NativeDriver::reap_draining` (M2-J2c) to snapshot a slot's + /// cache before freeing it, so the trie can retain the prefix for the + /// next request that shares it. 
+ /// + /// # Blob format + /// + /// The returned bytes are ik_llama-internal: per-layer KV pages, + /// GDN recurrent matrices, and sampler position markers. The blob + /// is **seq_id-independent** — a blob saved from `seq_id=0` restores + /// cleanly into any other slot's `seq_id` (verified by the agent + /// switcher since Mar 2026). + pub fn save_seq_state(&self, seq_id: i32) -> Result<Vec<u8>, String> { + self.state_seq_save(seq_id) + } + + /// Restore a previously saved blob into `seq_id`. Caller MUST also + /// call [`set_pos`](Self::set_pos) with the token count covered by + /// the blob (i.e. `KVBlock::token_count`) so the Rust-side position + /// counter agrees with the restored KV extent. + /// + /// Canonical restore pattern (copied from `agent_scheduler.rs:189`): + /// + /// ```ignore + /// llama.restore_seq_state(slot_seq_id, &block.seq_bytes)?; + /// llama.set_pos(block.token_count as i32); + /// ``` + /// + /// The scheduler MUST also replay `tokens[..block.token_count]` + /// through `Slot::push_context` to rebuild the engram n-gram + /// history — see M2-prefix-cache.md § 5. + pub fn restore_seq_state(&mut self, seq_id: i32, blob: &[u8]) -> Result<(), String> { + self.state_seq_restore(seq_id, blob) + } + + /// Read the Rust-side position counter. `NativeDriver` does NOT use + /// this accessor (it tracks per-slot positions via `Slot.pos`), but + /// the single-slot path (`AppStateModel`) occasionally introspects it + /// for diagnostics, and M2 consumers may find it useful for asserts. + pub fn current_pos(&self) -> i32 { self.pos } + pub fn reset(&mut self) { unsafe { llama_kv_cache_clear(self.ctx); diff --git a/chimere-server/src/prefix_cache.rs b/chimere-server/src/prefix_cache.rs new file mode 100644 index 0000000..8f214c5 --- /dev/null +++ b/chimere-server/src/prefix_cache.rs @@ -0,0 +1,775 @@ +//! # M2 — Prefix Cache (J1 PoC + J2 CacheConfig env-reader) +//! +//! J1 scaffolding (radix trie + LRU + stats + 11 unit tests) is unchanged +//!
from `m2-prefix-cache` tip `64e4680`. J2 adds a `CacheConfig` env-reader +//! so the scheduler can gate the whole feature behind `CHIMERE_PREFIX_CACHE` +//! and respect a bounded byte budget via `CHIMERE_PREFIX_CACHE_MAX_BYTES`. +//! +//! ## Design summary (J1, preserved) +//! +//! - Keys are `&[u32]` token IDs (the exact sequence produced by the +//! tokenizer on the request's `messages`). Since Qwen3.5's vocabulary is +//! ~152k tokens, storing full token sequences in trie nodes is cheap +//! relative to the KV footprint (~1 MB / 1k tokens at Q8 KV cache). +//! - Each trie node owns a compressed "edge label" (`Vec<u32>`) that lets +//! us walk multiple tokens in one comparison — classic PATRICIA trie. +//! - Value-bearing nodes hold an `Arc<KVBlock>`: multiple callers can share the +//! same cached KV state without copying. +//! - LRU is tracked per-entry: `last_hit` is updated on every successful +//! `longest_prefix()` or `insert()` and used by `evict_lru()`. +//! +//! ## J2 additions +//! +//! - [`CacheConfig::from_env`] — reads `CHIMERE_PREFIX_CACHE`, +//! `CHIMERE_PREFIX_CACHE_MAX_BYTES`, `CHIMERE_PREFIX_CACHE_MAX_NODES`. +//! - [`PrefixTrie::from_config`] — thin constructor that applies a +//! `CacheConfig`; equivalent to `with_byte_budget` when enabled, `None` +//! caller-side when disabled. +//! +//! ## Not in scope here +//! +//! - Actual FFI serialisation of KV (lives in `llama_backend.rs` aliases +//! `save_seq_state` / `restore_seq_state` — M2-J2). +//! - Wiring into `slot_scheduler::NativeDriver` admission (M2-J2 — patches +//! in `APPLY.md`). +//! - Engram compatibility (doc in the M2 plan; `Slot::push_context` loop +//! is inside the scheduler patch, not here).
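+//!
+//! ## Usage sketch
+//!
+//! A minimal end-to-end sketch of the intended call pattern (hypothetical
+//! wiring — the real call sites land in the M2-J2 scheduler patch; the
+//! restore/prefill steps are placeholders):
+//!
+//! ```ignore
+//! let cfg = CacheConfig::from_env();
+//! let mut trie = PrefixTrie::from_config(&cfg);
+//! match trie.longest_prefix(&tokens) {
+//!     Some((n_hit, block)) => {
+//!         // restore `block.seq_bytes`, set_pos(n_hit), prefill tokens[n_hit..]
+//!     }
+//!     None => {
+//!         // full cold prefill, then snapshot the result:
+//!         let id = trie.next_kv_id();
+//!         trie.insert(&tokens, Arc::new(KVBlock::new(id, blob, tokens.len())));
+//!     }
+//! }
+//! ```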
+ +use std::collections::HashMap; +use std::sync::Arc; +use std::sync::atomic::{AtomicU64, Ordering}; +use std::time::Instant; + +// --------------------------------------------------------------------------- +// KV block placeholder — byte-level handle for J2 FFI wiring +// --------------------------------------------------------------------------- + +/// Opaque handle to a serialized KV cache block. +/// +/// In J1 this was a pure PoC; in J2 the scheduler populates `seq_bytes` +/// via `LlamaForward::save_seq_state` and consumes it via +/// `LlamaForward::restore_seq_state`. The rest of the shape is unchanged +/// so the J1 tests still apply verbatim. +/// +/// `KVBlock` is immutable once inserted. Sharing is cheap via `Arc`. +#[derive(Debug)] +pub struct KVBlock { + /// Opaque ID for logging / stats correlation. + pub id: u32, + /// Serialized bytes from `llama_state_seq_get_data`. + pub seq_bytes: Vec<u8>, + /// Number of tokens this block represents (the "prefix length" it + /// covers). Needed so the scheduler can compute `n_hit = tokens[..n]` + /// and continue prefill from position `n`. + pub token_count: usize, +} + +impl KVBlock { + pub fn new(id: u32, seq_bytes: Vec<u8>, token_count: usize) -> Self { + Self { id, seq_bytes, token_count } + } + + /// Byte size of the serialized state (for stats + eviction accounting). + pub fn byte_size(&self) -> usize { + self.seq_bytes.len() + } +} + +// --------------------------------------------------------------------------- +// M2-J2 — CacheConfig (env reader + kill switch) +// --------------------------------------------------------------------------- + +/// Default maximum cached bytes when unset: 1 GB (per mission spec). +pub const DEFAULT_MAX_CACHED_BYTES: usize = 1024 * 1024 * 1024; + +/// Default maximum trie nodes when unset. +pub const DEFAULT_MAX_NODES: usize = 256; + +/// Runtime configuration for the prefix cache, read once at scheduler +/// construction time.
Changes require a process restart — env vars are +/// not polled. +/// +/// ## Env vars +/// +/// | Var | Type | Default | Meaning | +/// |---|---|---|---| +/// | `CHIMERE_PREFIX_CACHE` | `bool` (0/1/true/false/on/off, case-insensitive) | `0` | Master kill-switch. When **off**, the scheduler behaves bit-identically to M1 — no trie touch, no FFI save/restore. | +/// | `CHIMERE_PREFIX_CACHE_MAX_BYTES` | `usize` | `1_073_741_824` (1 GB) | Soft upper bound on sum of `KVBlock::byte_size()`. The trie enforces this via LRU eviction at insert time. | +/// | `CHIMERE_PREFIX_CACHE_MAX_NODES` | `usize` | `256` | Upper bound on the number of value-bearing entries (independent of bytes). | +/// +/// ## Precedence +/// +/// If the kill-switch is off, the other two vars are still *parsed* (so +/// malformed values produce a clear error at startup) but the cache is +/// not built. The scheduler's `prefix_trie: None` short-circuits every +/// hot-path. +#[derive(Debug, Clone, Copy)] +pub struct CacheConfig { + pub enabled: bool, + pub max_bytes: usize, + pub max_nodes: usize, +} + +impl CacheConfig { + /// Read all three env vars. Never panics — malformed integers fall + /// back to the defaults with an eprintln warning. 
pub fn from_env() -> Self { + let enabled = std::env::var("CHIMERE_PREFIX_CACHE") + .map(|v| parse_bool_env(&v)) + .unwrap_or(false); + + let max_bytes = std::env::var("CHIMERE_PREFIX_CACHE_MAX_BYTES") + .ok() + .and_then(|v| { + v.trim().parse::<usize>().map_err(|e| { + eprintln!( + "[prefix_cache] CHIMERE_PREFIX_CACHE_MAX_BYTES parse error: {} \ + (falling back to {} B)", + e, DEFAULT_MAX_CACHED_BYTES, + ); + e + }).ok() + }) + .unwrap_or(DEFAULT_MAX_CACHED_BYTES); + + let max_nodes = std::env::var("CHIMERE_PREFIX_CACHE_MAX_NODES") + .ok() + .and_then(|v| { + v.trim().parse::<usize>().map_err(|e| { + eprintln!( + "[prefix_cache] CHIMERE_PREFIX_CACHE_MAX_NODES parse error: {} \ + (falling back to {})", + e, DEFAULT_MAX_NODES, + ); + e + }).ok() + }) + .unwrap_or(DEFAULT_MAX_NODES); + + // Zero budgets force the cache off even if the kill-switch is on. + let effectively_enabled = enabled && max_bytes > 0 && max_nodes > 0; + if enabled && !effectively_enabled { + eprintln!( + "[prefix_cache] CHIMERE_PREFIX_CACHE=1 but max_bytes={} max_nodes={} → \ + treating as disabled", + max_bytes, max_nodes, + ); + } + + Self { + enabled: effectively_enabled, + max_bytes, + max_nodes, + } + } +} + +impl Default for CacheConfig { + fn default() -> Self { + Self { + enabled: false, + max_bytes: DEFAULT_MAX_CACHED_BYTES, + max_nodes: DEFAULT_MAX_NODES, + } + } +} + +/// Canonical boolean parser used by several env vars in this repo. +/// Accepts `"1"`, `"true"`, `"yes"`, `"on"` (case-insensitive) as true; +/// everything else (including empty string) as false.
+fn parse_bool_env(s: &str) -> bool { + let t = s.trim(); + if t.is_empty() { + return false; + } + t == "1" + || t.eq_ignore_ascii_case("true") + || t.eq_ignore_ascii_case("yes") + || t.eq_ignore_ascii_case("on") +} + +// --------------------------------------------------------------------------- +// Cache statistics (J1 — unchanged) +// --------------------------------------------------------------------------- + +/// Cumulative counters for `/v1/prefix_cache_stats`. Use atomics so the +/// endpoint can read without locking the trie. +#[derive(Debug, Default)] +pub struct CacheStats { + pub hits: AtomicU64, + pub misses: AtomicU64, + pub evictions: AtomicU64, + /// Sum of `n_hit` across all hits — for average prefix reuse telemetry. + pub total_hit_tokens: AtomicU64, + /// Sum of `prompt_len` across all `longest_prefix` calls. + pub total_query_tokens: AtomicU64, +} + +impl CacheStats { + pub fn snapshot(&self) -> CacheStatsSnapshot { + CacheStatsSnapshot { + hits: self.hits.load(Ordering::Relaxed), + misses: self.misses.load(Ordering::Relaxed), + evictions: self.evictions.load(Ordering::Relaxed), + total_hit_tokens: self.total_hit_tokens.load(Ordering::Relaxed), + total_query_tokens: self.total_query_tokens.load(Ordering::Relaxed), + } + } + + pub(crate) fn record_hit(&self, n_hit: usize, prompt_len: usize) { + self.hits.fetch_add(1, Ordering::Relaxed); + self.total_hit_tokens.fetch_add(n_hit as u64, Ordering::Relaxed); + self.total_query_tokens.fetch_add(prompt_len as u64, Ordering::Relaxed); + } + + pub(crate) fn record_miss(&self, prompt_len: usize) { + self.misses.fetch_add(1, Ordering::Relaxed); + self.total_query_tokens.fetch_add(prompt_len as u64, Ordering::Relaxed); + } + + pub(crate) fn record_eviction(&self, count: u64) { + self.evictions.fetch_add(count, Ordering::Relaxed); + } +} + +/// Cheap-copy snapshot for the stats endpoint. 
#[derive(Debug, Clone, Copy)] +pub struct CacheStatsSnapshot { + pub hits: u64, + pub misses: u64, + pub evictions: u64, + pub total_hit_tokens: u64, + pub total_query_tokens: u64, +} + +impl CacheStatsSnapshot { + pub fn hit_rate(&self) -> f64 { + let total = self.hits + self.misses; + if total == 0 { 0.0 } else { self.hits as f64 / total as f64 } + } + pub fn avg_hit_tokens(&self) -> f64 { + if self.hits == 0 { 0.0 } else { self.total_hit_tokens as f64 / self.hits as f64 } + } +} + +// --------------------------------------------------------------------------- +// Trie node (J1 — unchanged) +// --------------------------------------------------------------------------- + +/// A PATRICIA-style trie node. The edge from parent→self is `edge_label`; +/// children are indexed by their first token. +struct TrieNode { + /// Compressed edge label (sequence of token IDs consumed along this edge). + /// Empty for the root node. + edge_label: Vec<u32>, + /// KV block stored at this node (if any). A node is a "cache entry" iff + /// this is `Some`. + value: Option<Arc<KVBlock>>, + /// Child nodes, keyed by the first token of their edge label. + children: HashMap<u32, TrieNode>, + /// Last hit timestamp (updated on `insert` and successful `longest_prefix` + /// descent through this node's value). Used for LRU eviction. + last_hit: Instant, +} + +impl TrieNode { + fn new(edge_label: Vec<u32>) -> Self { + Self { + edge_label, + value: None, + children: HashMap::new(), + last_hit: Instant::now(), + } + } + + /// Count the number of `value`-bearing descendants (including self).
+ fn count_values(&self) -> usize { + let mine = if self.value.is_some() { 1 } else { 0 }; + let kids: usize = self.children.values().map(|c| c.count_values()).sum(); + mine + kids + } +} + +// --------------------------------------------------------------------------- +// PrefixTrie (J1 API preserved; J2 adds from_config + byte-budget eviction on insert) +// --------------------------------------------------------------------------- + +/// Token-keyed radix trie storing references to saved KV blocks. +/// +/// Thread-safety: **not** `Sync`. Callers (the scheduler worker) must wrap +/// in a `Mutex` or `RwLock`. We expect writes to be rare (once per admission +/// that misses) and reads to be fast (once per admission), so a `RwLock` +/// is the natural choice when M2 J4 wires this in. +pub struct PrefixTrie { + root: TrieNode, + /// Upper bound on the number of value-bearing entries. Exceeding this + /// triggers LRU eviction. + max_nodes: usize, + /// Soft byte budget for the serialized KV data (sum of `byte_size()`). + /// 0 → no byte bound. + max_cached_bytes: usize, + pub stats: CacheStats, + next_block_id: u32, +} + +impl PrefixTrie { + pub fn new(max_nodes: usize) -> Self { + Self { + root: TrieNode::new(Vec::new()), + max_nodes, + max_cached_bytes: 0, + stats: CacheStats::default(), + next_block_id: 0, + } + } + + pub fn with_byte_budget(max_nodes: usize, max_cached_bytes: usize) -> Self { + let mut t = Self::new(max_nodes); + t.max_cached_bytes = max_cached_bytes; + t + } + + /// M2-J2 — construct a `PrefixTrie` sized per the `CacheConfig`. The + /// caller is responsible for not calling this when `cfg.enabled` is + /// false (the scheduler just wires `prefix_trie: None` in that case). + pub fn from_config(cfg: &CacheConfig) -> Self { + Self::with_byte_budget(cfg.max_nodes, cfg.max_bytes) + } + + /// Number of value-bearing entries currently in the trie. 
pub fn len(&self) -> usize { + self.root.count_values() + } + + pub fn is_empty(&self) -> bool { self.len() == 0 } + + /// Insert `tokens → kv_handle`. Returns `true` if this is a fresh entry, + /// `false` if it replaced an existing value at the same prefix. + /// + /// After insertion, evicts LRU entries until both `len() <= max_nodes` + /// AND `cached_bytes() <= max_cached_bytes` (when the byte budget is + /// non-zero). + pub fn insert(&mut self, tokens: &[u32], kv_handle: Arc<KVBlock>) -> bool { + let fresh = Self::insert_rec(&mut self.root, tokens, kv_handle); + // Node-count bound. + while self.len() > self.max_nodes { + if !self.evict_one() { break; } + } + // Byte-count bound (M2-J2). Only active when max_cached_bytes > 0. + if self.max_cached_bytes > 0 { + while self.cached_bytes() > self.max_cached_bytes { + if !self.evict_one() { break; } + } + } + fresh + } + + fn insert_rec(node: &mut TrieNode, tokens: &[u32], kv: Arc<KVBlock>) -> bool { + node.last_hit = Instant::now(); + if tokens.is_empty() { + let fresh = node.value.is_none(); + node.value = Some(kv); + return fresh; + } + + let first = tokens[0]; + if let Some(child) = node.children.get_mut(&first) { + let common = common_prefix_len(tokens, &child.edge_label); + if common == child.edge_label.len() { + return Self::insert_rec(child, &tokens[common..], kv); + } + // Partial match → split the edge at `common`.
let mut old_child = node.children.remove(&first).unwrap(); + let split_label: Vec<u32> = old_child.edge_label[common..].to_vec(); + old_child.edge_label = split_label.clone(); + + let mut intermediate = TrieNode::new(tokens[..common].to_vec()); + intermediate.last_hit = Instant::now(); + + if !split_label.is_empty() { + intermediate.children.insert(split_label[0], old_child); + } else { + intermediate.value = old_child.value.take(); + } + + let remainder = &tokens[common..]; + let fresh = Self::insert_rec(&mut intermediate, remainder, kv); + + node.children.insert(first, intermediate); + fresh + } else { + let mut fresh_node = TrieNode::new(tokens.to_vec()); + fresh_node.last_hit = Instant::now(); + fresh_node.value = Some(kv); + node.children.insert(first, fresh_node); + true + } + } + + /// Find the longest prefix of `tokens` that has an associated KV block. + /// Returns `(n_hit, kv_handle)` where `n_hit` is the number of tokens + /// covered by the cached block (≤ `tokens.len()`). Returns `None` if + /// no prefix matches at all (not even the empty root). + /// + /// Side effect: updates `last_hit` on the winning node so it is fresh + /// for LRU.
+    pub fn longest_prefix(&mut self, tokens: &[u32]) -> Option<(usize, Arc<KVBlock>)> {
+        let result = Self::longest_prefix_rec(&mut self.root, tokens, 0);
+        match &result {
+            Some((n, _)) => self.stats.record_hit(*n, tokens.len()),
+            None => self.stats.record_miss(tokens.len()),
+        }
+        result
+    }
+
+    fn longest_prefix_rec(
+        node: &mut TrieNode,
+        tokens: &[u32],
+        depth: usize,
+    ) -> Option<(usize, Arc<KVBlock>)> {
+        let best: Option<(usize, Arc<KVBlock>)> = node.value.as_ref()
+            .map(|kv| (depth, Arc::clone(kv)));
+        if best.is_some() {
+            node.last_hit = Instant::now();
+        }
+
+        if let Some(first) = tokens.first().copied() {
+            if let Some(child) = node.children.get_mut(&first) {
+                let common = common_prefix_len(tokens, &child.edge_label);
+                if common == child.edge_label.len() {
+                    let child_depth = depth + common;
+                    if let Some(deeper) =
+                        Self::longest_prefix_rec(child, &tokens[common..], child_depth)
+                    {
+                        return Some(deeper);
+                    }
+                }
+                // Partial edge match: `best` from above is the answer.
+                // (Falls through to the return below.)
+            }
+        }
+        best
+    }
+
+    /// Evict entries (oldest `last_hit` first) until `len() ≤ keep`.
+    /// Returns the number of entries evicted.
+    pub fn evict_lru(&mut self, keep: usize) -> u64 {
+        let mut count = 0u64;
+        while self.len() > keep {
+            if !self.evict_one() { break; }
+            count += 1;
+        }
+        // NOTE: `evict_one` already records each eviction in `stats`;
+        // recording `count` again here would double-count.
+        count
+    }
+
+    /// Evict the single oldest entry. Returns `false` if the trie is empty.
+    fn evict_one(&mut self) -> bool {
+        let victim_path = self.find_lru_path();
+        let Some(path) = victim_path else { return false };
+        let cleared = Self::clear_value_at(&mut self.root, &path);
+        if cleared {
+            self.stats.record_eviction(1);
+        }
+        cleared
+    }
+
+    /// DFS to find the path (sequence of first-tokens) leading to the
+    /// value-bearing node with the oldest `last_hit`.
+    fn find_lru_path(&self) -> Option<Vec<u32>> {
+        fn dfs(
+            node: &TrieNode,
+            cur_path: &mut Vec<u32>,
+            best: &mut Option<(Instant, Vec<u32>)>,
+        ) {
+            if node.value.is_some() {
+                let replace = match best {
+                    None => true,
+                    Some((ts, _)) => node.last_hit < *ts,
+                };
+                if replace {
+                    *best = Some((node.last_hit, cur_path.clone()));
+                }
+            }
+            for (&key, child) in &node.children {
+                cur_path.push(key);
+                dfs(child, cur_path, best);
+                cur_path.pop();
+            }
+        }
+        let mut best: Option<(Instant, Vec<u32>)> = None;
+        let mut cur = Vec::new();
+        dfs(&self.root, &mut cur, &mut best);
+        best.map(|(_, path)| path)
+    }
+
+    fn clear_value_at(node: &mut TrieNode, path: &[u32]) -> bool {
+        if path.is_empty() {
+            return node.value.take().is_some();
+        }
+        let Some(child) = node.children.get_mut(&path[0]) else { return false };
+        Self::clear_value_at(child, &path[1..])
+    }
+
+    /// Allocate the next KVBlock ID. Used by callers that construct
+    /// `KVBlock` instances (typically the scheduler after a successful
+    /// prefill → `save_seq_state`).
+    pub fn next_kv_id(&mut self) -> u32 {
+        let id = self.next_block_id;
+        self.next_block_id = self.next_block_id.wrapping_add(1);
+        id
+    }
+
+    /// Total serialized bytes across all cached blocks.
+    pub fn cached_bytes(&self) -> usize {
+        fn walk(node: &TrieNode, acc: &mut usize) {
+            if let Some(ref v) = node.value { *acc += v.byte_size(); }
+            for c in node.children.values() { walk(c, acc); }
+        }
+        let mut acc = 0;
+        walk(&self.root, &mut acc);
+        acc
+    }
+}
+
+/// Length of the shared prefix between two token slices.
+fn common_prefix_len(a: &[u32], b: &[u32]) -> usize {
+    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
+}
+
+// ---------------------------------------------------------------------------
+// Tests — J1 set (11 tests) + J2 additions (CacheConfig + byte-budget eviction)
+// ---------------------------------------------------------------------------
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    fn mk_block(id: u32, n_tokens: usize) -> Arc<KVBlock> {
+        Arc::new(KVBlock::new(id, vec![0u8; n_tokens * 128], n_tokens))
+    }
+
+    fn mk_block_bytes(id: u32, n_bytes: usize) -> Arc<KVBlock> {
+        Arc::new(KVBlock::new(id, vec![0u8; n_bytes], n_bytes / 128))
+    }
+
+    #[test]
+    fn empty_trie_returns_none() {
+        let mut t = PrefixTrie::new(16);
+        assert!(t.is_empty());
+        assert!(t.longest_prefix(&[1, 2, 3]).is_none());
+        let s = t.stats.snapshot();
+        assert_eq!(s.misses, 1);
+        assert_eq!(s.hits, 0);
+    }
+
+    #[test]
+    fn insert_and_longest_prefix_exact_match() {
+        let mut t = PrefixTrie::new(16);
+        let tokens = vec![10u32, 20, 30, 40];
+        let kv = mk_block(1, 4);
+        assert!(t.insert(&tokens, Arc::clone(&kv)));
+        assert_eq!(t.len(), 1);
+
+        let hit = t.longest_prefix(&tokens).expect("exact match");
+        assert_eq!(hit.0, 4);
+        assert_eq!(hit.1.id, 1);
+    }
+
+    #[test]
+    fn shared_prefix_returns_longest() {
+        let mut t = PrefixTrie::new(16);
+        t.insert(&[1, 2, 3], mk_block(1, 3));
+        t.insert(&[1, 2, 3, 4, 5], mk_block(2, 5));
+        t.insert(&[1, 2, 3, 4, 5, 6, 7, 8], mk_block(3, 8));
+        assert_eq!(t.len(), 3);
+
+        let query = vec![1u32, 2, 3, 4, 5, 6, 7, 8, 9];
+        let (n_hit, kv) = t.longest_prefix(&query).expect("should hit");
+        assert_eq!(n_hit, 8);
+        assert_eq!(kv.id, 3);
+    }
+
+    #[test]
+    fn partial_prefix_match_returns_shortest_valid() {
+        let mut t = PrefixTrie::new(16);
+        t.insert(&[1, 2, 3, 4, 5], mk_block(1, 5));
+        let hit = t.longest_prefix(&[1, 2, 3, 9, 9]);
+        assert!(hit.is_none(), "partial edge should not count as a hit");
+
+        t.insert(&[1, 2, 3], mk_block(2, 3));
+        let (n_hit, kv) =
+            t.longest_prefix(&[1, 2, 3, 9, 9]).unwrap();
+        assert_eq!(n_hit, 3);
+        assert_eq!(kv.id, 2);
+    }
+
+    #[test]
+    fn no_match_returns_none() {
+        let mut t = PrefixTrie::new(16);
+        t.insert(&[100, 200, 300], mk_block(1, 3));
+        let hit = t.longest_prefix(&[1, 2, 3]);
+        assert!(hit.is_none());
+    }
+
+    #[test]
+    fn insert_beyond_max_nodes_evicts_lru() {
+        let mut t = PrefixTrie::new(2);
+        t.insert(&[1, 2, 3], mk_block(1, 3));
+        std::thread::sleep(std::time::Duration::from_millis(5));
+        t.insert(&[4, 5, 6], mk_block(2, 3));
+        std::thread::sleep(std::time::Duration::from_millis(5));
+        t.insert(&[7, 8, 9], mk_block(3, 3));
+        assert_eq!(t.len(), 2, "LRU eviction should cap at max_nodes");
+        assert!(t.longest_prefix(&[1, 2, 3]).is_none());
+        assert!(t.longest_prefix(&[4, 5, 6]).is_some());
+        assert!(t.longest_prefix(&[7, 8, 9]).is_some());
+
+        let s = t.stats.snapshot();
+        assert!(s.evictions >= 1);
+    }
+
+    #[test]
+    fn longest_prefix_refreshes_lru() {
+        let mut t = PrefixTrie::new(2);
+        t.insert(&[1, 2, 3], mk_block(1, 3));
+        std::thread::sleep(std::time::Duration::from_millis(5));
+        t.insert(&[4, 5, 6], mk_block(2, 3));
+        std::thread::sleep(std::time::Duration::from_millis(5));
+        let _ = t.longest_prefix(&[1, 2, 3]);
+        std::thread::sleep(std::time::Duration::from_millis(5));
+        t.insert(&[7, 8, 9], mk_block(3, 3));
+        assert_eq!(t.len(), 2);
+        assert!(t.longest_prefix(&[1, 2, 3]).is_some(),
+            "refreshed entry should survive");
+        assert!(t.longest_prefix(&[4, 5, 6]).is_none(),
+            "un-touched entry should be evicted");
+    }
+
+    #[test]
+    fn evict_lru_keeps_exactly_n() {
+        let mut t = PrefixTrie::new(100);
+        for i in 0..10u32 {
+            t.insert(&[i, i + 1, i + 2], mk_block(i, 3));
+            std::thread::sleep(std::time::Duration::from_millis(1));
+        }
+        assert_eq!(t.len(), 10);
+        let evicted = t.evict_lru(3);
+        assert_eq!(evicted, 7);
+        assert_eq!(t.len(), 3);
+    }
+
+    #[test]
+    fn edge_split_preserves_existing_entries() {
+        let mut t = PrefixTrie::new(16);
+        t.insert(&[1, 2, 3, 4, 5], mk_block(1,
+            5));
+        t.insert(&[1, 2, 3], mk_block(2, 3));
+        assert_eq!(t.len(), 2);
+
+        let (n, kv) = t.longest_prefix(&[1, 2, 3, 4, 5, 6]).unwrap();
+        assert_eq!(n, 5);
+        assert_eq!(kv.id, 1);
+
+        let (n, kv) = t.longest_prefix(&[1, 2, 3, 9]).unwrap();
+        assert_eq!(n, 3);
+        assert_eq!(kv.id, 2);
+    }
+
+    #[test]
+    fn cache_stats_track_hit_rate() {
+        let mut t = PrefixTrie::new(16);
+        t.insert(&[1, 2, 3], mk_block(1, 3));
+
+        let _ = t.longest_prefix(&[1, 2, 3]);
+        let _ = t.longest_prefix(&[1, 2, 3, 4]);
+        let _ = t.longest_prefix(&[99, 99]);
+
+        let s = t.stats.snapshot();
+        assert_eq!(s.hits, 2);
+        assert_eq!(s.misses, 1);
+        assert!((s.hit_rate() - 2.0 / 3.0).abs() < 1e-9);
+        assert_eq!(s.total_hit_tokens, 6);
+        assert!((s.avg_hit_tokens() - 3.0).abs() < 1e-9);
+    }
+
+    #[test]
+    fn cached_bytes_sums_all_blocks() {
+        let mut t = PrefixTrie::new(16);
+        t.insert(&[1, 2, 3], mk_block(1, 3));
+        t.insert(&[4, 5, 6, 7, 8], mk_block(2, 5));
+        assert_eq!(t.cached_bytes(), 384 + 640);
+    }
+
+    #[test]
+    fn insert_replacement_is_not_fresh() {
+        let mut t = PrefixTrie::new(16);
+        assert!(t.insert(&[1, 2, 3], mk_block(1, 3)));
+        assert!(!t.insert(&[1, 2, 3], mk_block(2, 3)));
+        let (_, kv) = t.longest_prefix(&[1, 2, 3]).unwrap();
+        assert_eq!(kv.id, 2, "second insert should replace the block");
+    }
+
+    #[test]
+    fn common_prefix_len_basic() {
+        assert_eq!(common_prefix_len(&[1, 2, 3], &[1, 2, 3]), 3);
+        assert_eq!(common_prefix_len(&[1, 2, 3], &[1, 2, 9]), 2);
+        assert_eq!(common_prefix_len(&[1, 2, 3], &[9]), 0);
+        assert_eq!(common_prefix_len(&[], &[1]), 0);
+    }
+
+    // ----------------------------------------------------------------
+    // M2-J2 — CacheConfig + byte-budget eviction
+    // ----------------------------------------------------------------
+
+    #[test]
+    fn cache_config_default_is_disabled() {
+        let cfg = CacheConfig::default();
+        assert!(!cfg.enabled);
+        assert_eq!(cfg.max_bytes, DEFAULT_MAX_CACHED_BYTES);
+        assert_eq!(cfg.max_nodes, DEFAULT_MAX_NODES);
+    }
+
+    #[test]
+    fn parse_bool_env_accepts_common_forms() {
+        for s in ["1", "true", "TRUE", "True", "yes", "on", "YeS", "On"] {
+            assert!(parse_bool_env(s), "expected true for {:?}", s);
+        }
+        for s in ["0", "false", "no", "off", "", " ", "garbage"] {
+            assert!(!parse_bool_env(s), "expected false for {:?}", s);
+        }
+    }
+
+    #[test]
+    fn from_config_applies_byte_budget() {
+        let cfg = CacheConfig { enabled: true, max_bytes: 1024, max_nodes: 100 };
+        let t = PrefixTrie::from_config(&cfg);
+        assert_eq!(t.max_cached_bytes, 1024);
+        assert_eq!(t.max_nodes, 100);
+    }
+
+    #[test]
+    fn byte_budget_evicts_on_overflow() {
+        // max_bytes = 400 — room for exactly one 384-byte block (3 tokens
+        // * 128 B mk_block), second insert must evict the first.
+        let mut t = PrefixTrie::with_byte_budget(16, 400);
+        t.insert(&[1, 2, 3], mk_block_bytes(1, 384));
+        std::thread::sleep(std::time::Duration::from_millis(5));
+        // Inserting a second 384-byte block takes us to 768 > 400 → evict.
+        t.insert(&[4, 5, 6], mk_block_bytes(2, 384));
+        assert!(t.cached_bytes() <= 400, "byte budget must be honoured");
+        assert_eq!(t.len(), 1, "byte budget should keep one entry");
+        // The survivor is the newly-inserted (freshest) one.
+        assert!(t.longest_prefix(&[4, 5, 6]).is_some());
+        assert!(t.longest_prefix(&[1, 2, 3]).is_none());
+    }
+
+    #[test]
+    fn zero_byte_budget_means_unbounded_bytes() {
+        // with_byte_budget(..., 0) behaves like `new` — only node count matters.
+        let mut t = PrefixTrie::with_byte_budget(16, 0);
+        for i in 0..10u32 {
+            t.insert(&[i, i + 1, i + 2], mk_block_bytes(i, 10_000));
+        }
+        assert_eq!(t.len(), 10);
+        assert_eq!(t.cached_bytes(), 100_000);
+    }
+}
diff --git a/chimere-server/src/server.rs b/chimere-server/src/server.rs
index 577dd48..d4351bc 100644
--- a/chimere-server/src/server.rs
+++ b/chimere-server/src/server.rs
@@ -1948,6 +1948,23 @@ async fn profile_reset() -> impl IntoResponse {
     StatusCode::NO_CONTENT
 }
 
+/// GET /v1/prefix_cache_stats — JSON snapshot of the M2 prefix cache.
+///
+/// Returns `{enabled: false, reason: ...}` when the cache is off (env gate
+/// or kill switch). When on, returns hits/misses/evictions/hit_rate plus
+/// trie length and cached bytes. Non-blocking (uses `try_read` on the trie
+/// `RwLock`); returns `{busy: true}` if write-locked at request time.
+async fn prefix_cache_stats_handler(State(state): State<Arc<AppState>>) -> impl IntoResponse {
+    let body = match state.native_scheduler.as_ref() {
+        Some(sched) => sched.prefix_cache_stats_json(),
+        None => serde_json::json!({
+            "enabled": false,
+            "reason": "native scheduler not active (CHIMERE_MULTISLOT_NATIVE != 1)"
+        }),
+    };
+    Json(body)
+}
+
 // ---------------------------------------------------------------------------
 // Router factory
 // ---------------------------------------------------------------------------
@@ -1961,5 +1978,6 @@ pub fn build_router(state: Arc<AppState>) -> Router {
         .route("/v1/status", get(status_handler))
         .route("/v1/profile", get(profile_report))
         .route("/v1/profile/reset", post(profile_reset))
+        .route("/v1/prefix_cache_stats", get(prefix_cache_stats_handler))
         .with_state(state)
 }
diff --git a/chimere-server/src/slot_scheduler.rs b/chimere-server/src/slot_scheduler.rs
index 3c0af59..d6a2c37 100644
--- a/chimere-server/src/slot_scheduler.rs
+++ b/chimere-server/src/slot_scheduler.rs
@@ -1121,6 +1121,22 @@ pub struct NativeScheduler {
     /// `/metrics` scrape and `/v1/status` handlers. `Relaxed` is sufficient
     /// — observability gauge, eventual consistency is fine.
     active_count: Arc<AtomicUsize>,
+
+    // ---------- M2-J2c — prompt-prefix cache wiring ----------
+    /// Optional prompt-prefix cache. `None` when `CHIMERE_PREFIX_CACHE=0`
+    /// (bit-identical to M1 behaviour).
+    ///
+    /// Shared behind `Arc<RwLock<PrefixTrie>>` so the `/v1/prefix_cache_stats`
+    /// endpoint (M2-J6) can read without blocking the driver, and so
+    /// multiple drivers (future: prefill vs decode split) could share
+    /// one trie without rewrites.
+    prefix_trie: Option<Arc<RwLock<PrefixTrie>>>,
+
+    /// Gate derived from `CacheConfig::from_env().enabled` at construction
+    /// time AND the presence of a `prefix_trie`. When `false`, every trie
+    /// touch is skipped even if `prefix_trie` is `Some(...)` — belt-and-
+    /// braces kill switch for rollback scenarios.
+    prefix_cache_enabled: bool,
 }
 
 impl NativeScheduler {
@@ -1136,6 +1152,46 @@
     }
 
     pub fn queue_depth_or_default(&self) -> usize { 0 }
 
+    /// M2-J6 — JSON snapshot of the prefix-cache state for `/v1/prefix_cache_stats`.
+    /// Never blocks the driver: uses `try_read` on the trie's `RwLock` and
+    /// returns `{busy: true}` if currently write-locked.
+    pub fn prefix_cache_stats_json(&self) -> serde_json::Value {
+        let Some(trie_arc) = self.prefix_trie.as_ref() else {
+            return serde_json::json!({
+                "enabled": false,
+                "reason": "prefix_trie absent (CHIMERE_PREFIX_CACHE=0 or unset)"
+            });
+        };
+        if !self.prefix_cache_enabled {
+            return serde_json::json!({
+                "enabled": false,
+                "reason": "kill switch (prefix_cache_enabled=false)"
+            });
+        }
+        match trie_arc.try_read() {
+            Ok(trie) => {
+                let snap = trie.stats.snapshot();
+                serde_json::json!({
+                    "enabled": true,
+                    "len": trie.len(),
+                    "cached_bytes": trie.cached_bytes(),
+                    "hits": snap.hits,
+                    "misses": snap.misses,
+                    "evictions": snap.evictions,
+                    "total_hit_tokens": snap.total_hit_tokens,
+                    "total_query_tokens": snap.total_query_tokens,
+                    "hit_rate": snap.hit_rate(),
+                    "avg_hit_tokens": snap.avg_hit_tokens(),
+                })
+            }
+            Err(_) => serde_json::json!({
+                "enabled": true,
+                "busy": true,
+                "reason": "trie write-locked, retry"
+            }),
+        }
+    }
+
     /// Build a native scheduler. The `LlamaForward` is NOT stored here —
     /// it is passed to `spawn_native_driver` which moves it into the
     /// dedicated driver thread.
@@ -1168,9 +1224,50 @@ impl NativeScheduler {
             default_engram_alpha,
             shutdown: Arc::new(AtomicBool::new(false)),
             active_count: Arc::new(AtomicUsize::new(0)),
+            // M2-J2c: prefix cache is off by default; install via `with_prefix_cache`.
+            prefix_trie: None,
+            prefix_cache_enabled: false,
         })
     }
 
+    /// M2-J2c — attach a prompt-prefix cache trie to this scheduler.
+    ///
+    /// Must be called BEFORE [`spawn_native_driver`](Self::spawn_native_driver),
+    /// otherwise the driver closure has already captured `prefix_trie: None`
+    /// and no subsequent change is observed.
+    ///
+    /// # Gate semantics (kill switch)
+    ///
+    /// The effective "cache on" state requires both:
+    /// 1. `trie.is_some()` (caller actually passed a trie in), AND
+    /// 2. [`CacheConfig::from_env()`](crate::prefix_cache::CacheConfig::from_env)
+    ///    reports `enabled == true` (i.e. `CHIMERE_PREFIX_CACHE=1` and
+    ///    non-zero byte/node budgets).
+    ///
+    /// When (1) is true but (2) is false, the trie is kept (so
+    /// `/v1/prefix_cache_stats` still shows the empty trie) but the hot
+    /// paths in `NativeDriver` skip every cache operation — a warning is
+    /// logged to make the discrepancy visible at review time.
+    ///
+    /// When `None` is passed, behaviour is bit-identical to M1
+    /// (no trie touch, no FFI save/restore, no gate check at all).
+    pub fn with_prefix_cache(
+        mut self,
+        trie: Option<Arc<RwLock<PrefixTrie>>>,
+    ) -> Self {
+        let cfg_enabled = crate::prefix_cache::CacheConfig::from_env().enabled;
+        let enabled = trie.is_some() && cfg_enabled;
+        if trie.is_some() && !cfg_enabled {
+            eprintln!(
+                "[slot_scheduler:native] prefix_trie provided but \
+                 CHIMERE_PREFIX_CACHE disabled — cache inert (bit-identical to M1)"
+            );
+        }
+        self.prefix_trie = trie;
+        self.prefix_cache_enabled = enabled;
+        self
+    }
+
     /// Clone the admission sender. HTTP handlers use this to enqueue
     /// `NativeScheduledRequest`.
     pub fn admission_tx(&self) -> mpsc::Sender<NativeScheduledRequest> {
@@ -1253,10 +1350,20 @@
         })
         .unwrap_or(256);
 
+        // M2-J2c — snapshot the prefix-cache handle + gate for the driver
+        // thread. Clone `Arc` (cheap, bumps a refcount). When the gate is
+        // off, `prefix_trie` may still be Some(...) for `/v1/prefix_cache_stats`
+        // readers, but the driver hot paths skip every trie touch.
+        let prefix_trie = self.prefix_trie.clone();
+        let prefix_cache_enabled = self.prefix_cache_enabled;
+
         eprintln!(
             "[slot_scheduler:native] driver spawning: num_slots={}, tick_us={}, \
-             max_prefill_chunk={}, engram_attached={}",
-            num_slots, tick_us, max_prefill_chunk, engram_global.is_some(),
+             max_prefill_chunk={}, engram_attached={}, prefix_cache_enabled={}, \
+             prefix_trie_attached={}",
+            num_slots, tick_us, max_prefill_chunk,
+            engram_global.is_some(),
+            prefix_cache_enabled, prefix_trie.is_some(),
         );
 
         let handle = std::thread::Builder::new()
@@ -1270,6 +1377,8 @@
                     max_prefill_chunk,
                     tick_us,
                     active_count,
+                    prefix_trie,
+                    prefix_cache_enabled,
                 };
                 driver.run();
             })
@@ -1292,6 +1401,17 @@ struct NativeDriver {
     /// so the `/metrics` scrape sees a live occupancy gauge rather than
     /// a hardcoded zero.
     active_count: Arc<AtomicUsize>,
+
+    // ---------- M2-J2c — prompt-prefix cache ----------
+    /// Optional shared trie. Reads/writes go through an `RwLock` so the
+    /// stats endpoint (M2-J6) and the driver can coexist without lock
+    /// contention on the hot read path.
+    prefix_trie: Option<Arc<RwLock<PrefixTrie>>>,
+    /// Gate: when `false`, every cache touch is skipped. Equivalent to
+    /// `prefix_trie.is_none()` at steady state, but kept separately so
+    /// operators can run with a live trie but inert cache during rollback
+    /// smoke-tests (belt-and-braces).
+    prefix_cache_enabled: bool,
 }
 
 impl NativeDriver {
@@ -1387,6 +1507,33 @@
     /// Seat a new request in the first free slot. Caller MUST have
     /// verified a free slot is available via `alloc_free().is_some()`.
+    ///
+    /// # M2-J2c — prompt-prefix cache admission
+    ///
+    /// When the prefix cache is enabled (both `self.prefix_cache_enabled`
+    /// and `self.prefix_trie.is_some()`), this method:
+    ///
+    /// 1. Looks up the longest prefix of the new request's `prompt_tokens`
+    ///    already present in the trie (`longest_prefix` under a short
+    ///    write-lock — the call updates `last_hit` for LRU).
+    /// 2. On hit (`n_hit > 0`), calls
+    ///    [`LlamaForward::restore_seq_state`] + [`set_pos`] to rehydrate
+    ///    the KV/GDN state into the slot's `seq_id`. The blob is
+    ///    seq_id-independent (see M2-J2b doc).
+    /// 3. Computes `chunks_done = n_hit / max_prefill_chunk` and
+    ///    `rounded_skip = chunks_done * max_prefill_chunk`. If the hit
+    ///    covers less than one full chunk (`rounded_skip == 0`), the
+    ///    restore is REVERTED (clear seq, reset pos) because ik_llama has
+    ///    no `kv_cache_seq_rm(p0, p1)` for partial-chunk trims. Safe
+    ///    default = cold start. Otherwise prefill resumes at the chunk
+    ///    boundary `rounded_skip`.
+    /// 4. On any FFI error, clears the seq and falls back to cold.
+    /// 5. The full `prompt_tokens` are pushed into `recent_context`
+    ///    regardless of hit/miss, so the engram n-gram lookup behaves
+    ///    identically on warm paths (per M2 plan § 5).
+    ///
+    /// Kill-switch: when `self.prefix_cache_enabled == false` (or the
+    /// trie is None), this method reduces to the original M1 body —
+    /// no RwLock touch, no FFI save/restore, no log noise.
     fn seat_request(&mut self, req: NativeScheduledRequest) {
         // Defense: an empty prompt would cause `end-start-1` to underflow
         // in `tick_prefill_one`. Reject loudly, don't silently hang.
@@ -1399,43 +1546,174 @@
             });
             return;
         }
-        let free = match self.pool.alloc_free() {
+
+        // Fill request-scoped fields first, inside a scoped block so the
+        // `&mut Slot` borrow on `self.pool` ends before we reach the FFI
+        // calls below. We DELIBERATELY do not touch `state` / `pos` here —
+        // those may be adjusted by the prefix-cache hit path.
+        let (slot_id, prompt_tokens_clone): (u32, Vec<u32>) = {
+            let free = match self.pool.alloc_free() {
+                Some(s) => s,
+                None => {
+                    // Shouldn't happen given the caller's precondition, but
+                    // defend by sending an Error and moving on.
+                    let _ = req.tx.try_send(StreamMsg::Error {
+                        message: "No free slot after admission — race condition".to_string(),
+                    });
+                    return;
+                }
+            };
+            eprintln!(
+                "[slot_scheduler:native] seat req={} on slot {} (prompt={} toks, max={}, wait_ms={})",
+                req.request_id,
+                free.id,
+                req.prompt_tokens.len(),
+                req.params.max_tokens,
+                req.enqueued_at.elapsed().as_millis(),
+            );
+            // Reset slot to a clean state (defensive — mark_free was called
+            // when the previous tenant vacated).
+            free.mark_free();
+            free.prompt_tokens = req.prompt_tokens;
+            free.params = req.params;
+            free.engram_alpha = req.engram_alpha;
+            free.tx = Some(req.tx);
+            free.want_logprobs = req.want_logprobs;
+            free.request_id = req.request_id;
+            free.cancelled = req.cancelled;
+            free.thinking = free.params.enable_thinking;
+            free.stats.prompt_tokens = free.prompt_tokens.len() as u32;
+
+            let slot_id = free.id;
+            let prompt_tokens_clone = free.prompt_tokens.clone();
+            (slot_id, prompt_tokens_clone)
+            // `free: &mut Slot` dropped here → `self.pool` free again.
+        };
+
+        // -----------------------------------------------------------------
+        // M2-J2c — prefix-cache lookup + restore (gated).
+        //
+        // Outputs of this block:
+        //   * `prefill_skip: usize` = number of prompt tokens already in
+        //     the KV cache after restore. 0 = cold start.
+        //   * On FFI/alignment failure we clear the seq, reset pos, and
+        //     leave `prefill_skip = 0`.
+        // -----------------------------------------------------------------
+        let mut prefill_skip: usize = 0;
+
+        if self.prefix_cache_enabled {
+            if let Some(trie_arc) = self.prefix_trie.as_ref() {
+                // 1. Look up the longest matching prefix under a short
+                //    write-lock (`longest_prefix` updates `last_hit`).
+                //    We intentionally drop the lock BEFORE FFI — the
+                //    restore call is ~ms-scale and must not block stats
+                //    readers.
+                let lookup = match trie_arc.write() {
+                    Ok(mut g) => g.longest_prefix(&prompt_tokens_clone),
+                    Err(poisoned) => {
+                        eprintln!(
+                            "[prefix_cache] trie RwLock poisoned — falling back to cold: {}",
+                            poisoned,
+                        );
+                        None
+                    }
+                };
+
+                // 2. On hit, do the restore + alignment dance.
+                if let Some((n_hit, kv)) = lookup {
+                    if n_hit > 0 {
+                        // Attempt FFI restore. On failure, clear and cold.
+                        let restore_ok = match self.llama.restore_seq_state(
+                            slot_id as i32,
+                            &kv.seq_bytes,
+                        ) {
+                            Ok(()) => true,
+                            Err(e) => {
+                                eprintln!(
+                                    "[prefix_cache] restore_seq_state failed for slot={} n_hit={}: {} \
+                                     — falling back to cold",
+                                    slot_id, n_hit, e,
+                                );
+                                // Paranoia: clear any partial state.
+                                let _ = self.llama.kv_cache_seq_rm_for(slot_id as i32);
+                                false
+                            }
+                        };
+
+                        if restore_ok {
+                            // Chunk alignment — ik_llama's fork has no
+                            // `kv_cache_seq_rm(seq, p0, p1)` API; we can
+                            // only start `tick_prefill_one` at chunk
+                            // boundaries.
+                            let chunks_done = n_hit / self.max_prefill_chunk;
+                            let rounded_skip = chunks_done * self.max_prefill_chunk;
+
+                            if rounded_skip == 0 {
+                                // Hit covers less than one chunk → not
+                                // worth the restore; bail to cold.
+                                eprintln!(
+                                    "[prefix_cache] hit slot={} n_hit={}/{} < max_chunk={} \
+                                     — sub-chunk hit, clearing and starting cold",
+                                    slot_id, n_hit, prompt_tokens_clone.len(),
+                                    self.max_prefill_chunk,
+                                );
+                                let _ = self.llama.kv_cache_seq_rm_for(slot_id as i32);
+                                self.llama.set_pos(0);
+                                prefill_skip = 0;
+                            } else {
+                                self.llama.set_pos(rounded_skip as i32);
+                                prefill_skip = rounded_skip;
+                                eprintln!(
+                                    "[prefix_cache] hit slot={} n_hit={}/{} chunks_done={} \
+                                     rounded_skip={} kv_id={} kv_bytes={}",
+                                    slot_id, n_hit, prompt_tokens_clone.len(),
+                                    chunks_done, rounded_skip,
+                                    kv.id, kv.byte_size(),
+                                );
+                            }
+                        }
+                    } else {
+                        // n_hit == 0 (empty-root hit) — treat as miss.
+                        eprintln!(
+                            "[prefix_cache] miss slot={} prompt_len={} (empty-root hit ignored)",
+                            slot_id, prompt_tokens_clone.len(),
+                        );
+                    }
+                } else {
+                    eprintln!(
+                        "[prefix_cache] miss slot={} prompt_len={}",
+                        slot_id, prompt_tokens_clone.len(),
+                    );
+                }
+            }
+        }
+
+        // -----------------------------------------------------------------
+        // Re-acquire the slot and seat the request. `chunks_done` =
+        // number of prefill chunks ALREADY consumed by the cache restore.
+        // `tick_prefill_one` will start slicing at
+        // `start = chunks_done * max_prefill_chunk`, naturally skipping
+        // the cached tokens and emitting only the fresh tail.
+        // -----------------------------------------------------------------
+        let chunks_done = prefill_skip / self.max_prefill_chunk;
+        let free = match self.pool.get_mut(slot_id) {
             Some(s) => s,
             None => {
-                // Shouldn't happen given the caller's precondition, but
-                // defend by sending an Error and moving on.
-                let _ = req.tx.try_send(StreamMsg::Error {
-                    message: "No free slot after admission — race condition".to_string(),
-                });
+                // Shouldn't happen — we just mark_free'd this slot.
+                eprintln!(
+                    "[slot_scheduler:native] BUG: slot {} vanished between \
+                     alloc_free and seat_request completion",
+                    slot_id,
+                );
                 return;
             }
         };
-        eprintln!(
-            "[slot_scheduler:native] seat req={} on slot {} (prompt={} toks, max={}, wait_ms={})",
-            req.request_id,
-            free.id,
-            req.prompt_tokens.len(),
-            req.params.max_tokens,
-            req.enqueued_at.elapsed().as_millis(),
-        );
-        // Reset slot to a clean state (defensive — mark_free was called
-        // when the previous tenant vacated).
-        free.mark_free();
-        free.state = SlotState::Prefilling { chunks_done: 0 };
-        free.pos = 0;
-        free.prompt_tokens = req.prompt_tokens;
-        free.params = req.params;
-        free.engram_alpha = req.engram_alpha;
-        free.tx = Some(req.tx);
-        free.want_logprobs = req.want_logprobs;
-        free.request_id = req.request_id;
-        free.cancelled = req.cancelled;
-        free.thinking = free.params.enable_thinking;
-        free.stats.prompt_tokens = free.prompt_tokens.len() as u32;
-        // Seed the recent_context with the prompt tail for engram lookups.
-        // We push the whole prompt so the first gen step's engram query
-        // has full context; Slot::push_context bounds the window to 256.
-        let prompt_tokens_clone = free.prompt_tokens.clone();
+        free.state = SlotState::Prefilling { chunks_done };
+        free.pos = prefill_skip as i32;
+
+        // Engram seed: push the FULL prompt regardless of hit/miss so
+        // the first gen step's n-gram query has identical context in
+        // warm and cold paths (M2 plan § 5).
         for t in prompt_tokens_clone {
             free.push_context(t);
         }
@@ -1729,6 +2007,25 @@
     /// Reap every Draining slot: emit exactly one Done frame, release KV
     /// pages for the seq_id, mark the slot Free.
+    ///
+    /// # M2-J2c — prompt-prefix cache insertion
+    ///
+    /// For each Draining slot, BEFORE `kv_cache_seq_rm_for`, this method
+    /// may snapshot the slot's KV/GDN state via
+    /// [`LlamaForward::save_seq_state`] and insert it into the trie,
+    /// keyed on the slot's full `prompt_tokens`. Rules:
+    ///
+    /// - Gated on `self.prefix_cache_enabled && self.prefix_trie.is_some()`.
+    /// - `finish_reason` must be one of `{stop, length, cancel}`. Reason
+    ///   `error` slots are NOT cached (KV is unreliable on error paths).
+    /// - `stats.generated_tokens > 0` (slot actually reached Generating).
+    /// - `prompt_tokens` non-empty.
+    /// - On non-empty `Ok(blob)`: acquire `trie.write()`, allocate
+    ///   `next_kv_id()`, wrap in `Arc`, call `insert(...)`.
+    /// - On empty blob / `Err` / poisoned lock: log and skip (never panic).
+    ///
+    /// The save MUST run before `kv_cache_seq_rm_for` — reversing the
+    /// order captures an empty blob.
     fn reap_draining(&mut self) {
         // Collect ids first to avoid borrow conflicts with llama kv_cache call.
         let draining_ids: Vec<(u32, String)> = self
@@ -1751,6 +2048,99 @@
                 finish_reason: reason.clone(),
             });
         }
+
+        // --------------------------------------------------------------
+        // M2-J2c — optionally snapshot KV + insert into prefix trie.
+        // Runs BEFORE `kv_cache_seq_rm_for` so the FFI blob is valid.
+        // --------------------------------------------------------------
+        if self.prefix_cache_enabled && self.prefix_trie.is_some() {
+            // Gate: only save for successful / natural terminations.
+            // Error-path KV is unreliable and must not poison the cache.
+            let cache_decision: Option<Vec<u32>> = {
+                let slot = self.pool.get_mut(slot_id);
+                match slot {
+                    Some(s) => {
+                        let reason_ok = matches!(
+                            reason.as_str(),
+                            "stop" | "length" | "cancel"
+                        );
+                        let generated_ok = s.stats.generated_tokens > 0;
+                        let prompt_ok = !s.prompt_tokens.is_empty();
+                        if reason_ok && generated_ok && prompt_ok {
+                            Some(s.prompt_tokens.clone())
+                        } else {
+                            if !reason_ok {
+                                eprintln!(
+                                    "[prefix_cache] skip save slot={} reason={} \
+                                     (only stop/length/cancel cached)",
+                                    slot_id, reason,
+                                );
+                            } else if !generated_ok {
+                                eprintln!(
+                                    "[prefix_cache] skip save slot={} generated_tokens=0 \
+                                     (slot never reached Generating)",
+                                    slot_id,
+                                );
+                            }
+                            None
+                        }
+                    }
+                    None => None,
+                }
+            };
+
+            if let Some(prompt_tokens) = cache_decision {
+                // FFI save (must precede kv_cache_seq_rm_for).
+                match self.llama.save_seq_state(slot_id as i32) {
+                    Ok(blob) => {
+                        if blob.is_empty() {
+                            eprintln!(
+                                "[prefix_cache] save_seq_state returned empty blob for \
+                                 slot={} (FFI says 0 bytes) — skipping insert",
+                                slot_id,
+                            );
+                        } else if let Some(trie_arc) = self.prefix_trie.as_ref() {
+                            let bytes = blob.len();
+                            let n_toks = prompt_tokens.len();
+                            match trie_arc.write() {
+                                Ok(mut g) => {
+                                    let kv_id = g.next_kv_id();
+                                    let block = std::sync::Arc::new(
+                                        crate::prefix_cache::KVBlock::new(
+                                            kv_id, blob, n_toks,
+                                        ),
+                                    );
+                                    let fresh = g.insert(&prompt_tokens, block);
+                                    let trie_len = g.len();
+                                    let cached_bytes = g.cached_bytes();
+                                    eprintln!(
+                                        "[prefix_cache] insert slot={} kv_id={} n_toks={} \
+                                         bytes={} fresh={} trie_len={} cached_bytes={}",
+                                        slot_id, kv_id, n_toks, bytes, fresh,
+                                        trie_len, cached_bytes,
+                                    );
+                                }
+                                Err(e) => {
+                                    eprintln!(
+                                        "[prefix_cache] trie RwLock poisoned on insert \
+                                         slot={}: {} — skipping",
+                                        slot_id, e,
+                                    );
+                                }
+                            }
+                        }
+                    }
+                    Err(e) => {
+                        eprintln!(
+                            "[prefix_cache] save_seq_state failed for slot={}: {} \
+                             — skipping insert",
+                            slot_id, e,
+                        );
+                    }
+                }
+            }
+        }
+
         // Free KV/SSM state for this seq_id.
         let _ = self.llama.kv_cache_seq_rm_for(slot_id as i32);
         // Mark slot free (also clears sampler state + recent_context).
@@ -2089,4 +2479,45 @@ mod tests {
         assert_eq!(argmax_u32(&[]), 0);
         assert_eq!(argmax_u32(&[0.1, 0.2, 0.9, 0.3]), 2);
     }
+
+    // ----------------------------------------------------------------
+    // M2-J2c — prefix-cache wiring invariants (unit, no FFI)
+    // ----------------------------------------------------------------
+
+    #[test]
+    fn m2_j2c_with_prefix_cache_none_is_inert() {
+        // Passing `None` must leave the scheduler in the M1 state —
+        // `prefix_trie = None`, `prefix_cache_enabled = false` — so
+        // the driver hot paths short-circuit without any RwLock touch.
+        let cfg = SchedulerConfig {
+            num_slots: 2, queue_cap: 4, enabled: true, native: true,
+        };
+        let sched = NativeScheduler::new(cfg, None, 0.0)
+            .unwrap()
+            .with_prefix_cache(None);
+        assert!(sched.prefix_trie.is_none());
+        assert!(!sched.prefix_cache_enabled);
+    }
+
+    #[test]
+    fn m2_j2c_with_prefix_cache_some_but_env_off_logs_and_disables() {
+        // Passing `Some(trie)` when CHIMERE_PREFIX_CACHE is NOT set
+        // (the default in tests) keeps the trie handle (so stats
+        // endpoint can still read it) but sets `prefix_cache_enabled`
+        // to false — the driver observes the gate and skips.
+        std::env::remove_var("CHIMERE_PREFIX_CACHE");
+        let cfg = SchedulerConfig {
+            num_slots: 2, queue_cap: 4, enabled: true, native: true,
+        };
+        let trie = std::sync::Arc::new(std::sync::RwLock::new(
+            crate::prefix_cache::PrefixTrie::new(16),
+        ));
+        let sched = NativeScheduler::new(cfg, None, 0.0)
+            .unwrap()
+            .with_prefix_cache(Some(trie));
+        assert!(sched.prefix_trie.is_some(),
+            "trie handle should be retained for stats readers");
+        assert!(!sched.prefix_cache_enabled,
+            "gate must be off when CHIMERE_PREFIX_CACHE unset");
+    }
 }
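The trie semantics the J1 tests pin down are easiest to sanity-check against a dependency-free model. The sketch below is illustrative only — `Node`, `insert`, and `longest_prefix` are hypothetical stand-ins for the patch's `TrieNode` / `PrefixTrie::{insert, longest_prefix}`, with a bare `u32` id replacing the `Arc<KVBlock>` payload and no LRU or stats bookkeeping:

```rust
use std::collections::HashMap;

/// Simplified radix-trie node: the edge label is a compressed run of
/// token IDs; children are keyed by the first token of their edge.
#[derive(Default)]
struct Node {
    edge: Vec<u32>,
    children: HashMap<u32, Node>,
    /// Payload id standing in for the cached KV handle.
    value: Option<u32>,
}

fn common_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

fn insert(node: &mut Node, tokens: &[u32], id: u32) {
    if tokens.is_empty() {
        node.value = Some(id);
        return;
    }
    let first = tokens[0];
    if let Some(child) = node.children.get_mut(&first) {
        let common = common_prefix_len(tokens, &child.edge);
        if common == child.edge.len() {
            return insert(child, &tokens[common..], id);
        }
        // Divergence inside the edge: split it at `common`, exactly as
        // `insert_rec` does with its `intermediate` node.
        let mut old = node.children.remove(&first).unwrap();
        let tail: Vec<u32> = old.edge[common..].to_vec();
        old.edge = tail.clone();
        let mut mid = Node { edge: tokens[..common].to_vec(), ..Default::default() };
        mid.children.insert(tail[0], old);
        insert(&mut mid, &tokens[common..], id);
        node.children.insert(first, mid);
    } else {
        let leaf = Node { edge: tokens.to_vec(), value: Some(id), ..Default::default() };
        node.children.insert(first, leaf);
    }
}

/// Longest cached prefix of `tokens`: returns `(n_hit, id)`, deepest
/// value-bearing node wins; a partial edge match does not count.
fn longest_prefix(node: &Node, tokens: &[u32], depth: usize) -> Option<(usize, u32)> {
    let here = node.value.map(|id| (depth, id));
    if let Some(&first) = tokens.first() {
        if let Some(child) = node.children.get(&first) {
            let common = common_prefix_len(tokens, &child.edge);
            if common == child.edge.len() {
                if let Some(deeper) = longest_prefix(child, &tokens[common..], depth + common) {
                    return Some(deeper);
                }
            }
        }
    }
    here
}
```

Keying children on the first token of each compressed edge is what makes the exact `longest_prefix` a single O(|prompt|) walk: at every node one HashMap probe picks the only candidate edge, with no second-pass block walk as in fixed-block hashing.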