docs/turboquant-enhancements.md (new file, +230 lines)
# TurboQuant+ Enhancements: Layer-Adaptive, Beta Codebook, Temporal Decay

This document describes three improvements to the TurboQuant KV cache compression implementation, applying findings from the TurboQuant paper (ICLR 2026) and extended experiments.

## Overview

| Enhancement | File | Status | Tests |
|-------------|------|--------|-------|
| Layer-Adaptive Compressor | `turboquant/layer_adaptive.py` | Complete | 16 |
| Beta Distribution Codebook | `turboquant/codebook.py` | Complete | 22 |
| Temporal Decay Compressor | `turboquant/temporal_decay.py` | Complete (Python) | 22 |

Total: 60 new tests, all passing. Original 141 tests unaffected.

---

## 1. Layer-Adaptive Compressor

**File:** `turboquant/layer_adaptive.py`

### Problem

Uniform bit-width across all layers wastes precision. The last ~20% of transformer layers are responsible for nearly all quality loss under aggressive quantization (validated on Qwen 3.5 35B-A3B: layers 32-39 of 40 cause ~100% of PPL degradation).

### Solution

`LayerAdaptiveCompressor` assigns different bit-widths per layer, using aggressive compression on early (insensitive) layers and higher precision on late (sensitive) layers.

### API

```python
from turboquant import LayerAdaptiveCompressor
from turboquant.layer_adaptive import make_layer_config, default_40layer_config

# Preset: 40-layer model, Mode 2
# Layers 0-31: 3-bit TurboQuant, Layers 32-39: 8-bit
config = default_40layer_config()
compressor = LayerAdaptiveCompressor(head_dim=128, layers_config=config)

# Custom config for any model size
config = make_layer_config(
    total_layers=64,   # e.g., Llama 3 70B
    default_bits=3,    # aggressive for early layers
    high_bits=8,       # high fidelity for late layers
    high_frac=0.2,     # last 20% get high_bits
)
compressor = LayerAdaptiveCompressor(head_dim=128, layers_config=config)

# Compress KV cache (shape: [num_layers, num_heads, seq_len, head_dim])
compressed = compressor.compress(k_cache, v_cache)

# Decompress
k_hat, v_hat = compressor.decompress(compressed)

# Statistics
ratio = compressor.effective_compression_ratio() # ~3.5x effective
bits = compressor.effective_bits_per_value() # ~4.0 average
summary = compressed.layer_summary() # per-layer breakdown
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `head_dim` | `int` | required | Attention head dimension |
| `layers_config` | `dict[int, int]` | required | Layer index -> bit-width mapping |
| `v_bits_override` | `dict[int, int] \| None` | `None` | Separate V cache bit-widths (if different from K) |
| `seed` | `int` | `42` | Random seed for rotation matrices |
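For reference, the preset's layer assignment can be sketched as follows. This is a hypothetical re-implementation of `make_layer_config`, written only to illustrate the behavior implied by the defaults above, not the shipped code:

```python
def make_layer_config(total_layers: int, default_bits: int = 3,
                      high_bits: int = 8,
                      high_frac: float = 0.2) -> dict[int, int]:
    # Last `high_frac` of layers get `high_bits`; the rest get
    # `default_bits`, matching the layer index -> bit-width mapping
    # expected by LayerAdaptiveCompressor.
    cutoff = int(round(total_layers * (1.0 - high_frac)))
    return {i: default_bits if i < cutoff else high_bits
            for i in range(total_layers)}
```

With `total_layers=40` this yields 3-bit for layers 0-31 and 8-bit for layers 32-39, i.e. the same mapping as `default_40layer_config()`.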

### Expected Results

| Configuration | Effective Compression | PPL (wikitext-2) | vs q8_0 |
|---------------|----------------------|-------------------|---------|
| Uniform turbo3 (3-bit) | 4.6x | 5.460 | +0.8% |
| **Mode 2 (3-bit + q8_0 last 20%)** | **3.5x** | **5.422** | **+0.14%** |
| Uniform q8_0 (8-bit) | 2.0x | 5.414 | baseline |

Mode 2 achieves near-q8_0 quality at 3.5x compression — the best quality/compression trade-off.
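The ~3.5x figure can be sanity-checked with a weighted harmonic mean of the per-layer ratios (bytes add, so ratios combine harmonically). This is a back-of-envelope sketch; the residual gap versus the table is presumably per-block metadata not modeled here:

```python
# 80% of layers at 3-bit (4.6x) and 20% at q8_0 (2.0x).
frac_low, ratio_low = 0.8, 4.6
frac_high, ratio_high = 0.2, 2.0
effective = 1.0 / (frac_low / ratio_low + frac_high / ratio_high)
print(f"{effective:.2f}x")  # roughly 3.6x, close to the reported ~3.5x
```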

---

## 2. Beta Distribution Codebook

**File:** `turboquant/codebook.py` (enhanced)

### Problem

After random rotation, each coordinate follows a Beta(d/2, d/2) distribution (supported on [-1/sqrt(d), 1/sqrt(d)]), which converges to N(0, 1/d) for large d. The existing codebook uses the Gaussian approximation for all dimensions, which is suboptimal for d < 256.

### Solution

Added `_lloyds_beta()` that runs Lloyd's algorithm on the true Beta(d/2, d/2) distribution instead of the Gaussian approximation. The `compute_centroids()` function gains a `use_beta` parameter.

### API

```python
from turboquant.codebook import compute_centroids

# Gaussian approximation (existing, default)
centroids = compute_centroids(bits=3, d=128)

# Beta distribution (new, tighter for small d)
centroids = compute_centroids(bits=3, d=128, use_beta=True)
```

### When to Use

- **d < 256**: Beta codebook gives measurably tighter MSE (up to ~0.5% improvement)
- **d >= 256**: Beta and Gaussian produce near-identical codebooks (use default for speed)
- **bit_width < 3**: Closed-form centroids are used regardless (1-bit and 2-bit have exact solutions)
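Putting the three bullets together, the effective dispatch rule looks like this (a sketch implied by the test suite, not the actual `compute_centroids` internals):

```python
def codebook_family(bits: int, d: int, use_beta: bool = False) -> str:
    # Which centroid construction applies for a given configuration.
    if bits < 3:
        return "closed-form"      # 1-bit and 2-bit have exact solutions
    if use_beta and d < 256:
        return "beta-lloyd"       # Lloyd's on the true Beta(d/2, d/2)
    return "gaussian-lloyd"       # N(0, 1/d) approximation
```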

### Technical Details

The Beta codebook uses `scipy.stats.beta` for PDF evaluation and a specialized conditional expectation function for centroid updates:

```
E[X | a < X < b] for X ~ Beta(d/2, d/2)
```

This is computed via the incomplete beta function identity, which is more numerically stable than sampling.
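A minimal sketch of that identity, using `scipy.special.betainc` (the regularized incomplete beta function). The function name mirrors the helper in the test suite, but this implementation is illustrative rather than the shipped one:

```python
from scipy.special import betainc

def beta_conditional_expectation(alpha: float, beta: float,
                                 a: float, b: float) -> float:
    """E[X | a < X < b] for X ~ Beta(alpha, beta) on [0, 1].

    Uses the identity
        integral_a^b x f(x) dx
          = alpha/(alpha+beta) * [I_b(alpha+1, beta) - I_a(alpha+1, beta)],
    where I_x is the regularized incomplete beta function.
    """
    mass = betainc(alpha, beta, b) - betainc(alpha, beta, a)
    if mass <= 0.0:
        # Probability underflow deep in a tail: fall back to the midpoint.
        return 0.5 * (a + b)
    partial = betainc(alpha + 1.0, beta, b) - betainc(alpha + 1.0, beta, a)
    return alpha / (alpha + beta) * partial / mass
```

Over the full interval [0, 1] this reduces to the distribution mean alpha/(alpha+beta), which is a convenient correctness check.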

---

## 3. Temporal Decay Compressor

**File:** `turboquant/temporal_decay.py`

### Problem

All tokens in the KV cache are stored at the same precision, but older tokens contribute less to attention. At long context (32K+), most of the cache holds tokens that are rarely attended to.

### Solution

`TemporalDecayCompressor` maps token age to bit-width: recent tokens get higher precision, old tokens get lower precision. With optional layer-awareness, early layers (which are less sensitive) decay faster.

### API

```python
from turboquant import TemporalDecayCompressor, TemporalDecayConfig

config = TemporalDecayConfig(
    recent_bits=3,        # 3-bit for tokens younger than threshold
    old_bits=2,           # 2-bit for tokens older than threshold
    decay_threshold=256,  # age boundary (in token steps)
    layer_aware=True,     # early layers decay faster
)

tdc = TemporalDecayCompressor(head_dim=128, config=config)

# Query bit-width for a specific token
bits = tdc.get_bits_for_token(age=300, layer=5, total_layers=40)

# Compress with age-awareness
result = tdc.compress_with_decay(
    keys,        # [num_heads, seq_len, head_dim]
    values,      # [num_heads, seq_len, head_dim]
    token_ages,  # [seq_len]: age in steps for each token
    layer_idx=5,
    total_layers=40,
)

# Decompress
k_hat, v_hat = tdc.decompress_with_decay(result)

# Estimate savings
savings = tdc.memory_savings_estimate()
```

### Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `recent_bits` | `3` | Bit-width for recent tokens |
| `old_bits` | `2` | Bit-width for old tokens |
| `decay_threshold` | `256` | Token age at which precision drops |
| `layer_aware` | `True` | Early layers decay faster than late layers |

### Layer-Aware Behavior

When `layer_aware=True`:
- **Late layers (last 20%):** Always use `recent_bits`, regardless of token age
- **Early layers (first 80%):** Threshold scales linearly with layer position
- Layer 0: decays at `threshold * 0.5` (aggressive)
- Layer at 80% cutoff: decays at full `threshold`

This reflects the finding that late layers are quality-sensitive and should always keep high precision.
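A minimal sketch of the rule above (the function name and exact scaling are assumptions based on this description, not the shipped `get_bits_for_token` implementation):

```python
def bits_for_token(age: int, layer: int, total_layers: int,
                   recent_bits: int = 3, old_bits: int = 2,
                   threshold: int = 256) -> int:
    # Late layers (last 20%) always keep high precision.
    cutoff = 0.8 * total_layers
    if layer >= cutoff:
        return recent_bits
    # Early layers: effective threshold scales linearly from
    # 0.5 * threshold at layer 0 up to the full threshold at the cutoff.
    eff_threshold = threshold * (0.5 + 0.5 * layer / cutoff)
    return recent_bits if age < eff_threshold else old_bits
```

For example, `bits_for_token(age=300, layer=39, total_layers=40)` returns 3 (late layer, always high precision), while `bits_for_token(age=300, layer=0, total_layers=40)` returns 2, since layer 0's effective threshold is only 128.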

### Expected Savings

| Context Length | Uniform 3-bit | With Temporal Decay | Additional Savings |
|---------------|---------------|--------------------|--------------------|
| 4K | 4.6x | ~4.8x | ~4% |
| 16K | 4.6x | ~5.5x | ~20% |
| 32K | 4.6x | ~6.2x | ~35% |
| 128K | 4.6x | ~7.0x | ~52% |

Savings increase with context length because a larger fraction of tokens are "old" at any given time.
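This scaling can be illustrated with a simple average-bits calculation. The sketch below ignores layer-awareness and metadata overhead, so its numbers are rougher than the table above:

```python
def avg_bits(seq_len: int, threshold: int = 256,
             recent_bits: int = 3, old_bits: int = 2) -> float:
    # Average payload bits per value when only the newest `threshold`
    # tokens keep `recent_bits` and everything older drops to `old_bits`.
    recent = min(seq_len, threshold)
    old = seq_len - recent
    return (recent * recent_bits + old * old_bits) / seq_len

# At 4K the average is still well above old_bits; by 128K it is
# nearly 2.0, which is why the effective ratio climbs with context.
for n in (4096, 16384, 32768, 131072):
    print(n, avg_bits(n))
```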

### Status

Python logic is complete and tested. llama.cpp C integration is blocked on:
- `turbo2` block type not yet implemented in the C port
- `llama_kv_cache::update()` hook needed for token age tracking

---

## Running Tests

```bash
cd /path/to/turboquant_plus-main

# All tests (201 total)
python -m pytest tests/ -v

# Just new enhancement tests
python -m pytest tests/test_layer_adaptive.py tests/test_codebook_beta.py tests/test_temporal_decay.py -v
```

## Files Changed

### New Files
- `turboquant/layer_adaptive.py` — Layer-adaptive compressor (~180 lines)
- `turboquant/temporal_decay.py` — Temporal decay compressor (~160 lines)
- `tests/test_layer_adaptive.py` — 16 tests
- `tests/test_codebook_beta.py` — 22 tests
- `tests/test_temporal_decay.py` — 22 tests

### Modified Files
- `turboquant/codebook.py` — Added `_lloyds_beta()`, `use_beta` parameter
- `turboquant/__init__.py` — Exports: `LayerAdaptiveCompressor`, `TemporalDecayCompressor`, `TemporalDecayConfig`
tests/test_codebook_beta.py (new file, +163 lines)
"""Tests for Beta distribution codebook enhancement."""

import numpy as np
import pytest

from turboquant.codebook import (
    compute_centroids,
    optimal_centroids,
    _lloyds_beta,
    _beta_conditional_expectation,
)
from scipy import stats


class TestBetaConditionalExpectation:
    """Test the E[X | a < X < b] helper for Beta distributions."""

    def test_full_range_equals_mean(self):
        """E[X | 0 < X < 1] should equal the distribution mean."""
        rv = stats.beta(3.0, 3.0)
        result = _beta_conditional_expectation(rv, 0.0, 1.0)
        np.testing.assert_allclose(result, 0.5, rtol=1e-6)

    def test_upper_half(self):
        """E[X | 0.5 < X < 1.0] for symmetric Beta should be > 0.5."""
        rv = stats.beta(5.0, 5.0)
        result = _beta_conditional_expectation(rv, 0.5, 1.0)
        assert result > 0.5
        assert result < 1.0

    def test_lower_half(self):
        """E[X | 0 < X < 0.5] for symmetric Beta should be < 0.5."""
        rv = stats.beta(5.0, 5.0)
        result = _beta_conditional_expectation(rv, 0.0, 0.5)
        assert result < 0.5
        assert result > 0.0

    def test_symmetric_halves(self):
        """For symmetric Beta, E[X|X<0.5] + E[X|X>0.5] should equal 1.0."""
        rv = stats.beta(10.0, 10.0)
        low = _beta_conditional_expectation(rv, 0.0, 0.5)
        high = _beta_conditional_expectation(rv, 0.5, 1.0)
        np.testing.assert_allclose(low + high, 1.0, rtol=1e-6)

    def test_narrow_interval(self):
        """Narrow interval conditional mean should be near the midpoint."""
        rv = stats.beta(5.0, 5.0)
        result = _beta_conditional_expectation(rv, 0.49, 0.51)
        np.testing.assert_allclose(result, 0.5, atol=0.02)

    def test_extreme_interval_fallback(self):
        """Extremely narrow interval far from the mass should use the fallback."""
        rv = stats.beta(50.0, 50.0)
        # Very far in the tail - probability underflows
        result = _beta_conditional_expectation(rv, 0.99, 1.0)
        assert np.isfinite(result)

    def test_asymmetric_beta(self):
        """Should work with asymmetric Beta as well."""
        rv = stats.beta(2.0, 5.0)
        result = _beta_conditional_expectation(rv, 0.0, 1.0)
        expected_mean = 2.0 / (2.0 + 5.0)
        np.testing.assert_allclose(result, expected_mean, rtol=1e-6)


class TestLloydsBeta:
    """Test Lloyd's algorithm with the Beta distribution."""

    def test_correct_count(self):
        """Should produce 2^b centroids."""
        for b in [3, 4]:
            n = 1 << b
            centroids = _lloyds_beta(n, d=64)
            assert len(centroids) == n

    def test_centroids_sorted(self):
        """Centroids should be sorted ascending."""
        centroids = _lloyds_beta(8, d=64)
        assert np.all(np.diff(centroids) > 0)

    def test_centroids_centered(self):
        """Centroids should be roughly centered around 0."""
        centroids = _lloyds_beta(8, d=64)
        assert abs(np.mean(centroids)) < 0.01

    def test_centroids_symmetric(self):
        """For symmetric Beta(d/2, d/2), centroids should be symmetric around 0."""
        centroids = _lloyds_beta(8, d=128)
        np.testing.assert_allclose(centroids, -centroids[::-1], atol=1e-6)

    def test_centroids_within_range(self):
        """All centroids should be within [-1/sqrt(d), 1/sqrt(d)]."""
        d = 64
        centroids = _lloyds_beta(8, d=d)
        bound = 1.0 / np.sqrt(d)
        assert np.all(centroids >= -bound - 1e-10)
        assert np.all(centroids <= bound + 1e-10)

    def test_scale_with_dimension(self):
        """Centroid magnitude should decrease with increasing d."""
        c_small = _lloyds_beta(8, d=32)
        c_large = _lloyds_beta(8, d=128)
        assert np.max(np.abs(c_small)) > np.max(np.abs(c_large))

    def test_16_centroids(self):
        """A 4-bit Beta codebook should produce 16 centroids."""
        centroids = _lloyds_beta(16, d=64)
        assert len(centroids) == 16
        assert np.all(np.diff(centroids) > 0)


class TestComputeCentroids:
    """Test the unified compute_centroids dispatcher."""

    def test_use_beta_false_matches_optimal(self):
        """use_beta=False should give the same result as optimal_centroids."""
        for b in [1, 2, 3]:
            for d in [64, 128]:
                c1 = compute_centroids(b, d, use_beta=False)
                c2 = optimal_centroids(b, d)
                np.testing.assert_array_equal(c1, c2)

    def test_use_beta_true_small_d(self):
        """use_beta=True with small d should use the Beta codebook."""
        # For b >= 3 and d < 256, the Beta path should be taken
        c_beta = compute_centroids(3, d=64, use_beta=True)
        c_gauss = compute_centroids(3, d=64, use_beta=False)
        # They should differ (Beta vs Gaussian optimization)
        assert not np.allclose(c_beta, c_gauss, atol=1e-8)

    def test_use_beta_true_large_d_falls_back(self):
        """use_beta=True with large d should fall back to Gaussian."""
        c_beta = compute_centroids(3, d=256, use_beta=True)
        c_gauss = compute_centroids(3, d=256, use_beta=False)
        np.testing.assert_array_equal(c_beta, c_gauss)

    def test_use_beta_true_low_bits_falls_back(self):
        """use_beta=True with bit_width < 3 should fall back to Gaussian."""
        for b in [1, 2]:
            c_beta = compute_centroids(b, d=64, use_beta=True)
            c_gauss = compute_centroids(b, d=64, use_beta=False)
            np.testing.assert_array_equal(c_beta, c_gauss)

    def test_beta_centroids_still_sorted(self):
        """Beta centroids should still be sorted."""
        c = compute_centroids(3, d=64, use_beta=True)
        assert np.all(np.diff(c) > 0)

    def test_beta_centroids_correct_count(self):
        """Beta centroids should have 2^b entries."""
        c = compute_centroids(4, d=64, use_beta=True)
        assert len(c) == 16

    def test_beta_centroids_symmetric(self):
        """Beta centroids should be symmetric for a symmetric Beta."""
        c = compute_centroids(3, d=64, use_beta=True)
        np.testing.assert_allclose(c, -c[::-1], atol=1e-6)

    def test_default_use_beta_false(self):
        """The default use_beta=False should match optimal_centroids."""
        c1 = compute_centroids(3, 128)
        c2 = optimal_centroids(3, 128)
        np.testing.assert_array_equal(c1, c2)