LORD-ZYTHOZ/turboquant-node
╔══════════════════════════════════════════════════════════════════════════════╗
║  🔬  TURBOQUANT (ICLR 2026) — NODE.JS + BUN + APPLE SILICON MLX  🔬          ║
║                                                                              ║
║  Rust napi-rs bindings for compressed vector search.                         ║
║  MLX KV cache compression for Qwen3 — 32k → 128k context on M-series.        ║
║  Production case study. 7 research findings. 3 bugs found and fixed.         ║
╚══════════════════════════════════════════════════════════════════════════════╝

> WHAT_IS_TURBOQUANT.exe

TurboQuant is a vector quantization algorithm from Google Research, published at ICLR 2026. It compresses high-dimensional vectors while preserving the inner products needed for similarity search.

The key property: similarity is computed directly on compressed codes β€” no decompression step.

Input vector  (f32[768])
  → Random orthogonal rotation     ← QR/Haar — decorrelates dimensions
  → PolarQuant MSE quantization    ← 4–8 bit lossy compression
  → QJL residual sketch            ← 1-bit per projection, corrects norm bias
  → TurboCode (radii + angles + sign bits)
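The pipeline above can be sketched in toy form. This is an illustrative stand-in, not the real TurboQuant algorithm: a seeded permutation plus sign flips plays the role of the QR/Haar rotation, uniform scalar quantization stands in for the PolarQuant codebooks, and seeded sign projections of the residual stand in for the QJL sketch.

```python
import random

def toy_encode(vec, bits=4, n_sketch=16, seed=42):
    """Toy stand-in for the TurboQuant pipeline (illustration only)."""
    rng = random.Random(seed)
    d = len(vec)
    # 1. Pseudo-rotation: seeded permutation + sign flips (a cheap,
    #    norm-preserving stand-in for a QR/Haar orthogonal matrix).
    perm = list(range(d))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    rotated = [signs[i] * vec[perm[i]] for i in range(d)]
    # 2. Uniform scalar quantization (stand-in for PolarQuant MSE codebooks).
    scale = max(abs(x) for x in rotated) or 1.0
    levels = 2 ** bits - 1
    codes = [round((x / scale + 1.0) / 2.0 * levels) for x in rotated]
    dequant = [(c / levels * 2.0 - 1.0) * scale for c in codes]
    # 3. 1-bit sketch of the *residual* only (stand-in for QJL).
    residual = [x - y for x, y in zip(rotated, dequant)]
    sketch = [
        1 if sum(rng.gauss(0.0, 1.0) * r for r in residual) >= 0.0 else 0
        for _ in range(n_sketch)
    ]
    return codes, sketch, scale

rng = random.Random(0)
v = [rng.gauss(0.0, 1.0) for _ in range(768)]
codes, sketch, scale = toy_encode(v)
```

Because every stage is seeded, the same seed always reproduces the same codes, which is why (as noted below) the seed is effectively part of the stored format.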

> THIS_REPO.exe

Two production streams — one research writeup:

Stream 1  →  rust-bindings/    napi-rs wrapper for Node.js / Bun
             Compressed embedding search. No decompression at query time.

Stream 2  →  mlx-kv/           Python patches for Qwen3 on Apple Silicon
             Asymmetric KV cache (K→6bit, V→4bit). 32k → 128k effective context.

Research  →  research/FINDINGS.md   7 findings from 5 repos, 2 implementation streams

Built while integrating TurboQuant into Theorex, a persistent multi-agent memory system that searches 4320+ concept embeddings at query time.


> ARCHITECTURE.exe

graph LR
    subgraph JS["⚡ NODE.JS / BUN"]
        NQ["NativeQuantizer\n(768, 8, 192, 42n)"]
        ENC["encode(f32[768])\n→ Buffer 2545B"]
        EST["innerProductEstimate()\n→ similarity score"]
    end

    subgraph RUST["🦀 RUST (napi-rs)"]
        ROT["Orthogonal Rotation\nQR / Haar matrix"]
        PQ["PolarQuant\nLloyd-Max codebook"]
        QJL["QJL Residual Sketch\n1-bit projections"]
        TC["TurboCode\nradii + angles + signs"]
    end

    subgraph MLX["🍎 APPLE SILICON MLX"]
        HC["HybridCache\n256-token fp16 window"]
        CS["Compressed Store\nK→6bit · V→4bit"]
        QA["Qwen3Attention\nMonkeyPatch forward()"]
    end

    NQ --> ROT
    ROT --> PQ
    PQ --> QJL
    QJL --> TC
    TC --> ENC
    ENC --> EST

    QA --> HC
    HC --> CS

    style JS fill:#1a0d00,stroke:#ff8800,color:#ff8800
    style RUST fill:#0d0800,stroke:#ffaa44,color:#ffaa44
    style MLX fill:#0a0a0a,stroke:#ff6600,color:#ff6600

> QUICKSTART_NODE.exe

cd rust-bindings
# Requires the Rust toolchain: curl https://sh.rustup.rs | sh
npm install          # or: bun install
npm run build        # → turbo-quant-native.darwin-arm64.node (395KB)
import { NativeQuantizer } from "./rust-bindings/index.js";

// 768d embeddings — nomic-embed-text, all-MiniLM, etc.
// seed is baked into stored codes — must match at query time
const quantizer = new NativeQuantizer(768, 8, 192, 42n);

// Compress for storage (Postgres BYTEA, Redis, etc.)
const embedding = new Float32Array(768);
const code: Buffer = quantizer.encode(embedding);

// Search — no decompression needed
const queryVec = new Float32Array(768);
const score: number = quantizer.innerProductEstimate(code, queryVec);

Two-stage search pattern

import { search } from "./examples/compressed-search.js";

// Pre-filter 200 candidates on compressed codes → rerank top 10 on full embeddings
const rows = await db`SELECT id, label, compressed_vector, embedding FROM concepts`;
const results = search(rows, queryEmbedding, { preFilterN: 200, topK: 10 });

> QUICKSTART_MLX.exe

cd mlx-kv
pip install mlx-lm requests pytest numpy
import mlx_lm
from patches.cache_manager import HybridCache
from patches.qwen3_attention import install_caches, patch_qwen3_attention

model, tokenizer = mlx_lm.load("mlx-community/Qwen3-32B-Instruct-4bit")

# 256-token fp16 window + compressed long-term store
caches = install_caches(model, window_size=256)
patch_qwen3_attention(model)

# MLX is lazy — ALWAYS warm up before real inference
_ = mlx_lm.generate(model, tokenizer, prompt="warmup", max_tokens=1)

response = mlx_lm.generate(model, tokenizer, prompt="...", max_tokens=500)

> BIT_WIDTH_GUIDE.exe

┌─────────────────┬────────┬────────┬─────────────┬──────────────┐
│  Use Case       │ K bits │ V bits │ Compression │ Quality Loss │
├─────────────────┼────────┼────────┼─────────────┼──────────────┤
│  Max quality    │   6    │   4    │     ~4x     │   < 0.5%     │
│  Sweet spot     │   4    │   3    │     ~6x     │   ~ 1.2%     │
│  Aggressive     │   3    │   2    │     ~8x     │  noticeable  │
│  Embed search   │   8    │   —    │     ~1.2x   │   < 0.1%     │
└─────────────────┴────────┴────────┴─────────────┴──────────────┘

Keys need more bits — their error shifts attention logits, which the softmax amplifies exponentially.
Values tolerate more loss — their errors are averaged out by the attention-score weighting.
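The compression/quality pattern in the table follows standard scalar-quantization behaviour: each extra bit roughly halves the reconstruction error. A minimal illustration with a uniform quantizer (not the Lloyd-Max codebooks this repo uses):

```python
import random

def mean_abs_error(bits, n=20000, seed=0):
    """Mean |x - q(x)| for a uniform quantizer on [-1, 1]."""
    rng = random.Random(seed)
    levels = 2 ** bits - 1
    total = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        code = round((x + 1.0) / 2.0 * levels)
        total += abs(x - (code / levels * 2.0 - 1.0))
    return total / n

# Error shrinks by roughly 2x per extra bit: the gap between the
# "Sweet spot" and "Aggressive" rows is a quality cliff, not a step.
errs = {b: mean_abs_error(b) for b in (2, 3, 4, 6)}
```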

> RESEARCH_FINDINGS.exe

▶ 7 findings from 5 repos and 2 production streams

Finding 1 — QJL must target the residual only
The PyTorch reference broke because it applied QJL to the full input vector. QJL corrects the norm bias left by PolarQuant; applied to the full vector it amplifies variance instead. The RecursiveIntell Rust crate applies it correctly. If your QJL tests fail, check where it is being applied.
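To see why the residual is the right target, compare magnitudes: after even 4-bit quantization the residual carries only a small fraction of the vector's energy, so a 1-bit sketch only has to correct a small term, whereas pointed at the full vector its variance scales with the full norm. A toy check, using a uniform quantizer as a stand-in:

```python
import math
import random

rng = random.Random(0)
v = [rng.gauss(0.0, 1.0) for _ in range(768)]

# 4-bit uniform quantization (stand-in for PolarQuant)
scale = max(abs(x) for x in v)
levels = 15
dequant = [(round((x / scale + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0) * scale
           for x in v]
residual = [a - b for a, b in zip(v, dequant)]

def norm(xs):
    return math.sqrt(sum(x * x for x in xs))

# The residual holds a small slice of the total energy: that small
# term, not the full vector, is what the sketch should correct.
ratio = norm(residual) / norm(v)
```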

Finding 2 — Asymmetric bit allocation is not optional
K→6bit, V→4bit is the correct split. We found a copy-paste bug where values were quantized at 6-bit. Fixed.

Finding 3 — Seed is baked into stored codes
The random projection matrix is seeded. Changing the seed invalidates all stored codes. Document it. Never change it without a full backfill.
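This is easy to reproduce with any seeded projection. Using seeded sign projections as a hypothetical stand-in for the real rotation matrix: the same seed always yields the same code, while a different seed yields an incompatible one.

```python
import random

def sign_code(vec, seed, m=64):
    """Hypothetical stand-in: m seeded random projections -> m sign bits."""
    rng = random.Random(seed)
    return tuple(
        1 if sum(rng.gauss(0.0, 1.0) * x for x in vec) >= 0.0 else 0
        for _ in range(m)
    )

v = [float(i % 7) - 3.0 for i in range(768)]
same = sign_code(v, seed=42) == sign_code(v, seed=42)  # deterministic
diff = sign_code(v, seed=42) != sign_code(v, seed=7)   # incompatible codes
```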

Finding 4 — napi6 required for BigInt
Default napi4 doesn't support BigInt. The seed is a u64 → JS bigint. Use features = ["napi6"] in Cargo.toml.
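A minimal Cargo.toml fragment for this (version numbers are illustrative; check the napi-rs docs for current releases):

```toml
[dependencies]
# The default N-API baseline has no BigInt support; the u64 seed
# needs the napi6 feature to cross the boundary as a JS bigint.
napi = { version = "2", features = ["napi6"] }
napi-derive = "2"
```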

Finding 5 — MLX warmup is mandatory
MLX uses lazy evaluation. First inference after server start produces garbage. Always send a throwaway prompt before running real tests or accepting traffic.

Finding 6 — Kernel fusion matters on Apple Silicon
Naive sequential ops saturate the memory bus on M-series. The Rust crate fuses rotate → quantize → sketch into one pass. Significantly faster than equivalent TypeScript.

Finding 7 — Large prefill evicts recent tokens (CRITICAL)
If prefill > window size, naive eviction compresses the full chunk including most-recent tokens. Only evict the oldest excess tokens; keep newest window_size in fp16. This caused 2K needle test failures.
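The fix is a windowing rule, sketched here with hypothetical names (the repo's real logic lives in HybridCache): compress only the oldest overflow and always keep the newest window_size entries in fp16.

```python
def split_for_eviction(tokens, window_size):
    """Return (to_compress, keep_fp16): only the oldest excess tokens
    go to the compressed store; the newest window_size tokens stay in
    the fp16 window, even after a large prefill."""
    if len(tokens) <= window_size:
        return [], list(tokens)
    excess = len(tokens) - window_size
    return list(tokens[:excess]), list(tokens[excess:])

# A 1000-token prefill with a 256-token window: the naive bug compressed
# all 1000 tokens; the fix keeps the newest 256 uncompressed.
old, recent = split_for_eviction(list(range(1000)), window_size=256)
```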

Full writeup: research/FINDINGS.md


> BUGS_FIXED.exe

┌───┬──────────────────────────────────────────────────────────────┬──────────┐
│ # │ Bug                                                          │ Severity │
├───┼──────────────────────────────────────────────────────────────┼──────────┤
│ 1 │ Large prefill evicts recent tokens — compresses too eagerly  │ CRITICAL │
│ 2 │ Continuous batching crash — batch dim grows mid-decode       │ HIGH     │
│ 3 │ V-cache quantized at 6-bit instead of 4-bit (copy-paste)     │ MEDIUM   │
└───┴──────────────────────────────────────────────────────────────┴──────────┘

> PROJECT_STRUCTURE.exe

turboquant-node/
├── rust-bindings/              ← napi-rs Rust package
│   ├── src/lib.rs              ← NativeQuantizer: encode + innerProductEstimate
│   ├── Cargo.toml              ← depends on RecursiveIntell/turbo-quant
│   └── index.d.ts              ← TypeScript types (napi-rs generated)
├── examples/
│   ├── compressed-search.ts    ← two-stage pre-filter + rerank
│   └── backfill.ts             ← re-encode existing embeddings
├── mlx-kv/
│   ├── patches/
│   │   ├── cache_manager.py    ← HybridCache: fp16 window + compressed store
│   │   ├── qwen3_attention.py  ← monkey-patch for Qwen3Attention.forward
│   │   └── quantizers.py       ← Lloyd-Max 4-bit + 6-bit codebooks
│   └── tests/
│       ├── conftest.py         ← warmup_server fixture
│       ├── test_cache_manager.py   ← unit tests (all 3 bugs covered)
│       └── test_needle.py      ← needle-in-haystack integration tests
└── research/
    └── FINDINGS.md             ← full research writeup


┌───────────────────────────────────────────────────────────────────┐
│  Built in Sydney · Part of the Theorex multi-agent memory system  │
│  5 repos evaluated · 7 findings · 3 bugs fixed · 1260 tests pass  │
└───────────────────────────────────────────────────────────────────┘

MIT License · Built on RecursiveIntell/turbo-quant

