LORD-ZYTHOZ/turboquant-node
╔══════════════════════════════════════════════════════════════════════════════╗
║  🔬  TURBOQUANT (ICLR 2026) — NODE.JS + BUN + APPLE SILICON MLX  🔬          ║
║                                                                              ║
║  Rust napi-rs bindings for compressed vector search.                         ║
║  MLX KV cache compression for Qwen3 — 32k → 128k context on M-series.        ║
║  Production case study. 7 research findings. 3 bugs found and fixed.         ║
╚══════════════════════════════════════════════════════════════════════════════╝

> WHAT_IS_TURBOQUANT.exe

TurboQuant is a vector quantization algorithm from Google Research, published at ICLR 2026. It compresses high-dimensional vectors while preserving the inner products needed for similarity search.

The key property: similarity is computed directly on compressed codes β€” no decompression step.

Input vector  (f32[768])
  → Random orthogonal rotation     ← QR/Haar — decorrelates dimensions
  → PolarQuant MSE quantization    ← 4–8 bit lossy compression
  → QJL residual sketch            ← 1-bit per projection, corrects norm bias
  → TurboCode (radii + angles + sign bits)
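The pipeline above can be sketched in toy form. This is an illustrative stand-in, not the real TurboQuant algorithm: a seeded permutation plus sign flips plays the role of the QR/Haar rotation, uniform scalar quantization stands in for the PolarQuant codebooks, and seeded sign projections of the residual stand in for the QJL sketch.

```python
import random

def toy_encode(vec, bits=4, n_sketch=16, seed=42):
    """Toy stand-in for the TurboQuant pipeline (illustration only)."""
    rng = random.Random(seed)
    d = len(vec)
    # 1. Pseudo-rotation: seeded permutation + sign flips (a cheap,
    #    norm-preserving stand-in for a QR/Haar orthogonal matrix).
    perm = list(range(d))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    rotated = [signs[i] * vec[perm[i]] for i in range(d)]
    # 2. Uniform scalar quantization (stand-in for PolarQuant MSE codebooks).
    scale = max(abs(x) for x in rotated) or 1.0
    levels = 2 ** bits - 1
    codes = [round((x / scale + 1.0) / 2.0 * levels) for x in rotated]
    dequant = [(c / levels * 2.0 - 1.0) * scale for c in codes]
    # 3. 1-bit sketch of the *residual* only (stand-in for QJL).
    residual = [x - y for x, y in zip(rotated, dequant)]
    sketch = [
        1 if sum(rng.gauss(0.0, 1.0) * r for r in residual) >= 0.0 else 0
        for _ in range(n_sketch)
    ]
    return codes, sketch, scale

rng = random.Random(0)
v = [rng.gauss(0.0, 1.0) for _ in range(768)]
codes, sketch, scale = toy_encode(v)
```

Because every stage is seeded, the same seed always reproduces the same codes, which is why (as noted below) the seed is effectively part of the stored format.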

> THIS_REPO.exe

Two production streams — one research writeup:

Stream 1  →  rust-bindings/    napi-rs wrapper for Node.js / Bun
             Compressed embedding search. No decompression at query time.

Stream 2  →  mlx-kv/           Python patches for Qwen3 on Apple Silicon
             Asymmetric KV cache (K→6bit, V→4bit). 32k → 128k effective context.

Research  →  research/FINDINGS.md   7 findings from 5 repos, 2 implementation streams

Built while integrating TurboQuant into Theorex, a persistent multi-agent memory system that searches 4320+ concept embeddings at query time.


> ARCHITECTURE.exe

graph LR
    subgraph JS["⚡ NODE.JS / BUN"]
        NQ["NativeQuantizer\n(768, 8, 192, 42n)"]
        ENC["encode(f32[768])\n→ Buffer 2545B"]
        EST["innerProductEstimate()\n→ similarity score"]
    end

    subgraph RUST["🦀 RUST (napi-rs)"]
        ROT["Orthogonal Rotation\nQR / Haar matrix"]
        PQ["PolarQuant\nLloyd-Max codebook"]
        QJL["QJL Residual Sketch\n1-bit projections"]
        TC["TurboCode\nradii + angles + signs"]
    end

    subgraph MLX["🍎 APPLE SILICON MLX"]
        HC["HybridCache\n256-token fp16 window"]
        CS["Compressed Store\nK→6bit · V→4bit"]
        QA["Qwen3Attention\nMonkeyPatch forward()"]
    end

    NQ --> ROT
    ROT --> PQ
    PQ --> QJL
    QJL --> TC
    TC --> ENC
    ENC --> EST

    QA --> HC
    HC --> CS

    style JS fill:#1a0d00,stroke:#ff8800,color:#ff8800
    style RUST fill:#0d0800,stroke:#ffaa44,color:#ffaa44
    style MLX fill:#0a0a0a,stroke:#ff6600,color:#ff6600

> QUICKSTART_NODE.exe

cd rust-bindings
# Requires the Rust toolchain: curl https://sh.rustup.rs | sh
npm install          # or: bun install
npm run build        # → turbo-quant-native.darwin-arm64.node (395KB)
import { NativeQuantizer } from "./rust-bindings/index.js";

// 768d embeddings — nomic-embed-text, all-MiniLM, etc.
// seed is baked into stored codes — must match at query time
const quantizer = new NativeQuantizer(768, 8, 192, 42n);

// Compress for storage (Postgres BYTEA, Redis, etc.)
const embedding = new Float32Array(768);
const code: Buffer = quantizer.encode(embedding);

// Search — no decompression needed
const queryVec = new Float32Array(768);
const score: number = quantizer.innerProductEstimate(code, queryVec);

Two-stage search pattern

import { search } from "./examples/compressed-search.js";

// Pre-filter 200 candidates on compressed codes → rerank top 10 on full embeddings
const rows = await db`SELECT id, label, compressed_vector, embedding FROM concepts`;
const results = search(rows, queryEmbedding, { preFilterN: 200, topK: 10 });

> QUICKSTART_MLX.exe

cd mlx-kv
pip install mlx-lm requests pytest numpy
import mlx_lm
from patches.cache_manager import HybridCache
from patches.qwen3_attention import install_caches, patch_qwen3_attention

model, tokenizer = mlx_lm.load("mlx-community/Qwen3-32B-Instruct-4bit")

# 256-token fp16 window + compressed long-term store
caches = install_caches(model, window_size=256)
patch_qwen3_attention(model)

# MLX is lazy — ALWAYS warm up before real inference
_ = mlx_lm.generate(model, tokenizer, prompt="warmup", max_tokens=1)

response = mlx_lm.generate(model, tokenizer, prompt="...", max_tokens=500)

> BIT_WIDTH_GUIDE.exe

┌─────────────────┬────────┬────────┬─────────────┬──────────────┐
│  Use Case       │ K bits │ V bits │ Compression │ Quality Loss │
├─────────────────┼────────┼────────┼─────────────┼──────────────┤
│  Max quality    │   6    │   4    │     ~4x     │   < 0.5%     │
│  Sweet spot     │   4    │   3    │     ~6x     │   ~ 1.2%     │
│  Aggressive     │   3    │   2    │     ~8x     │  noticeable  │
│  Embed search   │   8    │   —    │     ~1.2x   │   < 0.1%     │
└─────────────────┴────────┴────────┴─────────────┴──────────────┘

Keys need more bits — their error shifts attention logits, which the softmax amplifies exponentially.
Values tolerate more loss — their errors are averaged out by the attention-score weighting.
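The compression/quality pattern in the table follows standard scalar-quantization behaviour: each extra bit roughly halves the reconstruction error. A minimal illustration with a uniform quantizer (not the Lloyd-Max codebooks this repo uses):

```python
import random

def mean_abs_error(bits, n=20000, seed=0):
    """Mean |x - q(x)| for a uniform quantizer on [-1, 1]."""
    rng = random.Random(seed)
    levels = 2 ** bits - 1
    total = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        code = round((x + 1.0) / 2.0 * levels)
        total += abs(x - (code / levels * 2.0 - 1.0))
    return total / n

# Error shrinks by roughly 2x per extra bit: the gap between the
# "Sweet spot" and "Aggressive" rows is a quality cliff, not a step.
errs = {b: mean_abs_error(b) for b in (2, 3, 4, 6)}
```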

> RESEARCH_FINDINGS.exe

▶ 7 findings from 5 repos and 2 production streams

Finding 1 — QJL must target the residual only
The PyTorch reference broke because it applied QJL to the full input vector. QJL corrects the norm bias left by PolarQuant; applied to the full vector it amplifies variance instead. The RecursiveIntell Rust crate applies it correctly. If your QJL tests fail, check where it is being applied.
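To see why the residual is the right target, compare magnitudes: after even 4-bit quantization the residual carries only a small fraction of the vector's energy, so a 1-bit sketch only has to correct a small term, whereas pointed at the full vector its variance scales with the full norm. A toy check, using a uniform quantizer as a stand-in:

```python
import math
import random

rng = random.Random(0)
v = [rng.gauss(0.0, 1.0) for _ in range(768)]

# 4-bit uniform quantization (stand-in for PolarQuant)
scale = max(abs(x) for x in v)
levels = 15
dequant = [(round((x / scale + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0) * scale
           for x in v]
residual = [a - b for a, b in zip(v, dequant)]

def norm(xs):
    return math.sqrt(sum(x * x for x in xs))

# The residual holds a small slice of the total energy: that small
# term, not the full vector, is what the sketch should correct.
ratio = norm(residual) / norm(v)
```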

Finding 2 — Asymmetric bit allocation is not optional
K→6bit, V→4bit is the correct split. We found a copy-paste bug where values were quantized at 6-bit. Fixed.

Finding 3 — Seed is baked into stored codes
The random projection matrix is seeded. Changing the seed invalidates all stored codes. Document it. Never change it without a full backfill.
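This is easy to reproduce with any seeded projection. Using seeded sign projections as a hypothetical stand-in for the real rotation matrix: the same seed always yields the same code, while a different seed yields an incompatible one.

```python
import random

def sign_code(vec, seed, m=64):
    """Hypothetical stand-in: m seeded random projections -> m sign bits."""
    rng = random.Random(seed)
    return tuple(
        1 if sum(rng.gauss(0.0, 1.0) * x for x in vec) >= 0.0 else 0
        for _ in range(m)
    )

v = [float(i % 7) - 3.0 for i in range(768)]
same = sign_code(v, seed=42) == sign_code(v, seed=42)  # deterministic
diff = sign_code(v, seed=42) != sign_code(v, seed=7)   # incompatible codes
```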

Finding 4 — napi6 required for BigInt
Default napi4 doesn't support BigInt. The seed is a u64 → JS bigint. Use features = ["napi6"] in Cargo.toml.
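A minimal Cargo.toml fragment for this (version numbers are illustrative; check the napi-rs docs for current releases):

```toml
[dependencies]
# The default N-API baseline has no BigInt support; the u64 seed
# needs the napi6 feature to cross the boundary as a JS bigint.
napi = { version = "2", features = ["napi6"] }
napi-derive = "2"
```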

Finding 5 — MLX warmup is mandatory
MLX uses lazy evaluation. First inference after server start produces garbage. Always send a throwaway prompt before running real tests or accepting traffic.

Finding 6 — Kernel fusion matters on Apple Silicon
Naive sequential ops saturate the memory bus on M-series. The Rust crate fuses rotate → quantize → sketch into one pass. Significantly faster than equivalent TypeScript.

Finding 7 — Large prefill evicts recent tokens (CRITICAL)
If prefill > window size, naive eviction compresses the full chunk including most-recent tokens. Only evict the oldest excess tokens; keep newest window_size in fp16. This caused 2K needle test failures.
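The fix is a windowing rule, sketched here with hypothetical names (the repo's real logic lives in HybridCache): compress only the oldest overflow and always keep the newest window_size entries in fp16.

```python
def split_for_eviction(tokens, window_size):
    """Return (to_compress, keep_fp16): only the oldest excess tokens
    go to the compressed store; the newest window_size tokens stay in
    the fp16 window, even after a large prefill."""
    if len(tokens) <= window_size:
        return [], list(tokens)
    excess = len(tokens) - window_size
    return list(tokens[:excess]), list(tokens[excess:])

# A 1000-token prefill with a 256-token window: the naive bug compressed
# all 1000 tokens; the fix keeps the newest 256 uncompressed.
old, recent = split_for_eviction(list(range(1000)), window_size=256)
```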

Full writeup: research/FINDINGS.md


> BUGS_FIXED.exe

┌───┬──────────────────────────────────────────────────────────────┬──────────┐
│ # │ Bug                                                          │ Severity │
├───┼──────────────────────────────────────────────────────────────┼──────────┤
│ 1 │ Large prefill evicts recent tokens — compresses too eagerly  │ CRITICAL │
│ 2 │ Continuous batching crash — batch dim grows mid-decode       │ HIGH     │
│ 3 │ V-cache quantized at 6-bit instead of 4-bit (copy-paste)     │ MEDIUM   │
└───┴──────────────────────────────────────────────────────────────┴──────────┘

> PROJECT_STRUCTURE.exe

turboquant-node/
├── rust-bindings/              ← napi-rs Rust package
│   ├── src/lib.rs              ← NativeQuantizer: encode + innerProductEstimate
│   ├── Cargo.toml              ← depends on RecursiveIntell/turbo-quant
│   └── index.d.ts              ← TypeScript types (napi-rs generated)
├── examples/
│   ├── compressed-search.ts    ← two-stage pre-filter + rerank
│   └── backfill.ts             ← re-encode existing embeddings
├── mlx-kv/
│   ├── patches/
│   │   ├── cache_manager.py    ← HybridCache: fp16 window + compressed store
│   │   ├── qwen3_attention.py  ← monkey-patch for Qwen3Attention.forward
│   │   └── quantizers.py       ← Lloyd-Max 4-bit + 6-bit codebooks
│   └── tests/
│       ├── conftest.py         ← warmup_server fixture
│       ├── test_cache_manager.py   ← unit tests (all 3 bugs covered)
│       └── test_needle.py      ← needle-in-haystack integration tests
└── research/
    └── FINDINGS.md             ← full research writeup


┌───────────────────────────────────────────────────────────────────┐
│  Built in Sydney · Part of the Theorex multi-agent memory system  │
│  5 repos evaluated · 7 findings · 3 bugs fixed · 1260 tests pass  │
└───────────────────────────────────────────────────────────────────┘

MIT License · Built on RecursiveIntell/turbo-quant

