“And the compression is totally lossless.”
no, it’s not.
TurboRAG requires Python 3.10+ and installs as a lightweight package with numpy as its core dependency.
Install the latest version directly from GitHub:
```bash
python -m pip install "git+https://github.com/diyagk01/TurboRAG.git"
```

Use TurboRAG in three steps: initialize one compressor, encode document embeddings into bytes, then score query embeddings directly against those bytes with `inner_product_estimate(...)`.
```python
import numpy as np
from turboquant_rag import TurboQuantCompressor

# 1) Create a deterministic compressor
compressor = TurboQuantCompressor.new(
    dim=1536,
    angle_bits=4,
    projections=32,
    seed=42,
)

# 2) Encode a document embedding
doc_vec = np.random.randn(1536).astype(np.float32)
doc_code = compressor.encode(doc_vec)

# 3) Score a query embedding against the compressed code
query_vec = np.random.randn(1536).astype(np.float32)
score = compressor.inner_product_estimate(doc_code, query_vec)

# Optional reconstruction
doc_hat = compressor.decode(doc_code)

print("compressed bytes:", len(doc_code))
print("score:", float(score))
print("decoded shape:", doc_hat.shape)
```

For indexing workloads, use the batch APIs and persistence:
```python
codes = compressor.encode_batch(np.random.randn(128, 1536).astype(np.float32))
decoded = compressor.decode_batch(codes)

compressor.save("artifacts/turboquant_model")
reloaded = TurboQuantCompressor.load("artifacts/turboquant_model")
```

The public API:

- `TurboQuantCompressor.new(dim, angle_bits=4, projections=32, seed=42)`
- `TurboQuantCompressor.fit(samples, angle_bits=4, projections=32, seed=42)`
- `encode(vector) -> bytes`
- `decode(code) -> np.ndarray`
- `inner_product_estimate(code, query_vector) -> float`
- `inner_product_estimate_batch(codes, query_vector) -> np.ndarray`
- `encode_batch(matrix) -> list[bytes]`
- `decode_batch(codes) -> np.ndarray`
- `save(path) -> None`
- `TurboQuantCompressor.load(path) -> TurboQuantCompressor`
TurboRAG is an SDK for compressing embedding vectors in RAG systems. Instead of storing full-precision embeddings, TurboRAG encodes them into a smaller representation and reconstructs them when needed. This reduces memory usage and can speed up retrieval in large vector databases.
This implementation is based on "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", which proposes compressing high-dimensional vectors through random rotation, polar quantization, and residual correction to maintain vector similarity.
```
turboquant_rag/
├── turboquant_rag/                # Core SDK package
│   ├── __init__.py                # Public exports: TurboQuantCompressor, fit_compressor
│   ├── turboquant.py              # Compression engine (rotation + PolarQuant + QJL residual sketch)
│   ├── factory.py                 # Factory helper for compressor creation
│   └── utils.py                   # Input validation utilities
├── tests/                         # Unit tests for core behavior
│   ├── test_turboquant.py         # Encode/decode, scoring signal, save/load
│   └── test_factory.py            # Factory helper checks
├── benchmark/
│   ├── export_benchmark_data.py   # Embedding export for offline benchmarking
│   └── rust_turbo_quant_bench/    # Rust benchmarking workspace (experimental)
├── README.md                      # Main SDK documentation
├── RAG.md                         # Minimal integration snippet
└── pyproject.toml                 # Packaging and dependency config
```

TurboRAG compresses high-dimensional embeddings by deterministically mapping each input vector to a compact byte code in three stages: an orthogonal rotation, polar quantization of the rotated coordinates, and a sign sketch of the residual.
The first step applies a fixed orthogonal rotation \(R \in \mathbb{R}^{d \times d}\), generated deterministically from the seed. The rotated vector is:

\[
z = R x
\]

Because \(R\) is orthogonal, rotation preserves norms and inner products (\(\|z\| = \|x\|\) and \(\langle R x, R y \rangle = \langle x, y \rangle\)), so this stage loses nothing; it only spreads the signal evenly across coordinates.

In the SDK, this rotation matrix is created once in `new(...)` and reused for every `encode` and `decode` call.
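The rotation step can be sketched in a few lines of numpy. This is not the SDK's internal implementation, only a minimal illustration of a seed-deterministic orthogonal matrix (QR decomposition of a seeded Gaussian matrix) and the invariances it guarantees; `make_rotation` is a hypothetical helper, not part of the package:

```python
import numpy as np

def make_rotation(dim: int, seed: int) -> np.ndarray:
    """Build a deterministic orthogonal matrix from a seed via QR decomposition."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the factorization (and hence the matrix) is unique.
    return q * np.sign(np.diag(r))

R = make_rotation(8, seed=42)
x, y = np.random.randn(8), np.random.randn(8)

# Orthogonality preserves norms and inner products exactly.
assert np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x))
assert np.isclose((R @ x) @ (R @ y), x @ y)
```

Because the matrix depends only on `dim` and `seed`, it never has to be stored: decode can regenerate it bit-for-bit.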
The base quantization stage operates on 2D coordinate pairs \((z_{2i}, z_{2i+1})\). For each pair, we convert Cartesian coordinates to polar form, keep the radius in full precision, and quantize only the angle.

For each pair:

- The radius \(r_i = \sqrt{z_{2i}^2 + z_{2i+1}^2}\) is stored losslessly as a 32-bit float.
- The angle \(\theta_i = \operatorname{atan2}(z_{2i+1}, z_{2i})\) is quantized into \(L = 2^b\) bins, where \(b = \text{angle\_bits}\), with angular step size:

\[
\Delta = \frac{2\pi}{L}
\]

This PolarQuant stage produces a compact base approximation whose angular error per pair is bounded by \(\Delta / 2\).
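The PolarQuant stage above can be sketched directly in numpy. Again, this is an illustration rather than the SDK's actual code; `polar_quantize` and `polar_dequantize` are hypothetical helpers:

```python
import numpy as np

def polar_quantize(z: np.ndarray, angle_bits: int = 4):
    """Split z into 2D pairs; keep each radius exactly, quantize each angle to 2**b bins."""
    pairs = z.reshape(-1, 2)
    r = np.hypot(pairs[:, 0], pairs[:, 1]).astype(np.float32)  # radius, stored losslessly
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])               # angle in (-pi, pi]
    step = 2 * np.pi / 2 ** angle_bits                         # angular step size 2*pi / L
    bins = np.round(theta / step).astype(np.int64) % 2 ** angle_bits
    return r, bins, step

def polar_dequantize(r: np.ndarray, bins: np.ndarray, step: float) -> np.ndarray:
    """Rebuild the pairs from exact radii and quantized angle-bin centers."""
    theta_hat = bins * step
    return np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1).ravel()

z = np.random.randn(16).astype(np.float32)
r, bins, step = polar_quantize(z, angle_bits=4)
z_hat = polar_dequantize(r, bins, step)
# Worst-case angle error is step / 2 = pi / 2**angle_bits per pair; radii are exact.
```

Storing 4-bit angles instead of a full-precision coordinate is where most of the base compression comes from.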
The residual represents what the base quantization missed:

\[
e = z - \hat{z}_{\text{base}}
\]

Instead of storing full residual coordinates, the SDK generates \(m = \text{projections}\) seeded Gaussian directions \(g_1, \dots, g_m\) and records only the sign of the residual along each one:

\[
s_j = \operatorname{sign}(\langle g_j, e \rangle), \qquad j = 1, \dots, m
\]

Only these sign bits (one bit per projection) are stored, which keeps memory usage extremely low while still retaining useful directional information.
Decoding reconstructs the approximate residual using a sign-sketch estimator: because \(\mathbb{E}[s_j g_j]\) is proportional to \(e\), summing the signed directions

\[
\hat{e} \propto \sum_{j=1}^{m} s_j g_j
\]

recovers the residual direction up to scale. The full reconstruction combines both parts:

\[
\hat{x} = R^{\top}\left(\hat{z}_{\text{base}} + \hat{e}\right)
\]

Because the SDK reuses the same seeded rotation and projection directions for decoding, no per-vector matrices need to be stored, and reconstruction is fully deterministic.
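A sign sketch of this kind can be illustrated with plain numpy. This is a sketch under stated assumptions, not the SDK's code: `sketch_encode` and `sketch_decode` are hypothetical helpers, and storing the residual norm alongside the sign bits (to fix the scale at decode time) is an assumption of this example:

```python
import numpy as np

def sketch_encode(e: np.ndarray, projections: int, seed: int):
    """Keep only the sign of e along m seeded Gaussian directions, plus its norm."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((projections, e.size))
    signs = np.sign(G @ e).astype(np.int8)     # one bit of information per projection
    return signs, float(np.linalg.norm(e))     # norm kept to fix the scale (assumption)

def sketch_decode(signs: np.ndarray, norm: float, dim: int, seed: int) -> np.ndarray:
    """Regenerate the same seeded directions and average them with their signs."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((signs.size, dim))
    est = signs @ G                            # E[s_j g_j] is proportional to e
    return norm * est / max(np.linalg.norm(est), 1e-12)

e = np.random.randn(64)
signs, norm = sketch_encode(e, projections=512, seed=7)
e_hat = sketch_decode(signs, norm, dim=64, seed=7)
print("cosine(e, e_hat):", e @ e_hat / (np.linalg.norm(e) * norm))
```

More projections tighten the direction estimate at the cost of more bits per vector; the SDK's default of 32 projections trades accuracy for very small codes.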
For retrieval tasks, the SDK avoids decoding or storing full vectors. Given a query vector \(q\), it is rotated once into \(z_q = R q\) and scored directly against the stored code:

\[
\langle x, q \rangle \approx \langle \hat{z}_{\text{base}}, z_q \rangle + \langle \hat{e}, z_q \rangle
\]

This allows fast similarity scoring using only compact codes and a small correction term, making TurboRAG's retrieval both efficient and accurate without reconstructing full-precision embeddings.
We evaluate TurboRAG on the RAGBench dataset, using the techqa subset and train split in our current pipeline. The evaluation script builds a retrieval corpus directly from RAGBench documents, normalizes mixed document formats, chunks text into 220-word windows with 40-word overlap, deduplicates chunks, and caps the index at 4,000 chunks for a cost-controlled benchmark run. Each chunk is embedded with text-embedding-3-small at 1536 dimensions, then compressed with TurboQuant using angle_bits=4, projections=32, and seed=42.
At retrieval time, queries are embedded with the same model and scored against compressed document codes via inner_product_estimate(...), then ranked with top-k search (k=5). For quick validation, the script runs a mini-eval over 5 dataset questions and reports Hit@1, Hit@3, and Hit@5 using a source-row match proxy (a hit is counted when the query’s original row appears in retrieved metadata at cutoff k). This gives a concrete signal of ranking preservation under compression without requiring a full generator in the loop.
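The source-row match proxy described above is simple enough to pin down in a few lines. A minimal sketch, assuming retrieved metadata is reduced to a ranked list of source-row ids per query (`hit_at_k` is a hypothetical helper, not part of the evaluation script):

```python
def hit_at_k(retrieved_rows: list[int], source_row: int, k: int) -> bool:
    """Source-row match proxy: a hit iff the query's original row appears in the top-k."""
    return source_row in retrieved_rows[:k]

# Hypothetical ranked source rows for one query (best match first)
retrieved = [17, 4, 92, 55, 8]
print(hit_at_k(retrieved, source_row=4, k=1))  # False: row 4 is not ranked first
print(hit_at_k(retrieved, source_row=4, k=3))  # True: row 4 appears within the top 3
```

Averaging this boolean over all evaluated questions at each cutoff yields the reported Hit@1, Hit@3, and Hit@5.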
In addition to end-to-end retrieval checks, the SDK has deterministic correctness tests for the compressor itself. The test suite verifies that decoded vectors maintain high semantic similarity (cosine > 0.90), that score estimation is numerically consistent (|estimated − decoded_dot| < 1e−3), that estimated vs true dot products have strong correlation (> 0.9), and that serialization is stable (save/load reconstruction matches). These checks validate that compression quality remains strong while reducing storage to compact byte codes.
