TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Updated Mar 29, 2026 · Python
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
No BS theatrics. Real automated pentesting. macOS only.
TurboQuant: Native 3-Bit Quantization for Ollama - Achieve 25-28% better compression than Q4_0 while maintaining high-speed CPU inference. Experimentally integrated into Ollama with custom GGML kernels for LLM efficiency.
Hardware-agnostic machine learning infrastructure for .NET. Implements high-performance neural network layers in C# that are transpiled to run on WebGPU, CUDA, OpenCL, WebGL, CPU, and Wasm via SpawnDev.ILGPU. Optimized for Blazor WebAssembly and native GPU execution.
TurboQuant-style embedding compression for RAG: an SDK using fixed rotations, PolarQuant, and QJL residual sketches for compact storage and fast similarity search
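The QJL ingredient mentioned in the description above amounts to a Johnson-Lindenstrauss random projection whose output is quantized to one bit per coordinate (a SimHash-style sign sketch). A minimal numpy illustration of that idea — not the SDK's actual API, and all names here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_sketch(x, R):
    """1-bit random-projection sketch: keep only the signs of x @ R."""
    return (x @ R) > 0

def cosine_from_sketch(a_bits, b_bits):
    """Estimate cos(a, b) from the fraction of agreeing sign bits (SimHash identity)."""
    agree = np.mean(a_bits == b_bits)
    return np.cos(np.pi * (1.0 - agree))

d, m = 128, 1024                          # embedding dim, sketch size in bits
R = rng.standard_normal((d, m))           # shared random projection ("fixed rotation")

a = rng.standard_normal(d)
b = a + 0.3 * rng.standard_normal(d)      # a noisy near-duplicate of a

a_bits, b_bits = sign_sketch(a, R), sign_sketch(b, R)
true_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
est_cos = cosine_from_sketch(a_bits, b_bits)
```

Each vector shrinks from 128 float32 values (512 bytes) to 1024 bits (128 bytes), a 4x reduction, while the estimated cosine stays close to the true one.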
KV Cache with PagedAttention vs PagedAttention + TurboQuant - experiments across token sizes comparing memory, latency, and accuracy.
TurboQuant (ICLR 2026) ported to Apple Silicon — KV cache compression with MLX Metal kernels + PyTorch CPU
Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026) — PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
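For readers unfamiliar with low-bit KV cache quantization, the core mechanism can be sketched in a few lines of numpy. This is a plain per-row symmetric 3-bit quantizer for illustration only; TurboQuant's actual method layers PolarQuant rotations and QJL residual sketches on top, and the function names below are invented for the example:

```python
import numpy as np

def quantize_3bit(x):
    """Symmetric per-row 3-bit quantization: round to integers in [-4, 3]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 4.0
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero rows
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize_3bit(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128, 64)).astype(np.float32)  # heads x tokens x head_dim

q, scale = quantize_3bit(kv)
kv_hat = dequantize_3bit(q, scale)
mean_abs_err = float(np.abs(kv - kv_hat).mean())
```

At 3 bits plus one fp16 scale per 64-value row, storage is roughly 16 / (3 + 16/64) ≈ 4.9x smaller than fp16; the higher ratios the repos quote come from the additional rotation and residual-sketch machinery.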
AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.
Interactive Benchmarking Tool for TurboQuant KV Cache Compression. Supports 2-4 bit quantization with Real-time Metrics
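A toy version of the 2-4 bit sweep such a benchmarking tool performs: uniform symmetric quantization at each bit width on synthetic data, measuring reconstruction MSE. This is a generic sketch under assumed uniform quantization, not the tool's own code:

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x at a given bit width."""
    levels = 2 ** (bits - 1)                  # e.g. 3 bits -> integers in [-4, 3]
    scale = np.abs(x).max() / levels
    q = np.clip(np.round(x / scale), -levels, levels - 1)
    return q * scale

rng = np.random.default_rng(42)
kv = rng.standard_normal(4096).astype(np.float32)  # stand-in for cached K/V values

mse = {bits: float(np.mean((kv - quantize(kv, bits)) ** 2)) for bits in (2, 3, 4)}
```

Each extra bit halves the quantization step, so the measured MSE drops sharply from 2 bits to 4 bits — the memory/accuracy trade-off the benchmark visualizes.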
AI Code Review Memory - learns from your team's bug history and warns when similar patterns appear
Implementation of https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
Turbo Index
Experimental TurboQuant implementation and llama.cpp-style integration path for long-context inference
ChatMind: Semantic search for Discord & KakaoTalk chat messages. Search by meaning, not keywords. Powered by TurboQuant compression (ICLR 2026).
CommitMind: Semantic search for Git commit history powered by TurboQuant vector compression (ICLR 2026). Search commits by meaning, not just keywords.