Experimental TurboQuant implementation and integration path for a llama.cpp-style runtime.
This repository is designed to help builders who want:
- a portable TurboQuant core
- practical benchmark tooling
- reviewable `llama.cpp` integration patches
- honest guidance about speed, memory, and validation status
TurboQuant is exciting, but most users still need a practical path from paper ideas to runtime experiments.
This repo tries to make that path useful:
- a portable C++ core you can build quickly
- extracted `llama.cpp` patch sets you can inspect and discuss
- validation scripts for baseline vs TurboQuant runs
- explicit usage profiles instead of one-size-fits-all claims
| Area | What You Get |
|---|---|
| Portable core | Rotation, quantization, and TurboQuant building blocks in standalone C++ |
| Runtime path | Reviewable llama.cpp-style integration patches |
| Validation | CLI-based baseline vs TurboQuant comparison script |
| Docs | Architecture, profiles, validated model notes, and roadmap |
- studying TurboQuant-style long-context tradeoffs
- experimenting with KV compression in a `llama.cpp` ecosystem
- benchmarking speed-first vs memory-first profiles on local hardware
- extracting reusable algorithm pieces for other backends
This project is the publishable home for the V6 Alpha TurboQuant work:
- exact TurboQuant algorithm experiments
- fused CUDA attention-path experiments
- packed KV-cache storage experiments
- long-context benchmarking on local GPUs
This is a lab implementation and research bridge.
It is:
- not the original Google release
- not a claim of paper-perfect production parity
- a practical open implementation path for TurboQuant ideas in a `llama.cpp` ecosystem
It aims to be useful for users who want to:
- study the algorithm
- build the portable pieces quickly
- evaluate long-context tradeoffs on their own hardware
- apply and iterate on the runtime integration as patch sets
- portable core in `cpp/`
- benchmark tool in `cpp/tools`
- validator in `scripts/validate_llama_cli.py`
- extracted patch sets in `patches/llama.cpp/generated`
- benchmark summary in `docs/BENCHMARK_HIGHLIGHTS.md`
The standalone TurboQuant core is model-agnostic.
It operates on:
- vector dimension
- bit budget
- QJL projection size
- optional outlier-channel settings
So the core itself is not tied to Qwen3.5-9B.
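As a sketch of what those knobs look like in practice, here is a hypothetical configuration struct plus a plain uniform quantizer over a vector's own [min, max] range. The struct and function names are assumptions for illustration, not the repo's actual API, and the quantizer is a stand-in for the real rotation/QJL building blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical illustration of the model-agnostic knobs; field names
// are assumptions, not the repo's actual API.
struct CoreConfig {
    int dim     = 128;   // vector dimension
    int bits    = 4;     // bit budget per component
    int qjl_dim = 128;   // QJL projection size
    bool outlier_channels = false;  // optional outlier-channel handling
};

// Plain uniform quantization of one vector to 2^bits levels over its
// own [min, max] range, then dequantization back to float.
std::vector<float> quantize_roundtrip(const std::vector<float>& v, int bits) {
    float lo = *std::min_element(v.begin(), v.end());
    float hi = *std::max_element(v.begin(), v.end());
    float levels = float((1 << bits) - 1);
    float scale = (hi > lo) ? (hi - lo) / levels : 1.0f;
    std::vector<float> out(v.size());
    for (size_t i = 0; i < v.size(); ++i) {
        float q = std::round((v[i] - lo) / scale);  // integer code in [0, levels]
        out[i] = lo + q * scale;                    // reconstructed value
    }
    return out;
}
```

With this scheme the per-component round-trip error is bounded by half the step size, which is why the bit budget is the central speed/quality knob.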
The current runtime-oriented work has been validated mainly against:
- `Qwen3.5-9B-Q8_0.gguf`
- `Qwen3.5-27B-Q3_K_M.gguf`
That means:
- the algorithm is generic
- the current `llama.cpp` integration and rollout tuning are still Qwen-focused
Other transformer-family models should be possible, but they may need:
- different layer rollout
- different bit/QJL tuning
- fresh validation
- model-specific long-context benchmarking
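One way to picture that tuning surface is a per-model profile grouping the knobs that would change between model families. The fields and example values below are illustrative assumptions, not settings the repo ships.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-model tuning profile; illustrative only.
struct ModelProfile {
    std::string family;     // e.g. a GGUF model family name
    int first_quant_layer;  // layer rollout: keep early layers full-precision
    int kv_bits;            // bit budget for quantized KV entries
    int qjl_dim;            // QJL projection size
};

// A new transformer family would get its own profile like this,
// then fresh validation and long-context benchmarking.
ModelProfile example_profile() {
    return ModelProfile{"qwen-family", 4, 4, 128};
}
```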
The current public scope includes:
- a portable TurboQuant C++ core
- a standalone benchmark tool
- a validation script for `llama.cpp` runs
- extracted `llama.cpp` patch sets
- architecture, quickstart, usage-profile, benchmark-highlight, validated-model, and roadmap documentation
| Directory | Contents |
|---|---|
| `cpp/` | Portable TurboQuant C++ core and standalone tools |
| `scripts/` | Validation and patch-export tooling |
| `docs/` | Architecture, publishing notes, quickstart, and usage guidance |
| `patches/llama.cpp/` | Home for extracted `llama.cpp` integration patch sets |
| `benchmarks/` | Benchmark notes and future published reports |
| `assets/` | Visual assets for the GitHub landing page and future publishing material |
This repository is licensed under Apache-2.0.
See LICENSE.
- contribution guide: CONTRIBUTING.md
- code of conduct: CODE_OF_CONDUCT.md
- support policy: SUPPORT.md
- security policy: SECURITY.md
- roadmap: docs/ROADMAP.md
- benchmark highlights: docs/BENCHMARK_HIGHLIGHTS.md
- launch kit: docs/LAUNCH_KIT.md
If this repo helps you:
- star the repository
- share the repo link with your benchmark notes
- open an issue with reproducible results
- contribute model-validation results for hardware other than the current Qwen-focused runs
Build the standalone benchmark:

```
cmake -S . -B build
cmake --build build -j
./build/turboquant-bench --dim 128 --bits 4 --qjl-dim 128 --samples 32 --queries 8
```

Validate a `llama.cpp` binary against baseline and TurboQuant profiles:

```
python3 scripts/validate_llama_cli.py compare \
  --bin /path/to/llama-cli \
  --model /path/to/model.gguf \
  --output-dir /tmp/turboquant-compare
```

Export the current lab integration into reviewable patch files:

```
scripts/export_llama_cpp_patches.sh
```

The generated patch sets are written under `patches/llama.cpp/generated/`.
For a guided start, read:
- docs/QUICKSTART.md
- docs/USAGE_PROFILES.md
- docs/VALIDATED_MODELS.md
- docs/BENCHMARK_HIGHLIGHTS.md
- docs/LAUNCH_KIT.md
Included:
- portable algorithm core
- standalone benchmark executable
- validation tooling
- generated `llama.cpp` patch sets
- documentation for profiles and validated models
Not included:
- full runtime behavior across all model families
- universal speed wins
- production-grade stability across every rollout shape
This repository does not claim:
- official affiliation with Google
- official TurboQuant reference status
- guaranteed paper-level benchmark parity on every backend
- one universal best profile for all workloads