TurboQuant support? #1064

QROST · 2026-03-26T19:25:44Z

https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

Someone is already trying it, with promising RAM savings:
https://www.reddit.com/r/LocalLLaMA/comments/1s36vnk/looking_for_feedback_porting_googles_turboquant/

QROST · 2026-03-26T19:35:45Z

From Gemini 3.1 Pro:

Proposal: Architectural Support for TurboQuant Integration in mlx-lm
Technical Context
TurboQuant introduces its own bit-packing and dequantization logic, which differs from the grouped affine quantization (per-group scales and biases at a configurable bit width and group size) currently implemented in MLX. To achieve the performance gains promised by this method on Apple Silicon, the runtime and the conversion pipeline would need to be updated together.
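For reference, here is a rough pure-Python sketch of what grouped affine quantization does: each group of weights gets its own scale and offset, and every value is rounded to the nearest integer code. The function names are invented for this illustration and this is not MLX's actual implementation, only the general idea.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # A constant group has zero range; fall back to scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / levels or 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo  # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float values."""
    return [c * scale + bias for c in codes]


def quantize(weights, group_size=4, bits=4):
    """Split a flat list of weights into groups and quantize each independently."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]
```

The reconstruction error per value is at most half a quantization step (`scale / 2`), which is the baseline any alternative scheme would be compared against.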
Required Modifications to mlx-lm
Integrating TurboQuant necessitates deep changes to the model loading and inference logic.
Custom Metal Kernels: The primary bottleneck involves writing specialized fused kernels in Metal. These kernels must handle the TurboQuant-specific memory layout to avoid the overhead of intermediate dequantization in Python.
New Layer Archetypes: A dedicated TurboLinear layer should be added to the nn module. This layer will manage unique scaling factors and bit-width configurations during the forward pass.
Metadata Parsing: The model loader requires an update to recognize a new quantization_method: "turboquant" key within the config.json. This ensures the architecture correctly routes tensors to the specialized kernels rather than the standard linear implementation.
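The metadata routing described above could look roughly like the sketch below. It assumes the proposed `quantization_method` key sits at the top level of `config.json` and that a missing key means the existing grouped path; `select_layer`, the `"affine"` default, and the returned layer names are placeholders for illustration, not existing mlx-lm API.

```python
import json

# Hypothetical mapping from the proposed config key to a layer type name.
KNOWN_METHODS = {
    "affine": "QuantizedLinear",   # existing grouped quantization path
    "turboquant": "TurboLinear",   # proposed layer, does not exist yet
}


def select_layer(config):
    """Return the name of the linear layer type a loader would instantiate."""
    if "quantization" not in config and "quantization_method" not in config:
        return "Linear"  # unquantized model
    # Assumption: models without the new key use the current grouped scheme.
    method = config.get("quantization_method", "affine")
    if method not in KNOWN_METHODS:
        raise ValueError(f"Unknown quantization_method: {method!r}")
    return KNOWN_METHODS[method]


config = json.loads(
    '{"quantization": {"group_size": 64, "bits": 4}, "quantization_method": "turboquant"}'
)
print(select_layer(config))
```

Defaulting to the existing path when the key is absent keeps every model already published loadable without touching its config.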
Quantization Pipeline & mlx-community Effort
The transition to TurboQuant implies a significant shift for the model-hosting ecosystem.
Repacking Logic: Existing safetensors files cannot be patched. We must develop new conversion scripts that ingest FP16 weights and repack them into the TurboQuant bit-stream format.
Calibration Requirements: Unlike simple round-to-nearest quantization, TurboQuant may require a calibration step with a representative dataset to maintain high precision at lower bit-rates. This adds computational complexity to the initial conversion process for every model in the mlx-community.
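To make "repacking" concrete, here is a generic sketch of packing low-bit integer codes into 32-bit words and unpacking them again. The actual TurboQuant bit-stream layout is not specified in this thread, so the little-endian-within-word ordering and the function names below are assumptions for illustration only.

```python
def pack_codes(codes, bits=4, word_bits=32):
    """Pack small integer codes into word_bits-wide integers, lowest bits first."""
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for i, code in enumerate(codes[start:start + per_word]):
            if not 0 <= code <= mask:
                raise ValueError(f"code {code} does not fit in {bits} bits")
            word |= code << (i * bits)
        words.append(word)
    return words


def unpack_codes(words, count, bits=4, word_bits=32):
    """Inverse of pack_codes; count trims the zero padding in the last word."""
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    codes = []
    for word in words:
        for i in range(per_word):
            codes.append((word >> (i * bits)) & mask)
    return codes[:count]
```

A conversion script would run something like this once per tensor after quantizing the FP16 weights, and a fused kernel would perform the unpacking on the fly.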
Backward Compatibility and Ecosystem Impact
Maintaining stability for the existing user base is critical during this transition.
Legacy Support: The introduction of TurboQuant should not deprecate current grouped quantization methods. Existing models in the community must remain functional using the current QuantizedLinear paths.
Interoperability: TurboQuant models will inherently lack backward compatibility with older versions of mlx-lm. Users attempting to load these models on unpatched versions of the library will encounter shape mismatches or kernel execution errors.
Versioning Strategy: A clear versioning flag in the model metadata is necessary to prevent runtime crashes when users mix-and-match library versions and model formats.
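A version check along these lines would turn the failure mode into a readable error instead of a shape mismatch. This is only a sketch: the `quantization_format_version` field, the exception class, and the table of supported versions are all hypothetical and would need to be agreed on by the maintainers.

```python
# Hypothetical: format versions this build of the library knows how to load.
SUPPORTED_FORMAT_VERSIONS = {"turboquant": {1}}


class UnsupportedModelFormat(RuntimeError):
    """Raised when a model uses a quantization format this library cannot load."""


def check_format(config):
    """Fail early with a clear message if the model format is not supported."""
    method = config.get("quantization_method")
    if method is None:
        return  # legacy or unquantized model, nothing to check
    supported = SUPPORTED_FORMAT_VERSIONS.get(method)
    if supported is None:
        raise UnsupportedModelFormat(
            f"Quantization method {method!r} is not supported; try upgrading the library."
        )
    version = config.get("quantization_format_version")
    if version not in supported:
        raise UnsupportedModelFormat(
            f"{method} format version {version!r} is not supported "
            f"(supported: {sorted(supported)})."
        )
```

Note that library versions released before such a check exists would still fail the old way, so the flag mainly helps going forward.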


QROST · 2026-03-26T19:43:03Z

Also: https://huggingface.co/flovflo/turboquant-mlx-qwen35-kv

