Add support for ParoQuant and custom quantization method loading #209
liang2kl wants to merge 1 commit into jundot:main
Conversation
Really interesting project. To be honest, I've been skeptical about MLX's quantization quality, so Paro's MLX support feels meaningful. Personally I'd love to test at least Qwen3.5 122B or larger, but it seems quantization is only possible on CUDA for now. Really wish that weren't the case. (Providing quantized versions of popular mid-to-large-scale models would really help with adoption.)

As for the code, it looks generally solid. One minor note: when the paroquant loading path is taken, …

Overall this looks good to me. Since there aren't many Paro models available yet, I'd like to do more testing once additional models are uploaded before merging. Just give me a little time!
Thanks! We are currently quantizing larger models like Qwen3.5-27B and will work on larger MoEs later. We are also exploring ways to make quantization feasible on edge devices; as of now, the computational resources required for calibration are prohibitive. I'll let you know once the models are ready!
PR Summary
I would like to bring support for ParoQuant to omlx so that ParoQuant models can be loaded directly.

ParoQuant is a newly proposed quantization method that achieves much higher accuracy than the current state of the art (see ml-explore/mlx-lm#977 (comment)) while introducing very small overhead. It is also very friendly to mlx-lm: it only needs to apply an additional lightweight transform before executing the MLX-native quantized matrix multiplication. The official ParoQuant repository already contains a custom model loader for MLX, so the required modifications are minimal.

To support ParoQuant and potentially other non-standard quantization methods, this PR refactors model loading into a future-friendly custom quantization loading flow, with minimal changes to the original codebase.
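The custom quantization loading flow described above can be sketched roughly as follows. This is an illustrative outline only; the registry, function names, and return values are hypothetical, not omlx's actual API:

```python
import json
import pathlib

# Hypothetical registry mapping quant_method values to custom loaders.
# The lambda is a stand-in for the real ParoQuant loader.
CUSTOM_LOADERS = {
    "paroquant": lambda path: ("paroquant", path),
}

def detect_quant_method(model_dir):
    """Read quantization_config.quant_method from the model's local config.json."""
    config = json.loads((pathlib.Path(model_dir) / "config.json").read_text())
    return config.get("quantization_config", {}).get("quant_method")

def load_model(model_dir):
    method = detect_quant_method(model_dir)
    loader = CUSTOM_LOADERS.get(method)
    if loader is not None:
        return loader(model_dir)   # custom quantization path (e.g. ParoQuant)
    return ("mlx-lm", model_dir)   # fall back to the stock mlx-lm/mlx-vlm loaders
```

New methods would only need to register a loader keyed by their `quant_method` string, leaving the default path untouched.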
What changed
- Detect the quantization method via `quantization_config.quant_method` from the local `config.json`
- Add a custom loading path for ParoQuant (`paroquant`)
- Fall back to the stock `mlx-lm`/`mlx-vlm` loaders when no custom format is detected

Why
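The "lightweight transform before the quantized matmul" can be illustrated with a toy rotation-based example (purely illustrative; ParoQuant's actual transform and quantizer differ). An orthogonal matrix `R` is folded into the weights offline, and at inference the activations are rotated before the quantized matrix multiplication, so the product is unchanged up to quantization error:

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Simulate symmetric per-row quantization (dequantized for clarity)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
W = rng.normal(size=(d_out, d_in))

# Random orthogonal rotation via QR decomposition (toy stand-in for the
# learned transform); since R @ R.T == I, (x @ R) @ (W @ R).T == x @ W.T.
R, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))

Wq = fake_quantize(W @ R)      # rotate and quantize the weights offline
x = rng.normal(size=(1, d_in))
y = (x @ R) @ Wq.T             # runtime: one extra matmul, then quantized matmul
y_ref = x @ W.T                # full-precision reference
```

The only runtime cost is the extra activation transform, which is why the overhead stays small relative to the quantized matmul itself.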
Preliminary benchmarks of `z-lab/Qwen3.5-4B-PARO` on my 16G M4, compared with `mlx-community/Qwen3.5-4B-4bit`: