
Add support for ParoQuant and custom quantization method loading#209

Open
liang2kl wants to merge 1 commit into jundot:main from liang2kl:paroquant

Conversation


@liang2kl liang2kl commented Mar 13, 2026

PR Summary

I would like to add support for ParoQuant to omlx so that ParoQuant models can be loaded directly.

ParoQuant is a newly proposed quantization method that achieves much higher accuracy than the current state of the art (see ml-explore/mlx-lm#977 (comment)) while introducing very little overhead. It is also a good fit for mlx-lm: it only needs to apply an additional lightweight transform before the mlx-native quantized matrix multiplication. The official ParoQuant repository already contains a custom model loader for mlx, so the required modifications are minimal.

To support ParoQuant and, potentially, other non-standard quantization methods, this PR refactors model loading into an extensible custom-quantization loading flow, with minimal changes to the original codebase.

What changed

  • Added a centralized model loader that:
    • detects quantization_config.quant_method from local config.json
    • dispatches supported custom quantization formats (currently paroquant)
    • falls back to standard mlx-lm / mlx-vlm loaders when no custom format is detected
  • Updated core load call sites to use the shared loader:
    • batched LLM engine
    • VLM engine
    • LLM model wrapper
    • causal-LM reranker loader

Why

  • Enables ParoQuant models without invasive engine changes.
  • Creates a clean path to add more custom quantization formats later.
  • Keeps external API small and maintainable.

Preliminary benchmarks of z-lab/Qwen3.5-4B-PARO on my 16 GB M4, compared with mlx-community/Qwen3.5-4B-4bit:

(Screenshot: benchmark results, Mar 13, 2026)

jundot (Owner) commented Mar 13, 2026

Really interesting project. To be honest, I've been skeptical about mlx's quantization quality, so ParoQuant's mlx support feels meaningful. Personally I'd love to test at least Qwen3.5 122B or larger, but it seems quantization is only possible on CUDA for now. Really wish that weren't the case. (Providing quantized versions of popular mid-to-large-scale models would really help with adoption.)

As for the code, it looks generally solid. One minor note: when the paroquant loading path is taken, the `tokenizer_config` from the caller is silently dropped. It doesn't cause real issues right now, since the config only carries `trust_remote_code` (which defaults to false anyway), but it would be better to pass it through to stay consistent with how omlx handles tokenizer setup.
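The suggested fix could be as small as threading the argument through; a minimal sketch, assuming a hypothetical `load_paroquant` signature (not the PR's actual code):

```python
# Sketch: accept and honor tokenizer_config on the custom path instead of
# dropping it. The function name and return shape are illustrative only.
def load_paroquant(model_path, tokenizer_config=None):
    tokenizer_config = tokenizer_config or {}
    # Honor trust_remote_code the same way the standard mlx-lm path does.
    trust_remote_code = tokenizer_config.get("trust_remote_code", False)
    return {"path": model_path, "trust_remote_code": trust_remote_code}
```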

Overall this looks good to me. Since there aren't many ParoQuant models available yet, I'd like to do more testing once additional models are uploaded before merging. Just give me a little time!

liang2kl (Author)

Thanks! We are currently quantizing larger models like Qwen3.5-27B and will work on larger MoEs later. We are also exploring ways to make quantization feasible on edge devices; as of now, the computational resources required for calibration are prohibitive. I'll let you know once the models are ready!

@jundot jundot force-pushed the main branch 10 times, most recently from 338f98a to d90bf8a Compare March 22, 2026 10:25
@jundot jundot force-pushed the main branch 11 times, most recently from 187e87b to dfc5b20 Compare March 29, 2026 10:44
