
Add support for ParoQuant and custom quantization method loading#209

Open
liang2kl wants to merge 1 commit into jundot:main from liang2kl:paroquant

Conversation


@liang2kl liang2kl commented Mar 13, 2026

PR Summary

I would like to add support for ParoQuant to omlx so that ParoQuant models can be loaded directly.

ParoQuant is a newly proposed quantization method that achieves much higher accuracy than the current state of the art (see ml-explore/mlx-lm#977 (comment)) while introducing very little overhead. It is also a good fit for mlx-lm: it only needs to apply an additional lightweight transform before the mlx-native quantized matrix multiplication. The official ParoQuant repository already contains a custom model loader for mlx, so the required modifications are minimal.

To support ParoQuant and, potentially, other non-standard quantization methods, this PR refactors model loading into an extensible custom-quantization loading flow, with minimal changes to the original codebase.

What changed

  • Added a centralized model loader that:
    • detects quantization_config.quant_method from local config.json
    • dispatches supported custom quantization formats (currently paroquant)
    • falls back to standard mlx-lm / mlx-vlm loaders when no custom format is detected
  • Updated core load call sites to use the shared loader:
    • batched LLM engine
    • VLM engine
    • LLM model wrapper
    • causal-LM reranker loader

Why

  • Enables ParoQuant models without invasive engine changes.
  • Creates a clean path to add more custom quantization formats later.
  • Keeps external API small and maintainable.

Preliminary benchmarks of z-lab/Qwen3.5-4B-PARO on my 16 GB M4, compared with mlx-community/Qwen3.5-4B-4bit:

(Screenshot: benchmark results, Mar 13, 2026)

jundot (Owner) commented Mar 13, 2026

Really interesting project. To be honest, I've been skeptical about mlx's quantization quality, so ParoQuant's mlx support feels meaningful. Personally I'd love to test at least Qwen3.5 122B or larger, but it seems quantization is only possible on CUDA for now. Really wish that weren't the case. (Providing quantized versions of popular mid-to-large-scale models would really help with adoption.)

As for the code, it looks generally solid. One minor note: when the paroquant loading path is taken, the `tokenizer_config` from the caller is silently dropped. It doesn't cause real issues right now, since the config only carries `trust_remote_code` (which defaults to false anyway), but it would be better to pass it through to stay consistent with how omlx handles tokenizer setup.
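The suggested fix could be as small as threading the argument through; a minimal sketch, assuming a hypothetical `load_paroquant` signature (not the PR's actual code):

```python
# Sketch: accept and honor tokenizer_config on the custom path instead of
# dropping it. The function name and return shape are illustrative only.
def load_paroquant(model_path, tokenizer_config=None):
    tokenizer_config = tokenizer_config or {}
    # Honor trust_remote_code the same way the standard mlx-lm path does.
    trust_remote_code = tokenizer_config.get("trust_remote_code", False)
    return {"path": model_path, "trust_remote_code": trust_remote_code}
```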

Overall this looks good to me. Since there aren't many ParoQuant models available yet, I'd like to do more testing once additional models are uploaded before merging. Just give me a little time!

liang2kl (Author)

Thanks! We are currently quantizing larger models like Qwen3.5-27B and will work on larger MoEs later. We are also exploring ways to make quantization feasible on edge devices; as of now, the computational resources required for calibration are prohibitive. I'll let you know once the models are ready!

@jundot jundot force-pushed the main branch 10 times, most recently from 338f98a to d90bf8a Compare March 22, 2026 10:25
@jundot jundot force-pushed the main branch 11 times, most recently from 187e87b to dfc5b20 Compare March 29, 2026 10:44
