Description
Hi MLX community. I have a Qwen3.5-4B LoRA adapter that I merged into the base model and converted to MLX format using mlx_lm.convert with quantize=False. (The goal is to run the fine-tuned model on a Mac Studio/MacBook.)
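For reference, a minimal sketch of the conversion step (paths here are hypothetical; depending on the mlx_lm version, dtype may need to be set explicitly, since some versions default to float16):

```python
# Minimal sketch of the conversion step; paths are hypothetical.
from mlx_lm import convert

convert(
    hf_path="./qwen-merged-hf",    # merged base+LoRA checkpoint in HF format
    mlx_path="./qwen-merged-mlx",  # output directory for the MLX weights
    quantize=False,                # keep full-precision weights, no quantization
    dtype="bfloat16",              # match the HF checkpoint's BF16 weights
)
```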
When I run greedy decoding (argmax, no sampling) on the same prompt, the MLX version produces different output than the original HuggingFace Transformers version (both base+LoRA and the merged weights in HF format produce identical results, so I suspect the MLX conversion may have issues). The first ~20-30 tokens match, then the outputs start to diverge (with bad results and quality). Both backends use BF16 weights. I have verified the weight loading, tokenization, LoRA merge, and MLX conversion, but can't find any clues.

Is MLX expected to produce exactly the same output as HuggingFace Transformers for the same model and prompt with greedy decoding? (I am using mlx-lm now.) Has anyone run into this before? Thank you!
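To narrow down where the divergence starts, one option is to compare the per-position argmax from a single forward pass over the same token IDs in both frameworks, which removes sampling and tokenizer differences from the picture. A minimal sketch, assuming the merged checkpoints live at the hypothetical local paths below:

```python
import mlx.core as mx
import torch
from mlx_lm import load
from transformers import AutoModelForCausalLM, AutoTokenizer

HF_PATH = "./qwen-merged-hf"    # hypothetical path to the merged HF checkpoint
MLX_PATH = "./qwen-merged-mlx"  # hypothetical path to the converted MLX model

prompt = "Explain LoRA in one sentence."

# HuggingFace side: one forward pass, greedy argmax at each position.
hf_tok = AutoTokenizer.from_pretrained(HF_PATH)
hf_model = AutoModelForCausalLM.from_pretrained(HF_PATH, torch_dtype=torch.bfloat16)
hf_ids = hf_tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    hf_logits = hf_model(hf_ids).logits[0]
hf_argmax = hf_logits.argmax(dim=-1).tolist()

# MLX side: feed the exact same token IDs (reusing the HF tokenizer output
# to rule out tokenization differences), same greedy argmax.
mlx_model, mlx_tok = load(MLX_PATH)
mlx_logits = mlx_model(mx.array(hf_ids.tolist()))[0]
mlx_argmax = mx.argmax(mlx_logits, axis=-1).tolist()

# Report the first position where the two backends disagree.
for i, (a, b) in enumerate(zip(hf_argmax, mlx_argmax)):
    if a != b:
        print(f"first divergence at position {i}: hf={a} mlx={b}")
        break
else:
    print("argmax matches at every prompt position")
```

If the argmax already differs within the prompt, the weights or compute differ between the backends; if it only diverges after many generated tokens, it may just be accumulated BF16 rounding differences between the two implementations.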
One idea I have is that Qwen3.5 is a unified vision-text model; I tried mlx-vlm, but it seems Qwen3.5 is not supported there.
Thanks