mattjcly approved these changes Mar 31, 2026

This PR unifies the Qwen 3.5 model architecture. It requires a patch to the upstream mlx-lm model class so that mrope can be used for multi-modal inputs. There is a large amount of unification code here because of slight differences between the mlx-lm and mlx-vlm implementations. To be faithful to each, I call the upstream mlx-lm components for text-only prompts, and I have re-implemented the mlx-vlm components outside of their classes for accessibility on vision prompts.
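The text-only/vision split described above can be sketched as a simple dispatch. This is an illustrative, minimal sketch; the class and function names are hypothetical, not the PR's actual API:

```python
# Hypothetical sketch of the dispatch described above: text-only prompts go
# through the upstream mlx-lm path, vision prompts through components ported
# from mlx-vlm. All names here are illustrative.

from typing import Any, Optional


class PatchedComponent:
    """Wraps an upstream mlx-lm component, adding a vision (mrope) path."""

    def __init__(self, upstream_forward, vlm_forward):
        self.upstream_forward = upstream_forward  # upstream mlx-lm implementation
        self.vlm_forward = vlm_forward            # implementation ported from mlx-vlm

    def __call__(self, x: Any, position_ids: Optional[Any] = None) -> Any:
        # Vision prompts carry explicit mrope position_ids; text-only prompts
        # do not, so the presence of position_ids drives the dispatch.
        if position_ids is None:
            return self.upstream_forward(x)
        return self.vlm_forward(x, position_ids)


# Usage: the same wrapped component serves both prompt types.
component = PatchedComponent(
    upstream_forward=lambda x: ("text", x),
    vlm_forward=lambda x, pos: ("vision", x, pos),
)
text_out = component([1, 2, 3])                          # text-only path
vision_out = component([1, 2, 3], position_ids=[0, 1, 2])  # vision path
```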
Patched DecoderLayer
I have patched the upstream mlx-lm decoder layer to call mlx-lm's `GatedDeltaNet` or `Qwen3NextAttention` (depending on the layer type) for text-only prompts. For vision prompts, it calls the implementations ported from mlx-vlm in `_vlm_gated_delta_net` and `_mrope_attention`. The former is theoretically the same across text and vision prompts, but there are slight differences in the implementations that lead to logits differences.

Patched Text Model
I patched the mlx-lm text model to manage the lifecycle of `position_ids` and `rope_deltas` from mlx-vlm. It uses these values in `_compute_position_ids` for multi-modal prompts. The implementation is based on mlx-vlm's position ID computation.

MOE Model
The MoE model implementation in mlx-lm inherits from the dense model, so we only need to patch the dense implementation. The MoE model has a different architecture, though, so we need both vision add-ons. I made the MoE vision add-on inherit from the dense add-on, as the implementation is mostly shared.
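The add-on hierarchy might look like the following. This is a hedged sketch with hypothetical names and placeholder values, shown only to illustrate the inheritance structure:

```python
# Illustrative sketch of the vision add-on hierarchy described above: the MoE
# add-on inherits from the dense one and overrides only what differs between
# the two architectures. Names and values are hypothetical.

class DenseVisionAddOn:
    """Vision add-on for the dense Qwen 3.5 model."""

    def embed_images(self, images):
        # Shared embedding logic lives on the dense add-on.
        return [("embedded", img) for img in images]

    def hidden_size(self):
        return 2048  # placeholder value for the dense arch


class MoEVisionAddOn(DenseVisionAddOn):
    """MoE variant: reuses the dense implementation wholesale, overriding
    only architecture-specific details."""

    def hidden_size(self):
        return 4096  # placeholder: the MoE arch differs here
```

Keeping the shared logic on the dense class means a fix there automatically applies to the MoE variant.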
New VisionAddOn Lifecycle Method
The `VisionAddOn` has a new `clear_prediction_state` method, which is called by `ModelKit` at the start of `process_prompt`. Qwen 3.5 needs this to reset the `position_ids` and `rope_deltas` on the patched text model. It is generic, so in the future it could be used for other model state that needs to be cleared between sequential generations.

Qwen VL Utils
The `compute_qwen_vl_embeddings` method now returns a named dataclass instead of a tuple for readability, and it returns an additional field, `grid_thw`. The calling vision add-ons have been updated.

E2E Tests
Added the standard `test_qwen3_5_vision`, `test_qwen3_5_text_only`, `test_qwen3_5_moe_vision`, and `test_qwen3_5_moe_text_only` tests. There are a few more for this model to cover issues I hit:

- `test_qwen3_5_vision_then_text_only`: Ensures that the mrope state patched onto the mlx-lm model implementation does not leak across requests. It does this by checking that the text generated for a text-only prompt with deterministic sampling params is the same before and after a vision prompt.
- `test_qwen3_5_multi_image_process_prompt_preserves_image_positions`: Checks that mrope position IDs are applied to all images when there are multiple in a prompt. It does this by finding the image tokens in the prompt and inspecting their position IDs.

Patch Tests
I added tests covering just the patched model implementation to ensure it is functioning correctly.

- `test_qwen3_5_prefill_decode_consistency`: Assert that `model(all_tokens)[-1] == model(tokens[:-1]); model(tokens[-1])` for both text and vision prompts.
- `test_qwen3_5_mrope_chunked_prefill_matches_unchunked`: Assert that the prefill chunk boundary does not affect the result.
- `test_qwen3_5_text_only_uncached_matches_prompt_cache`: Assert that text-only prefill is the same whether a cache is provided or not.
- `test_qwen3_5_text_only_batch_cache_matches_prompt_cache`: Assert that text-only prompts can use a `BatchKVCache` for parallel generation. Note this only works if the model is a non-vision model.
- `test_qwen3_5_text_only_patched_matches_unpatched`: Text-only logits from the patched mlx-lm model must match the unpatched mlx-lm model for both dense and MoE Qwen 3.5 variants.
- `test_qwen3_5_image_prompt_patched_matches_vlm`: Image-prompt logits from the patched mlx-lm model must match the native mlx-vlm `LanguageModel` for both dense and MoE Qwen 3.5 variants.