Production Case Study: Qwen 3.5 VLM on MLX for Healthcare AI #3195
asq-sheriff started this conversation in Show and tell · Replies: 0 comments
We deployed Qwen 3.5-4B on Apple Silicon for elderly care AI using mlx-vlm.
Key findings:
• mlx-vlm is required (not mlx-lm) because of the vision-language model architecture
• ~3x latency improvement over llama.cpp on the DeltaNet architecture (20.7 s → 6.9 s)
• Serial queuing outperformed continuous batching for our conversational workload
• Patched chat template to disable thinking mode by default
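The serial-queuing point above can be sketched as a small wrapper that forces one generation request at a time. This is a minimal illustration, not the actual deployment code: `run_inference` is a hypothetical stand-in for the real mlx-vlm generation call.

```python
import threading


class SerialInferenceQueue:
    """Serialize generation requests so only one runs at a time.

    For short, latency-sensitive conversational turns, waiting for the
    current request to finish can beat the overhead of continuous
    batching, which is tuned for sustained throughput.
    """

    def __init__(self, run_inference):
        # run_inference: callable taking a prompt and returning a reply.
        # Hypothetical stand-in for the mlx-vlm generate call.
        self._run = run_inference
        self._lock = threading.Lock()

    def submit(self, prompt: str) -> str:
        # Only one thread holds the lock, so requests execute strictly
        # one after another in arrival order at the lock.
        with self._lock:
            return self._run(prompt)


# Usage: wrap any single-threaded inference function.
q = SerialInferenceQueue(lambda p: p.upper())
print(q.submit("hello"))  # each call completes before the next starts
```

Whether this wins over batching depends on request arrival rate and turn length; for a one-user-at-a-time conversational workload, the simpler serial path avoided batching overhead entirely.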
This feedback may help with future MLX optimizations for DeltaNet architectures.
Happy to share more details about our deployment.
[Medium Link]