MOLA — multi-LoRA inference server for MLX: load the model once, switch adapters per request #3323
0xbstn started this conversation in Show and tell
On CUDA, multi-LoRA serving already exists (vLLM `--enable-lora`, LoRAX). On MLX, switching LoRA adapters still means reloading the full base model.

MOLA keeps one base model resident in memory and applies LoRA deltas dynamically per request: no weight merging, no model reloads. Each adapter is ~50-200 MB and hot-swappable at runtime.
How it works
The base model weights stay intact. At each forward pass, the active adapter's delta is applied on-the-fly via per-request dispatch:
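A minimal sketch of that idea, with NumPy standing in for MLX arrays; the `LoRALinear` class and the adapter names are illustrative, not MOLA's actual API:

```python
import numpy as np

class LoRALinear:
    """Base weight stays frozen; the adapter delta is applied per request."""

    def __init__(self, W, adapters):
        self.W = W                # frozen base weight, shape (d_in, d_out)
        self.adapters = adapters  # name -> (A, B, scale); A: (d_in, r), B: (r, d_out)

    def forward(self, x, adapter=None):
        y = x @ self.W            # base path, shared by every request
        if adapter is not None:   # per-request dispatch: apply this adapter's delta
            A, B, scale = self.adapters[adapter]
            y = y + scale * ((x @ A) @ B)  # low-rank delta, never merged into W
        return y

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
adapters = {"fr": (rng.standard_normal((8, 2)), rng.standard_normal((2, 8)), 0.5)}
layer = LoRALinear(W, adapters)

x = rng.standard_normal((1, 8))
base = layer.forward(x)                  # request with no adapter
tuned = layer.forward(x, adapter="fr")   # request routed to the "fr" adapter
```

Because `W` is never mutated, any number of adapters can share the same resident base weights, and swapping adapters is just a dictionary lookup.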
For mixed-adapter decode batches (multiple adapters decoding simultaneously), MOLA routes deltas per token row using `mx.gather_mm` with slot-indexed adapter packs.

Benchmark
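The per-row routing can be sketched like this, again with NumPy in place of MLX (`mx.gather_mm` fuses the gather and the matmuls into one kernel; here it is an explicit loop for clarity, and the pack layout is an assumption):

```python
import numpy as np

def gather_mm(x, packs, row_slots):
    """Route each token row through the adapter pack named by its slot index.

    x:         (n_rows, d_in)         decode batch; rows come from different requests
    packs:     (n_slots, d_in, d_out) adapter matrices stacked into fixed slots
    row_slots: (n_rows,)              slot index of the adapter owning each row
    """
    out = np.empty((x.shape[0], packs.shape[2]))
    for i, s in enumerate(row_slots):
        out[i] = x[i] @ packs[s]  # mx.gather_mm does all rows in one fused call
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
packs = rng.standard_normal((2, 8, 8))  # two adapters loaded into slots 0 and 1
row_slots = np.array([0, 1, 1, 0])      # mixed-adapter decode batch
y = gather_mm(x, packs, row_slots)
```

For a full LoRA delta this routing runs twice per layer, once through the stacked A matrices and once through the stacked B matrices, so rows belonging to different adapters can still decode in a single batch.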
Hardware: M5 Max 64 GB
Model: mlx-community/Qwen3.5-9B-MLX-4bit
Adapters loaded: 8
Backend: gather-mm
Batch: 128 / prefill 32

At concurrency 1, the same-adapter and mixed-adapter curves have the same shape. The overhead appears only once requests from different adapters overlap in the decode batch.
Quickstart
Current state (alpha)
GitHub: https://github.com/0xbstn/mola
If you benchmark MOLA on your hardware or have feedback on the approach, happy to hear it.