MOLA — multi-LoRA inference server for MLX: load the model once, switch adapters per request #3323
0xbstn started this conversation in Show and tell
On CUDA, multi-LoRA serving already exists (vLLM `--enable-lora`, LoRAX). On MLX, switching LoRA adapters still means reloading the full base model.

MOLA keeps one base model resident in memory and applies LoRA deltas dynamically per request: no weight merging, no model reloads. Each adapter is ~50-200 MB and hot-swappable at runtime.
How it works
The base model weights stay intact. At each forward pass, the active adapter's delta is applied on-the-fly via per-request dispatch:
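A minimal sketch of that idea, with NumPy standing in for MLX arrays; the `LoRALinear` class and the adapter names are illustrative, not MOLA's actual API:

```python
import numpy as np

class LoRALinear:
    """Base weight stays frozen; the adapter delta is applied per request."""

    def __init__(self, W, adapters):
        self.W = W                # frozen base weight, shape (d_in, d_out)
        self.adapters = adapters  # name -> (A, B, scale); A: (d_in, r), B: (r, d_out)

    def forward(self, x, adapter=None):
        y = x @ self.W            # base path, shared by every request
        if adapter is not None:   # per-request dispatch: apply this adapter's delta
            A, B, scale = self.adapters[adapter]
            y = y + scale * ((x @ A) @ B)  # low-rank delta, never merged into W
        return y

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
adapters = {"fr": (rng.standard_normal((8, 2)), rng.standard_normal((2, 8)), 0.5)}
layer = LoRALinear(W, adapters)

x = rng.standard_normal((1, 8))
base = layer.forward(x)                  # request with no adapter
tuned = layer.forward(x, adapter="fr")   # request routed to the "fr" adapter
```

Because `W` is never mutated, any number of adapters can share the same resident base weights, and swapping adapters is just a dictionary lookup.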
For mixed-adapter decode batches (multiple adapters decoding simultaneously), MOLA routes deltas per token row using `mx.gather_mm` with slot-indexed adapter packs.

Benchmark
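The per-row routing can be sketched like this, again with NumPy in place of MLX (`mx.gather_mm` fuses the gather and the matmuls into one kernel; here it is an explicit loop for clarity, and the pack layout is an assumption):

```python
import numpy as np

def gather_mm(x, packs, row_slots):
    """Route each token row through the adapter pack named by its slot index.

    x:         (n_rows, d_in)         decode batch; rows come from different requests
    packs:     (n_slots, d_in, d_out) adapter matrices stacked into fixed slots
    row_slots: (n_rows,)              slot index of the adapter owning each row
    """
    out = np.empty((x.shape[0], packs.shape[2]))
    for i, s in enumerate(row_slots):
        out[i] = x[i] @ packs[s]  # mx.gather_mm does all rows in one fused call
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
packs = rng.standard_normal((2, 8, 8))  # two adapters loaded into slots 0 and 1
row_slots = np.array([0, 1, 1, 0])      # mixed-adapter decode batch
y = gather_mm(x, packs, row_slots)
```

For a full LoRA delta this routing runs twice per layer, once through the stacked A matrices and once through the stacked B matrices, so rows belonging to different adapters can still decode in a single batch.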
Hardware: M5 Max 64 GB
Model: mlx-community/Qwen3.5-9B-MLX-4bit
Adapters loaded: 8
Backend: gather-mm
Batch: 128 / prefill 32

At concurrency 1, the same-adapter and mixed-adapter curves have the same shape. The overhead appears only once requests from different adapters overlap in the decode batch.
Quickstart
Current state (alpha)
GitHub: https://github.com/0xbstn/mola
If you benchmark MOLA on your hardware or have feedback on the approach, happy to hear it.