# MOLA: Multi-adapter Orchestration for LoRA on Apple Silicon
Multi-LoRA inference server for Apple Silicon: one base model, many adapters, no reload.
MOLA serves multiple LoRA adapters from one MLX base model on Apple Silicon. The base model stays resident in memory, adapters are selected per request, and same-adapter traffic is batched automatically.
Status: Alpha. The published benchmark below uses `mlx-community/Qwen3.5-9B-MLX-4bit` with 8 resident adapters on `mlx-lm` 0.31.1.
| Approach | Runtime shape |
|---|---|
| Separate fine-tuned models | one full model per specialty, higher memory, reloads when switching |
| MOLA | one base model plus many LoRA adapters, lower memory, no model reloads |
This is the practical tradeoff MOLA is built for: keep one base model resident, switch adapters per request, and avoid reloading full fine-tuned checkpoints.
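To make the tradeoff concrete, here is a back-of-the-envelope memory sketch. The sizes are illustrative assumptions (roughly 5 GB for a 4-bit ~9B base model, tens of MB per LoRA adapter), not measured MOLA numbers:

```python
# Illustrative, assumed sizes -- not measured MOLA numbers.
BASE_GB = 5.0      # assumed resident size of one 4-bit ~9B base model
ADAPTER_GB = 0.05  # assumed size of one LoRA adapter
N_SPECIALTIES = 8

separate_gb = N_SPECIALTIES * BASE_GB           # one full fine-tuned model per specialty
mola_gb = BASE_GB + N_SPECIALTIES * ADAPTER_GB  # one resident base plus N adapters

print(f"separate models: {separate_gb:.1f} GB, MOLA: {mola_gb:.1f} GB")
```

Under these assumptions, eight separate fine-tuned checkpoints cost roughly eight times the base-model footprint, while MOLA pays for the base model once plus a small per-adapter increment.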
- OpenAI-compatible chat completions API
- Per-request adapter selection via the `model` field
- Multiple LoRA adapters loaded at once on one base model
- Same-adapter batching and stable mixed-adapter serving
- Hot-load and hot-unload adapters at runtime
- Streaming responses
- Runtime metrics via `/v1/engine/metrics`
## Quick start

```shell
git clone https://github.com/0xbstn/mola.git
cd mola
python3 -m venv .venv
./.venv/bin/python -m ensurepip --upgrade
./.venv/bin/python -m pip install -e ".[dev]"
./.venv/bin/python devtools/apply_mlx_lm_detached_batch_api.py
```

Start the recommended runtime profile:
```shell
./.venv/bin/python devtools/run_mola_current_architecture.py start \
  --model mlx-community/Qwen3.5-9B-MLX-4bit \
  --adapter rust ./path/to/rust-adapter \
  --adapter sql ./path/to/sql-adapter \
  --port 8000
```

Equivalent explicit command:
```shell
./.venv/bin/python -m mola.cli -v serve \
  --model mlx-community/Qwen3.5-9B-MLX-4bit \
  --adapter rust ./path/to/rust-adapter \
  --adapter sql ./path/to/sql-adapter \
  --adapter support ./path/to/support-adapter \
  --max-inflight-tokens 131072 \
  --max-batch-size 128 \
  --prefill-batch-size 32 \
  --enable-routed-decode-reference \
  --strict-routed-decode-reference \
  --routed-decode-backend gather-mm \
  --enable-mixed-decode-migration \
  --prestep-mixed-decode-migration \
  --cache-routed-decode-sessions \
  --detached-shared-decode-owner \
  --port 8000
```

Stop or restart:
```shell
./.venv/bin/python devtools/run_mola_current_architecture.py stop --port 8000
./.venv/bin/python devtools/run_mola_current_architecture.py restart --port 8000
```

The `model` field is a strict adapter selector; unknown adapter names return a 404.
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rust",
    "messages": [{"role": "user", "content": "Implement a lock-free queue with crossbeam"}],
    "stream": true
  }'
```

List loaded adapters:

```shell
curl http://localhost:8000/v1/adapters
```
Hot-load and unload adapters at runtime:

```shell
curl -X POST http://localhost:8000/v1/adapters \
  -H "Content-Type: application/json" \
  -d '{"name": "medical", "path": "./path/to/medical-adapter"}'
curl -X DELETE http://localhost:8000/v1/adapters/medical
```

## API

- `POST /v1/chat/completions`: OpenAI-compatible chat completions
- `GET /v1/models`: base model and loaded adapters
- `GET /v1/adapters`: loaded adapter metadata
- `POST /v1/adapters`: hot-load an adapter
- `DELETE /v1/adapters/{name}`: unload an adapter
- `GET /v1/engine/metrics`: runtime counters
- `GET /health`: health and model summary
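Because the endpoints mirror the OpenAI chat-completions schema, a client only needs the base URL and an adapter name in `model`. A minimal sketch of how a request body is shaped (the helper `chat_request` is illustrative, not part of MOLA):

```python
import json

def chat_request(adapter: str, prompt: str, stream: bool = False) -> dict:
    """Build a JSON body for POST /v1/chat/completions.

    The model field selects the adapter; unknown names get a 404.
    """
    return {
        "model": adapter,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = chat_request("sql", "Write a window-function query", stream=True)
print(json.dumps(body))
```

Any OpenAI-compatible SDK should also work by pointing its base URL at `http://localhost:8000/v1`.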
## Benchmark

Reproduce the published benchmark with:

```shell
./.venv/bin/python scripts/bench_server.py \
  --routed-validation \
  --concurrency 1,16,64 \
  --repeats 3 \
  --json-out /tmp/mola-qwen35-9b-current-architecture-bench-1-16-64.json
```

Published profile:
- model: `mlx-community/Qwen3.5-9B-MLX-4bit`
- adapters: `rust`, `sql`, `medical`, `cyber`, `solidity`, `devops`, `math`, `legal`
- backend: `gather-mm`
- batch sizes: 128 / 32 (max batch / prefill)
- machine: Apple M5 Max, 64 GB
This benchmark shows how much performance MOLA keeps when traffic moves from one adapter at a time to a real mixed multi-adapter workload.
| Concurrency | Same tok/s | Mixed tok/s | Multi-LoRA overhead | Mixed p95 |
|---|---|---|---|---|
| 1 | 76.4 | 76.4 | 0% | 843 ms |
| 16 | 308.8 | 241.4 | -22% | 4220 ms |
| 64 | 732.3 | 555.5 | -24% | 7372 ms |
At concurrency 1, same and mixed are effectively the same shape; the useful signal starts once requests overlap.
At moderate to high load, mixed multi-adapter traffic adds about 22-24% throughput overhead relative to same-adapter traffic.
From the same run:
- `long-decode-mixed` tok/s: 81.1 at concurrency 1, 283.4 at 16, 691.1 at 64
- hot/cold skew mix tok/s: 77.4 at concurrency 1, 241.3 at 16, 559.1 at 64
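The overhead column is simply the relative throughput change from same-adapter to mixed traffic; a quick check against the table above:

```python
def overhead_pct(same_tps: float, mixed_tps: float) -> float:
    """Relative throughput change of mixed vs same-adapter traffic, in percent."""
    return (mixed_tps - same_tps) / same_tps * 100

# (concurrency, same tok/s, mixed tok/s) from the benchmark table
for conc, same, mixed in [(1, 76.4, 76.4), (16, 308.8, 241.4), (64, 732.3, 555.5)]:
    print(f"concurrency {conc}: {overhead_pct(same, mixed):.0f}%")
```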
If you benchmark MOLA on another Apple Silicon machine or model, feel free to open an issue with your hardware, model, and results.
## Adapter format

MOLA loads standard PEFT / mlx-lm adapter directories:

```
my-adapter/
├── adapter_config.json
└── adapters.safetensors
```
Train adapters with mlx-lm, mlx-tune, or any tool that outputs PEFT-compatible safetensors.
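Before pointing the server (or the hot-load endpoint) at a directory, it can be sanity-checked for the two expected files; a small sketch, where the helper name is illustrative:

```python
from pathlib import Path

REQUIRED_FILES = ("adapter_config.json", "adapters.safetensors")

def looks_like_adapter_dir(path: str) -> bool:
    """Return True if path contains the files a PEFT / mlx-lm adapter directory needs."""
    p = Path(path)
    return p.is_dir() and all((p / name).is_file() for name in REQUIRED_FILES)
```

For example, `looks_like_adapter_dir("./path/to/rust-adapter")` should be true for any directory MOLA can load.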
## Requirements

- Apple Silicon Mac (M1 or later)
- Python 3.11+
- macOS 13+
- `mlx-lm` 0.31.1 recommended
## Limitations

- Alpha release
- Apple Silicon only
- A local `mlx-lm` patch is still required for the recommended setup
- Switching adapters inside one conversation invalidates KV cache reuse
- Mixed prefill and deeper KV/adapter residency management are still open problems
## License

Apache 2.0