Unified LLM server with intelligent routing based on model ID. Optimized for Apple Silicon (M1/M2/M3) with MLX backend support.
```
Client Request (port 8000)
        ↓
Routing Service (FastAPI, port 8000)
        ↓  (reads model ID from request body)
Backend Model Servers (MLX/llama.cpp on ports 8500, 8501, 8502, ...)
```
- Apple Silicon optimized: MLX backend leverages Apple's Metal Performance Shaders for fast inference on M1/M2/M3 chips
- Single endpoint: All requests go to `http://localhost:8000/v1`
- Model-based routing: Extracts the model ID from the request body and routes to the correct backend
- Multi-backend support: MLX (Apple Silicon) and llama.cpp (cross-platform)
- Port per model: Each model runs on its own port (configured in `config/models.yaml`)
- OpenAI-compatible: Full compatibility with the OpenAI API format
- macOS with Apple Silicon (M1, M2, M3, or later): recommended for the MLX backend
- Python 3.12+
- uv package manager
Note: While llama.cpp backend works cross-platform, the MLX backend (recommended) requires Apple Silicon for optimal performance.
- Clone and install dependencies:

  ```bash
  git clone <repository-url>
  cd slm_server
  uv sync
  ```

- Install backend dependencies (optional, only what you need):

  ```bash
  uv sync --extra mlx       # For MLX backend
  uv sync --extra llamacpp  # For llama.cpp backend
  ```

- Set up configuration files:

  ```bash
  # Copy configuration template
  cp config/models.yaml.example config/models.yaml
  # Edit config/models.yaml with your model paths
  ```

- Configure models (edit `config/models.yaml`):
  ```yaml
  models:
    router:
      id: "qwen/qwen3-1.7b"
      backend: "mlx"  # or "llamacpp"
      port: 8500
      context_length: 32768
      quantization: "8bit"
      max_concurrency: 4
      default_timeout: 10
      supports_function_calling: false
      model_path: "mlx-community/LFM2.5-1.2B-Instruct-8bit"  # Hugging Face model ID (auto-downloads)
      # Or use a local path: "~/.cache/huggingface/hub/models--mlx-community--LFM2.5-1.2B-Instruct-8bit"
  ```

- Start all services (recommended):

  ```bash
  ./start.sh
  ```

  For detailed setup instructions, see SETUP.md.
Or start services individually:

```bash
# Terminal 1: Start backend servers
uv run python -m slm_server backends

# Terminal 2: Start routing service
uv run python -m slm_server router
```

See `config/models.yaml.example` for a template. Copy it to `config/models.yaml` and configure your models.
Each model needs:
- `id`: Model identifier (used for routing; must match the `model` field in the request body)
- `backend`: Backend type (`mlx` or `llamacpp`)
- `port`: Port number for this model's server (must be unique per model)
- `context_length`, `quantization`, `max_concurrency`, `default_timeout`: Model-specific settings
- `model_path`: Path to a model file or a Hugging Face model ID
You can specify models in two ways:

- Hugging Face model ID (recommended): The server automatically downloads the model from Hugging Face on first use and caches it in `~/.cache/huggingface/hub/` (see the pre-download sketch below).

  ```yaml
  model_path: "mlx-community/LFM2.5-1.2B-Instruct-8bit"
  ```

- Local file path: Point directly to a model directory or file on your system. You can download models manually from Hugging Face or use any local directory containing the model.

  ```yaml
  model_path: "~/.cache/huggingface/hub/models--mlx-community--LFM2.5-1.2B-Instruct-8bit"
  ```
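If you want to fetch a model ahead of time (for example, before running offline), a short pre-download script works too. This is a minimal sketch assuming the `huggingface_hub` package is available in your environment; the repo ID is just the example from the config above.

```python
# Pre-download a model so the first request doesn't block on a download.
# Assumes the huggingface_hub package is installed in this environment.
from huggingface_hub import snapshot_download

# Downloads (or reuses) the snapshot under ~/.cache/huggingface/hub/
local_dir = snapshot_download(repo_id="mlx-community/LFM2.5-1.2B-Instruct-8bit")
print(f"Model cached at: {local_dir}")
```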
The routing service exposes OpenAI-compatible endpoints:
`POST /v1/chat/completions`: Standard chat completions endpoint. The request body must include a `model` field:

```json
{
  "model": "qwen/qwen3-1.7b",
  "messages": [{"role": "user", "content": "Hello!"}]
}
```

`POST /v1/responses`: Responses API endpoint with automatic fallback. If the backend doesn't support `/v1/responses` (returns 404), the router automatically converts the request to the `/v1/chat/completions` format. This provides compatibility with MLX and llama.cpp backends while maintaining LM Studio compatibility.
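To sanity-check the chat completions route end to end, a minimal Python client is enough. This is only a sketch using the `requests` package (any OpenAI-compatible client works as well); it assumes the router is running on port 8000 and that the backend returns the standard OpenAI response shape.

```python
import requests

# Send a chat completion through the router; "model" must match an `id`
# defined in config/models.yaml so the request can be routed.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen/qwen3-1.7b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```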
`GET /v1/models`: Lists all available models and their configurations.
`GET /v1/backends/health`: Checks the health status of all configured backend servers. Returns a status for each model:

- `healthy`: Backend responding normally
- `unreachable`: Backend not running (connection refused)
- `timeout`: Backend not responding
- `disabled`: Model disabled in config
Example response:

```json
{
  "router": {
    "status": "healthy",
    "model_id": "qwen/qwen3-1.7b",
    "backend": "mlx",
    "port": 8500
  },
  "standard": {
    "status": "unreachable",
    "error": "Connection refused - backend not running"
  }
}
```

The router also exposes a health check endpoint for itself.
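For scripted monitoring, the same backends health endpoint can be polled from Python and summarized per model. A small sketch, again assuming the `requests` package and the response shape shown above:

```python
import requests

# Fetch aggregate backend health from the router.
health = requests.get("http://localhost:8000/v1/backends/health", timeout=10).json()

# Print one line per configured model, flagging anything that isn't healthy.
for model_name, info in health.items():
    status = info.get("status", "unknown")
    marker = "OK" if status == "healthy" else "!!"
    print(f"{marker} {model_name}: {status}", info.get("error", ""))
```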
- Client sends a request to `http://localhost:8000/v1/chat/completions` with the model ID in the body
- Routing service (port 8000) extracts the `model` field from the JSON body
- Router looks up the model in `config/models.yaml` to find its backend and port
- Router forwards the request to the correct backend server (e.g., `http://localhost:8500/v1/chat/completions`)
- Backend processes the request and returns a response
- Router forwards the response back to the client
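Conceptually, the forwarding step in the flow above boils down to "read the `model` field, look up its port, proxy the request". The sketch below is illustrative only (hypothetical names, not the actual slm_server source) and uses FastAPI plus httpx; streaming and error handling are omitted.

```python
# Illustrative routing sketch (not the actual slm_server implementation).
import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

# In the real service this mapping would be loaded from config/models.yaml.
MODEL_PORTS = {"qwen/qwen3-1.7b": 8500}

@app.post("/v1/chat/completions")
async def route_chat_completions(request: Request):
    body = await request.json()
    port = MODEL_PORTS.get(body.get("model"))
    if port is None:
        raise HTTPException(status_code=404, detail="Unknown model ID")
    # Forward the original body to the backend that serves this model.
    async with httpx.AsyncClient(timeout=None) as client:
        backend_resp = await client.post(
            f"http://localhost:{port}/v1/chat/completions", json=body
        )
    return backend_resp.json()
```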
Each model must have its own unique port. The router uses the port from `config/models.yaml` to route requests. Example:

- `router` model → port 8500 (MLX)
- `standard` model → port 8501 (MLX)
- `reasoning` model → port 8502 (MLX)
- `coding` model → port 8503 (MLX)
The routing service runs on port 8000.
- Install: `uv sync --extra mlx`
- Requires: the `mlx-openai-server` command
- Supports Hugging Face model IDs (auto-downloads) or local model paths
- Optimized for Apple Silicon: uses Metal Performance Shaders for GPU acceleration
- Best performance on M1/M2/M3 Macs
- Install: `uv sync --extra llamacpp`
- Requires: the `llama-cpp-python[server]` package
- Supports Hugging Face model IDs (auto-downloads) or local `.gguf` files
- Cross-platform support
First, check if all backends are running:
```bash
curl http://localhost:8000/v1/backends/health | jq
```

This will show the status of each backend server and help identify issues.
- Check that the model ID in the request matches the `id` field in `config/models.yaml`
- Verify the model file exists at the specified path, or use a Hugging Face model ID for auto-download
- Check config validation warnings on startup
- For Hugging Face models, ensure you have internet access for the first download
- Each model needs a unique port
- Check what's using the port: `lsof -i :8500`
- Change the port in `config/models.yaml`
- Config validation will warn about port conflicts on startup (a standalone check is sketched below)
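If you want to check for conflicts without starting the router, a standalone script can scan the config directly. A minimal sketch, assuming PyYAML is installed and the config follows the schema shown earlier (`models:` mapping names to settings with a `port` key):

```python
# Report duplicate ports in config/models.yaml.
from collections import defaultdict
import yaml

with open("config/models.yaml") as f:
    config = yaml.safe_load(f)

ports = defaultdict(list)
for name, settings in (config.get("models") or {}).items():
    ports[settings.get("port")].append(name)

for port, names in ports.items():
    if len(names) > 1:
        print(f"Port conflict on {port}: {', '.join(names)}")
```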
- Check the `/v1/backends/health` endpoint to see which backends are down
- Verify backend dependencies are installed: `uv sync --extra mlx` or `uv sync --extra llamacpp`
- Check logs for error messages
- Ensure model paths are correct and files exist
- For Hugging Face models, ensure you have internet access and sufficient disk space
- Backend server may not be running
- Check `/v1/backends/health` to verify backend status
- Restart backend servers: `uv run python -m slm_server backends`
- Verify the firewall isn't blocking local connections