
Multi-Backend LLM Router v4.0.0

Production-ready router supporting SGLang (AWQ), llama.cpp (GGUF), and TabbyAPI (EXL2) with automatic model switching through a unified OpenAI-compatible API.

Features

  • Three-Backend Support: Seamlessly switch between GGUF, AWQ, and EXL2 models
  • Automatic Model Switching: Router handles backend lifecycle
  • Systemd Management: Reliable service control with proper monitoring
  • Health Monitoring: Intelligent checks with actual inference validation
  • Streaming Support: Full streaming for all backends
  • Performance Metrics: Real-time tokens/sec display after each response
  • OpenAI Compatible: Drop-in replacement for OpenAI API
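
Because the router exposes the OpenAI API, most clients only need their base URL repointed. A minimal sketch using the environment variables honored by recent OpenAI SDKs (the key is a placeholder; set a real one only if your setup enforces auth):

export OPENAI_BASE_URL="http://localhost:8002/v1"
export OPENAI_API_KEY="unused"  # placeholder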

Quick Start

# 1. Clone repository
git clone https://github.com/darkmaniac7/LLM-Model-Router.git
cd LLM-Model-Router

# 2. Install router
sudo mkdir -p /opt/llm-router
sudo cp router.py /opt/llm-router/
sudo python3 -m venv /opt/llm-router/venv
sudo /opt/llm-router/venv/bin/pip install fastapi uvicorn httpx pyyaml

# 3. Configure
sudo cp config/config.json.example /opt/llm-router/config.json
# Edit /opt/llm-router/config.json with your model paths

# 4. Install service
sudo cp systemd/llm-router.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llm-router.service

# 5. Test
curl http://localhost:8002/v1/models
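
A healthy router answers with an OpenAI-style model list built from config.json. Illustrative output (the id values are the example model names used later in this README; exact fields may vary):

{
  "object": "list",
  "data": [
    {"id": "my-llama-70b", "object": "model"},
    {"id": "deepseek-awq", "object": "model"}
  ]
}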

Documentation

Backends

Backend     Format   Best For                           Memory
SGLang      AWQ      Fast inference, high throughput    Medium-High
llama.cpp   GGUF     CPU/GPU hybrid, flexibility        Low-Medium
TabbyAPI    EXL2     Maximum quality, NVIDIA only       High

Requirements

  • Ubuntu 22.04+ or compatible Linux
  • NVIDIA GPU with CUDA 12.4+
  • Python 3.10+
  • 64GB+ RAM (for 70B models)
  • Root access for systemd services

Usage

List Models

curl http://localhost:8002/v1/models

Chat Completion

curl -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
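
Streaming uses the same endpoint with the standard stream flag; the router relays the event stream from whichever backend is active. For example (curl -N disables output buffering):

curl -N -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "stream": true
  }'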

Health Check

curl http://localhost:8002/health
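
When scripting restarts, it is handy to poll this endpoint until the router is ready (a small sketch assuming /health returns HTTP 200 once healthy):

until curl -sf http://localhost:8002/health >/dev/null; do sleep 2; done
echo "router healthy"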

Performance Metrics

The router automatically displays performance stats after each response:

Your response text here...

⚡ 45.2 tok/s (180 tokens in 4.0s)

🛠️ Installation Scripts

Interactive Installer (install.sh)

The included installer script automates the entire setup process:

sudo ./install.sh

What it does:

  1. Detects Python - Finds Python 3.10+ automatically
  2. Prompts for Configuration:
    • Install directory (default: /opt/llm-router)
    • Router port (default: 8002)
    • Backend hosts and ports (SGLang, llama.cpp, TabbyAPI)
    • TabbyAPI paths (install dir, model directory)
  3. Creates Virtual Environment - Installs all Python dependencies
  4. Generates Config - Creates /opt/llm-router/config.json with your settings
  5. Installs Systemd Service - Sets up llm-router.service with proper environment variables
  6. Copies Management Script - Installs manage-models.sh for easy model management

Example Prompts:

Install directory [/opt/llm-router]: 
Router port [8002]: 
SGLang port [30000] (empty to disable): 
llama.cpp port [8085] (empty to disable): 
TabbyAPI port [5000] (empty to disable): 
TabbyAPI install directory [/opt/TabbyAPI]: 
Model directory [/opt/models]: 

The installer creates a complete, ready-to-run setup with all paths configured correctly.

Model Management Script (manage-models.sh)

Easy model management without manual JSON editing:

# Interactive menu
/opt/llm-router/manage-models.sh

# Or use directly
/opt/llm-router/manage-models.sh list    # List all configured models
/opt/llm-router/manage-models.sh add     # Add a new model
/opt/llm-router/manage-models.sh remove  # Remove a model

Add a Model (Interactive):

  1. Choose name: my-llama-70b
  2. Select backend:
    • 1) llama.cpp (for GGUF models)
    • 2) sglang (for AWQ models)
    • 3) tabbyapi (for EXL2 models)
  3. Enter path: /opt/models/gguf/llama-70b-q4.gguf
  4. Restart router: y/n

The script automatically:

  • Updates /opt/llm-router/config.json
  • Validates JSON format
  • Offers to restart the router service
  • Shows current models with list command

Example Output:

Current Models:

  • my-llama-70b
    Backend: llamacpp
    Path: /opt/models/gguf/llama-70b-q4.gguf

  • deepseek-awq
    Backend: sglang
    Path: /opt/models/awq/DeepSeek-R1-70B
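
Each entry the script writes follows the models schema shown in the Configuration section below; the my-llama-70b entry above, for example, corresponds to:

"my-llama-70b": {
  "backend": "llamacpp",
  "model_path": "/opt/models/gguf/llama-70b-q4.gguf"
}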

Configuration

Router Config (/opt/llm-router/config.json)

{
  "router_port": 8002,
  "model_load_timeout": 300,
  "backends": {
    "sglang": {"port": 30000, "host": "localhost"},
    "llamacpp": {"port": 8085, "host": "localhost"},
    "tabbyapi": {"port": 5000, "host": "localhost"}
  },
  "models": {
    "your-gguf-model": {
      "backend": "llamacpp",
      "model_path": "/path/to/model.gguf"
    },
    "your-awq-model": {
      "backend": "sglang",
      "model_path": "/path/to/awq-model"
    },
    "your-exl2-model": {
      "backend": "tabbyapi",
      "model_path": "exl2/ModelName"
    }
  }
}

Note: TabbyAPI uses model_dir + model_name format. The router path should be the subdirectory only (e.g., "exl2/ModelName").
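
As a concrete sketch, assume TabbyAPI's config.yml sets model_dir to /opt/models; the pieces then combine like this:

# Router config.json model_path:  "exl2/ModelName"
# TabbyAPI model_dir:             /opt/models
# Model loaded from:              /opt/models/exl2/ModelName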

Monitoring

# Check router
sudo systemctl status llm-router.service

# Check backends
sudo systemctl status sglang.service
sudo systemctl status llamacpp.service
sudo systemctl status tabbyapi.service

# View logs
sudo journalctl -u llm-router.service -f

# GPU usage
nvidia-smi

Troubleshooting

Router not responding

# Check if running
sudo systemctl status llm-router.service

# View logs
sudo journalctl -u llm-router.service -n 50

# Restart
sudo systemctl restart llm-router.service

Backend not loading

# Check backend status
sudo systemctl status tabbyapi.service

# Test backend directly
curl http://localhost:5000/health  # TabbyAPI
curl http://localhost:30000/health # SGLang
curl http://localhost:8085/health  # llama.cpp

Model switching fails

  • Check model_load_timeout in config.json (default: 300 seconds)
  • Verify model paths are correct
  • Check that the GPU has sufficient free memory; see the nvidia-smi query below
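
For a quick free-VRAM check before switching to a larger model:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv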

TabbyAPI auth errors

The router automatically reads api_tokens.yml. Ensure it exists:

cat $TABBY_TOKENS_PATH
# Should have both admin_key and api_key
# Default path: /opt/TabbyAPI/api_tokens.yml

Advanced

Multiple GPUs

  • TabbyAPI: Set gpu_split_auto: true in config.yml
  • SGLang: Use --tensor-parallel-size N
  • llama.cpp: Use -ngl 999 to offload all layers
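
Illustrative launch commands for a tensor-parallel / fully offloaded setup (model paths reuse this README's examples; consult each backend's docs for the full flag set):

python -m sglang.launch_server --model-path /opt/models/awq/DeepSeek-R1-70B \
  --tensor-parallel-size 2 --port 30000
llama-server -m /opt/models/gguf/llama-70b-q4.gguf -ngl 999 --port 8085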

Custom Parameters

Edit backend startup commands in systemd service files.
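
A safe way to do this is a systemd drop-in override rather than editing unit files in place; a sketch for the llama.cpp service (binary path and flags are illustrative):

sudo systemctl edit llamacpp.service
# In the editor, clear ExecStart and redefine it with your flags:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/local/bin/llama-server -m /path/to/model.gguf -ngl 999 --ctx-size 8192
sudo systemctl daemon-reload
sudo systemctl restart llamacpp.service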

Blackwell GPU Support

Tested on NVIDIA RTX PRO 6000 Blackwell. Use these build flags:

  • SGLang: TORCH_CUDA_ARCH_LIST="8.9;9.0"
  • llama.cpp: CMAKE_CUDA_ARCHITECTURES="89;90"
  • TabbyAPI: Use pre-built venv with flash-attn (see docs)

Architecture

Client → Router (8002)
           ↓
    ┌──────┴──────┐
    ↓      ↓      ↓
 SGLang  llama  TabbyAPI
 (30000) (8085) (5000)
   AWQ    GGUF    EXL2
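
Switching is driven entirely by the model field in each request: the first request that names a model on a different backend triggers a swap, so it can take up to model_load_timeout to answer. For example, sending these two requests back to back (model names from the examples above) would make the router swap from llama.cpp to TabbyAPI:

curl -s http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-llama-70b", "messages": [{"role": "user", "content": "Hi"}]}'
curl -s http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-exl2-model", "messages": [{"role": "user", "content": "Hi"}]}'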

Contributing

Contributions welcome! Please open an issue or PR on GitHub.

License

MIT License

Version: 4.0.0
Status: Production Ready ✅
Tested: NVIDIA Blackwell GPUs
Last Updated: October 26, 2025
