This project implements an Adaptive Compute system for LLMs on Apple Silicon using the MLX framework.
It intelligently routes user queries between a small model (1B) and a large model (8B) to optimize for both latency and intelligence. For complex tasks, it leverages Speculative Decoding to accelerate the 8B model's generation speed.
- **Smart Routing (Adaptive Compute)**
  - Uses Llama-3.2-1B as a router to classify user intent.
  - Simple queries: handled instantly by the 1B model (Fast Mode).
  - Complex queries: routed to the 8B model (Expert Mode).
- **Speculative Decoding (Speedup)**
  - When the 8B model is active, the 1B model acts as a "drafter".
  - This yields a ~1.5x to 1.7x speed boost over running the 8B model alone, with no loss in quality.
- **Hybrid Guardrails**
  - Combines LLM-based routing with keyword-based safety checks to ensure coding tasks are always handled by the Expert model.
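The hybrid guardrail described above can be sketched as a small routing function. This is illustrative only: the function name, keyword list, and `llm_says_simple` flag are assumptions, not the project's actual code.

```python
# Minimal sketch of hybrid routing: a keyword guardrail overrides the
# LLM router so coding tasks always reach the Expert (8B) model.
# The keyword list is a hypothetical example, not the project's real list.
CODING_KEYWORDS = {"code", "python", "function", "debug", "implement", "script"}

def route(query: str, llm_says_simple: bool) -> str:
    """Return "fast" (1B) or "expert" (8B + speculative decoding)."""
    lowered = query.lower()
    if any(kw in lowered for kw in CODING_KEYWORDS):
        return "expert"  # guardrail: coding always goes to the 8B model
    return "fast" if llm_says_simple else "expert"
```

For example, `route("write a python function to sort a list", llm_says_simple=True)` returns `"expert"` even though the LLM router judged the query simple, because the keyword check fires first.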
```mermaid
graph TD
    A[User Input] --> B[Router / 1B Model]
    B -->|Classify Intent| C{Is Simple?}
    C -- Yes --> D[Fast Mode: 1B Generates Answer]
    C -- No --> E[Expert Mode: 8B + 1B Speculative]
    B -.->|Drafting Support| E
    D --> F[Final Output]
    E --> F
```
- Hardware: Apple Silicon Mac (M1/M2/M3/M4)
- Memory: minimum 16 GB unified memory recommended
- Software: Python 3.11+, `mlx-lm`
- Clone this repository:

  ```shell
  git clone https://github.com/uqer1244/MLX-Smart-Router-Speculative-Decoding.git
  cd MLX-Smart-Router-Speculative-Decoding
  ```
- Install dependencies:

  ```shell
  pip install mlx-lm
  ```
- Download models: use the converted MLX models locally, or pass the Hugging Face paths directly.
  - Expert: `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit`
  - Router/Drafter: `mlx-community/Llama-3.2-1B-Instruct-4bit`
Run the following commands to download the quantized models locally:

```shell
# Create model directory
mkdir -p mlx_models && cd mlx_models

# Download Expert (8B)
git clone https://huggingface.co/mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Download Router (1B)
git clone https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit
```
Update the model paths in `test_speculative.py` and run:

```shell
python test_speculative.py
```
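After the `git clone` steps above, the local paths look like the constants below. The variable names are hypothetical; match them to whatever identifiers `test_speculative.py` actually uses.

```python
# Hypothetical path constants for test_speculative.py -- the names are
# illustrative, but the paths match the download layout described above.
EXPERT_MODEL = "mlx_models/Meta-Llama-3.1-8B-Instruct-4bit"  # 8B expert
DRAFT_MODEL = "mlx_models/Llama-3.2-1B-Instruct-4bit"        # 1B router/drafter
```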
| Mode | Task | Speed (tokens/s) | Note |
|---|---|---|---|
| Fast Mode | Simple Chat | 100+ | Handled by the 1B model |
| Standard | Coding | ~30 | 8B only |
| Speculative | Coding | ~48 | 8B + 1B draft |
Note: Speculative decoding achieves approximately 1.6x speedup on coding tasks by leveraging the 1B model's ability to predict common syntax patterns.
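Where the ~1.6x figure comes from can be shown with a back-of-the-envelope simulation of the draft-and-verify loop. All numbers here (per-token costs, acceptance probability, draft length) are assumed for illustration, not measurements from this project.

```python
import random

def simulate_speedup(accept_p=0.7, k=4, draft_cost=1.0, target_cost=8.0,
                     n_tokens=10_000, seed=0):
    """Toy model of speculative decoding throughput.

    Each round the cheap drafter proposes up to k tokens (draft_cost each);
    one expensive target forward pass (target_cost) verifies them, keeping
    an accepted prefix and always emitting one token of its own.
    Returns the speedup over the target model generating every token alone.
    """
    rng = random.Random(seed)
    produced, cost = 0, 0.0
    while produced < n_tokens:
        accepted = 0
        while accepted < k and rng.random() < accept_p:
            accepted += 1
        produced += accepted + 1          # accepted drafts + 1 target token
        cost += k * draft_cost + target_cost
    baseline = n_tokens * target_cost     # target generating solo
    return baseline / cost
```

With a high acceptance rate the simulation lands in the same 1.5x-1.9x range the project reports; dropping `accept_p` toward zero makes drafting pure overhead and the "speedup" falls below 1x, which is why acceptance-friendly workloads like boilerplate-heavy code benefit most.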
This project is licensed under the MIT License.