This project implements an Adaptive Compute system for LLMs on Apple Silicon using the MLX framework.
It intelligently routes user queries between a small model (1B) and a large model (8B) to optimize for both latency and intelligence. For complex tasks, it leverages Speculative Decoding to accelerate the 8B model's generation speed.
- **Smart Routing (Adaptive Compute)**
  - Uses Llama-3.2-1B as a router to classify user intent.
  - Simple queries: handled instantly by the 1B model (Fast Mode).
  - Complex queries: routed to the 8B model (Expert Mode).
- **Speculative Decoding (Speedup)**
  - When the 8B model is active, the 1B model acts as a "drafter".
  - This yields a ~1.5x to 1.7x speed boost over running the 8B model alone, with no loss in quality.
- **Hybrid Guardrails**
  - Combines LLM-based routing with keyword-based safety checks to ensure coding tasks are always handled by the Expert model.
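The hybrid guardrail described above can be sketched as a small routing function. This is illustrative only: the function name, keyword list, and `llm_says_simple` flag are assumptions, not the project's actual code.

```python
# Minimal sketch of hybrid routing: a keyword guardrail overrides the
# LLM router so coding tasks always reach the Expert (8B) model.
# The keyword list is a hypothetical example, not the project's real list.
CODING_KEYWORDS = {"code", "python", "function", "debug", "implement", "script"}

def route(query: str, llm_says_simple: bool) -> str:
    """Return "fast" (1B) or "expert" (8B + speculative decoding)."""
    lowered = query.lower()
    if any(kw in lowered for kw in CODING_KEYWORDS):
        return "expert"  # guardrail: coding always goes to the 8B model
    return "fast" if llm_says_simple else "expert"
```

For example, `route("write a python function to sort a list", llm_says_simple=True)` returns `"expert"` even though the LLM router judged the query simple, because the keyword check fires first.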
```mermaid
graph TD
    A[User Input] --> B[Router / 1B Model]
    B -->|Classify Intent| C{Is Simple?}
    C -- Yes --> D[Fast Mode: 1B Generates Answer]
    C -- No --> E[Expert Mode: 8B + 1B Speculative]
    B -.->|Drafting Support| E
    D --> F[Final Output]
    E --> F
```
- Hardware: Apple Silicon Mac (M1/M2/M3/M4)
- Memory: minimum 16 GB unified memory recommended
- Software: Python 3.11+, `mlx-lm`
- Clone this repository:

  ```shell
  git clone https://github.com/uqer1244/MLX-Smart-Router-Speculative-Decoding.git
  cd MLX-Smart-Router-Speculative-Decoding
  ```
- Install dependencies:

  ```shell
  pip install mlx-lm
  ```
- Download models: use the converted MLX models locally, or pass the Hugging Face paths directly.
  - Expert: `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit`
  - Router/Drafter: `mlx-community/Llama-3.2-1B-Instruct-4bit`
Run the following commands to download the quantized models locally:

```shell
# Create model directory
mkdir -p mlx_models && cd mlx_models

# Download Expert (8B)
git clone https://huggingface.co/mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Download Router (1B)
git clone https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit
```
Update the model paths in `test_speculative.py` and run:

```shell
python test_speculative.py
```
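After the `git clone` steps above, the local paths look like the constants below. The variable names are hypothetical; match them to whatever identifiers `test_speculative.py` actually uses.

```python
# Hypothetical path constants for test_speculative.py -- the names are
# illustrative, but the paths match the download layout described above.
EXPERT_MODEL = "mlx_models/Meta-Llama-3.1-8B-Instruct-4bit"  # 8B expert
DRAFT_MODEL = "mlx_models/Llama-3.2-1B-Instruct-4bit"        # 1B router/drafter
```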
| Mode | Task | Speed (tokens/s) | Note |
|---|---|---|---|
| Fast Mode | Simple Chat | 100+ | Handled by the 1B model |
| Standard | Coding | ~30 | 8B only |
| Speculative | Coding | ~48 | 8B + 1B draft |
Note: Speculative decoding achieves approximately 1.6x speedup on coding tasks by leveraging the 1B model's ability to predict common syntax patterns.
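Where the ~1.6x figure comes from can be shown with a back-of-the-envelope simulation of the draft-and-verify loop. All numbers here (per-token costs, acceptance probability, draft length) are assumed for illustration, not measurements from this project.

```python
import random

def simulate_speedup(accept_p=0.7, k=4, draft_cost=1.0, target_cost=8.0,
                     n_tokens=10_000, seed=0):
    """Toy model of speculative decoding throughput.

    Each round the cheap drafter proposes up to k tokens (draft_cost each);
    one expensive target forward pass (target_cost) verifies them, keeping
    an accepted prefix and always emitting one token of its own.
    Returns the speedup over the target model generating every token alone.
    """
    rng = random.Random(seed)
    produced, cost = 0, 0.0
    while produced < n_tokens:
        accepted = 0
        while accepted < k and rng.random() < accept_p:
            accepted += 1
        produced += accepted + 1          # accepted drafts + 1 target token
        cost += k * draft_cost + target_cost
    baseline = n_tokens * target_cost     # target generating solo
    return baseline / cost
```

With a high acceptance rate the simulation lands in the same 1.5x-1.9x range the project reports; dropping `accept_p` toward zero makes drafting pure overhead and the "speedup" falls below 1x, which is why acceptance-friendly workloads like boilerplate-heavy code benefit most.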
This project is licensed under the MIT License.