
⚡️ MLX Smart Router & Speculative Decoding

This project implements an Adaptive Compute system for LLMs on Apple Silicon using the MLX framework.

It intelligently routes user queries between a small model (1B) and a large model (8B) to optimize for both latency and intelligence. For complex tasks, it leverages Speculative Decoding to accelerate the 8B model's generation speed.

Key Features

  1. Smart Routing (Adaptive Compute)

    • Uses Llama-3.2-1B as a router to classify user intent.
    • Simple Queries: Handled instantly by the 1B model (Fast Mode).
    • Complex Queries: Routed to the 8B model (Expert Mode).
  2. Speculative Decoding (Speedup)

    • When the 8B model is active, the 1B model acts as a "Drafter".
    • This provides a ~1.5x to 1.7x speed boost compared to running the 8B model alone, with no loss in quality: the 8B model verifies every drafted token, so the output matches what it would generate on its own.
  3. Hybrid Guardrails

    • Combines LLM-based decision making with keyword-based safety checks to ensure coding tasks are always handled by the Expert model.
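
The hybrid guardrail logic can be sketched as below. This is a minimal illustration, not the project's actual code: `classify_with_llm` is a hypothetical stub standing in for the 1B router call, and the keyword list is an assumed example.

```python
# Minimal sketch of the hybrid routing described above.
# `classify_with_llm` is a hypothetical stand-in for prompting the 1B router;
# CODE_KEYWORDS is an assumed example list, not the project's real one.

CODE_KEYWORDS = {"code", "python", "function", "debug", "script", "class"}

def classify_with_llm(query: str) -> str:
    """Placeholder for the 1B model's intent classification."""
    return "complex" if len(query.split()) > 12 else "simple"

def route(query: str) -> str:
    # Keyword safety net: coding tasks always go to the 8B expert,
    # regardless of what the LLM router decides.
    if any(kw in query.lower() for kw in CODE_KEYWORDS):
        return "expert"
    # Otherwise, trust the LLM-based classification.
    return "expert" if classify_with_llm(query) == "complex" else "fast"

print(route("Write a Python function to merge two sorted lists"))  # expert
print(route("Hi, how are you?"))                                   # fast
```

The keyword check runs first so that a misclassification by the small router can never send a coding task to Fast Mode.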

Architecture

```mermaid
graph TD
    A[User Input] --> B[Router / 1B Model]
    B -->|Classify Intent| C{Is Simple?}

    C -- Yes --> D[Fast Mode: 1B Generates Answer]
    C -- No --> E[Expert Mode: 8B + 1B Speculative]

    B -.->|Drafting Support| E

    D --> F[Final Output]
    E --> F
```
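
The Expert Mode verify-accept loop can be illustrated with a toy token-level simulation. No real models are involved; the `drafted` and `verified` lists stand in for the 1B and 8B models' token choices, and real implementations compare model distributions rather than exact tokens.

```python
# Toy illustration of one speculative decoding step: the drafter proposes
# k tokens, the verifier checks them left to right, and the first mismatch
# truncates the draft (the verifier's own token is used instead).

def speculative_step(drafted: list[str], verified: list[str]) -> tuple[list[str], int]:
    """Accept the longest matching prefix of the draft, then take the
    verifier's token at the first mismatch.
    Returns (accepted tokens, number of draft tokens kept)."""
    accepted = []
    for d, v in zip(drafted, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)  # verifier overrides the mismatch
            return accepted, len(accepted) - 1
    return accepted, len(accepted)

# Example: the drafter gets 3 of 4 tokens right before diverging.
out, kept = speculative_step(
    drafted=["def", "add", "(", "x"],
    verified=["def", "add", "(", "a"],
)
print(out, kept)  # ['def', 'add', '(', 'a'] 3
```

Because the verifier checks all k drafted tokens in a single forward pass, every accepted draft token is a token the 8B model did not have to generate sequentially.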

💻 Requirements

  • Hardware: Apple Silicon Mac (M1/M2/M3/M4)
  • Memory: 16GB unified memory or more recommended
  • Software: Python 3.11+, mlx-lm

📦 Installation

  1. Clone this repository:

```bash
git clone https://github.com/uqer1244/MLX-Smart-Router-Speculative-Decoding.git
cd mlx-smart-router
```

  2. Install dependencies:

```bash
pip install mlx-lm
```

  3. Download the models. You can download the converted MLX models locally or use Hugging Face paths:

  • Expert: `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit`
  • Router/Drafter: `mlx-community/Llama-3.2-1B-Instruct-4bit`

Run the following commands to download the quantized models locally:

```bash
# Create model directory
mkdir -p mlx_models && cd mlx_models

# Download Expert (8B)
git clone https://huggingface.co/mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

# Download Router (1B)
git clone https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit
```

Note: cloning Hugging Face model repositories requires `git-lfs` to fetch the weight files.

🚀 Usage

Update the model paths in test_speculative.py and run:

```bash
python test_speculative.py
```

📊 Performance Benchmark (on M3 Pro 18GB)

| Mode        | Task        | Speed (TPS) | Note            |
|-------------|-------------|-------------|-----------------|
| Fast Mode   | Simple Chat | 100+        | Processed by 1B |
| Standard    | Coding      | ~30         | 8B only         |
| Speculative | Coding      | ~48         | 8B + 1B draft   |

Note: Speculative decoding achieves approximately 1.6x speedup on coding tasks by leveraging the 1B model's ability to predict common syntax patterns.
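
The ~1.6x figure is consistent with the standard acceptance-rate model of speculative decoding: if each 8B verification pass checks k drafted tokens and the per-token acceptance rate is a, the expected tokens produced per pass is the geometric sum (1 - a^(k+1)) / (1 - a). The numbers below are illustrative assumptions, not measurements from this project, and drafting cost is ignored:

```python
# Expected tokens emitted per verification pass of the target model,
# assuming an independent per-token acceptance rate `a` and draft length `k`.
# This is the standard speculative-decoding expectation; the acceptance
# rates below are assumed for illustration, not measured.

def expected_tokens_per_pass(a: float, k: int) -> float:
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.5, 0.7, 0.8):
    print(f"acceptance={a:.1f}, k=4 -> {expected_tokens_per_pass(a, 4):.2f} tokens/pass")
```

A drafter that predicts common code syntax well pushes the acceptance rate up, which is why the speedup is largest on coding tasks.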

📜 License

This project is licensed under the MIT License.
