
MLX LLM Benchmark Suite for Apple Silicon

Comprehensive benchmarking tools to measure LLM inference performance on Apple Silicon Macs, with metrics directly comparable to HuggingFace model cards.

What This Measures

| Metric | Description | Model Card Equivalent |
|---------------------|------------------------------------------|---------------------------------------|
| TTFT | Time to First Token (ms) | "Time to first token" |
| Generation Speed | Tokens per second during generation | "Tokens/sec", "Generation speed" |
| Prompt Processing | Tokens per second for prompt evaluation | "Prompt processing", "Prefill speed" |
| Peak Memory | Maximum memory usage (GB) | "Memory usage", "VRAM required" |
| Latency Percentiles | P50/P95/P99 per-token latency | "Latency distribution" |
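All of these metrics can be derived from per-token timestamps. As a rough illustration (not the benchmark.py implementation), the sketch below times the first token and the steady-state generation rate with mlx-lm's stream_generate; the model name and prompt are placeholders, and the exact objects yielded by stream_generate vary between mlx-lm versions.

# Sketch: measure TTFT and generation speed with mlx-lm (illustrative only)
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
prompt = "Explain unified memory on Apple Silicon in one paragraph."

start = time.perf_counter()
token_times = []
for _ in stream_generate(model, tokenizer, prompt, max_tokens=128):
    token_times.append(time.perf_counter())  # one timestamp per generated token

ttft_ms = (token_times[0] - start) * 1000
gen_seconds = token_times[-1] - token_times[0]
tok_per_s = (len(token_times) - 1) / gen_seconds if gen_seconds > 0 else 0.0
print(f"TTFT: {ttft_ms:.1f} ms | Generation: {tok_per_s:.1f} tok/s")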

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • 16GB+ unified memory (36GB recommended for larger models)

Quick Start

Option 1: MLX (Recommended for Apple Silicon)

# Install dependencies
pip install -r requirements.txt

# Run benchmark with default 4 models
python benchmark.py

# Quick benchmark (fewer iterations)
python benchmark.py --quick

# Test specific models
python benchmark.py --models \
    mlx-community/Llama-3.2-3B-Instruct-4bit \
    mlx-community/Qwen3-4B-4bit \
    mlx-community/gemma-2-9b-it-4bit \
    mlx-community/Meta-Llama-3-8B-Instruct-4bit

# Export results for your blog
python benchmark.py --export markdown --output results.md

Option 2: Ollama (Simpler Setup)

# Install Ollama
brew install ollama

# Start Ollama service
ollama serve

# Run benchmark
python ollama_benchmark.py

# With custom models
python ollama_benchmark.py --models llama3.2:3b qwen2.5:7b gemma2:9b mistral:7b
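If you want numbers straight from Ollama rather than the wrapper script, you can query the /api/generate endpoint yourself. The sketch below assumes Ollama is serving on its default port (11434) and uses the eval_count and eval_duration fields it reports; the model tag is just an example.

# Sketch: read generation speed from Ollama's own timing fields
import json
import urllib.request

payload = {"model": "llama3.2:3b", "prompt": "Say hello.", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_duration is reported in nanoseconds
tok_per_s = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"Generation speed: {tok_per_s:.1f} tok/s")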

Option 3: Accurate Per-Token Timing

For the most precise measurements (TTFT, latency percentiles):

python accurate_benchmark.py --models \
    mlx-community/Llama-3.2-3B-Instruct-4bit \
    mlx-community/Qwen3-4B-4bit

# Export to markdown for blog posts
python accurate_benchmark.py --export results.md
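The latency percentiles come from the gaps between consecutive token timestamps. A minimal illustration of that calculation, independent of the benchmark scripts, assuming token_times holds one monotonically increasing timestamp per generated token:

# Sketch: per-token latency percentiles from a list of token timestamps
import statistics

def latency_percentiles(token_times):
    gaps_ms = [(b - a) * 1000 for a, b in zip(token_times, token_times[1:])]
    cuts = statistics.quantiles(gaps_ms, n=100)  # 99 cut points
    return cuts[49], cuts[94], cuts[98]          # P50, P95, P99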

Understanding the Results

Sample Output

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Benchmarking: mlx-community/Qwen3-4B-4bit
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Target prompt length: ~4096 tokens
  Loading model: Qwen3-4B-4bit
  Prompt tokens: 4102
  Target generation: 128 tokens
  Warming up...
  Benchmarking with per-token timing...

  Results:
     • TTFT: 892.3 ms
     • Generation: 45.2 tok/s
     • Prompt Processing: 4598.1 tok/s
     • Avg Latency: 22.1 ms/tok
     • Peak Memory: 3.24 GB

Comparing to Model Cards

Model cards on HuggingFace typically report:

  1. Generation Speed: Our "Gen tok/s" metric
    • Model cards are usually measured on specific hardware (typically NVIDIA GPUs)
    • Apple Silicon typically achieves 60-80% of high-end GPU throughput
    • Example: if a card reports "80 tok/s on an A100", expect roughly 50-65 tok/s on an M4 Max
  2. Time to First Token (TTFT): Our "TTFT" metric
    • Depends heavily on prompt length
    • An M4 Max should achieve sub-second TTFT on 7B models for prompts up to 4K tokens
  3. Memory Usage: Our "Peak Memory" metric
    • 4-bit quantized models use ~0.5 GB per billion parameters
    • Add ~1 GB of overhead for the KV cache at long context (see the quick estimate below)
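Those two rules of thumb make memory needs easy to estimate before downloading anything. A back-of-the-envelope helper (the constants are the approximations above, not measured values):

# Rough memory estimate for a 4-bit quantized model:
# ~0.5 GB per billion parameters plus ~1 GB of KV-cache overhead at long context
def estimate_memory_gb(params_billion, kv_overhead_gb=1.0):
    return 0.5 * params_billion + kv_overhead_gb

print(estimate_memory_gb(8))  # ~5 GB for an 8B model, in line with the table below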

Expected Performance (M4 Max 36GB)

Based on Apple's benchmarks and community reports:

| Model Size | Quantization | Gen Speed | TTFT (4K prompt) | Memory |
|------------|--------------|--------------|------------------|--------|
| 3B | 4-bit | 60-80 tok/s | 200-400 ms | ~2 GB |
| 7-8B | 4-bit | 40-55 tok/s | 500-900 ms | ~5 GB |
| 14B | 4-bit | 25-40 tok/s | 1.5-2.5 s | ~9 GB |
| 27B | 4-bit | 15-25 tok/s | 3-5 s | ~16 GB |

Recommended Models for M4 Max 36GB

Fast & Efficient (< 5GB)

  • mlx-community/Llama-3.2-3B-Instruct-4bit
  • mlx-community/Qwen3-4B-4bit
  • mlx-community/Ministral-3-3B-Instruct-2512-4bit

Balanced Performance (5-10GB)

  • mlx-community/Meta-Llama-3-8B-Instruct-4bit
  • mlx-community/gemma-2-9b-it-4bit
  • mlx-community/Mistral-7B-Instruct-v0.3-4bit

Maximum Quality (10-20GB)

  • mlx-community/DeepSeek-R1-Distill-Qwen-14B-4bit
  • mlx-community/gemma-3-12b-it-qat-4bit
  • mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit

Push the Limits (20GB+)

  • mlx-community/gemma-3-27b-it-qat-4bit (~16GB)

Using Results in Your Newsletter

The benchmark exports results in multiple formats:

# Markdown (great for blog posts)
python benchmark.py --export markdown --output benchmark_results.md

# CSV (for spreadsheets/analysis)
python benchmark.py --export csv --output benchmark_results.csv

# JSON (for programmatic use)
python benchmark.py --export json --output benchmark_results.json

Sample Markdown Output

## Performance Summary

| Model | Quant | Prompt | TTFT (ms) | Gen (tok/s) | Memory (GB) |
|-------|-------|--------|-----------|-------------|-------------|
| Llama-3.2-3B-Instruct-4bit | 4-bit | 4102 | 312.5 | 72.3 | 2.14 |
| Qwen3-4B-4bit | 4-bit | 4098 | 445.2 | 58.7 | 2.89 |
| gemma-2-9b-it-4bit | 4-bit | 4105 | 892.1 | 42.5 | 5.67 |

Troubleshooting

"MLX not found"

pip install mlx mlx-lm

"Model not found"

Models are downloaded from HuggingFace. Ensure you have internet access and sufficient disk space (~5-20GB per model).
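If a download keeps failing mid-benchmark, you can pre-fetch a model into the local HuggingFace cache first. A minimal sketch, assuming the huggingface_hub package is installed; the repo id is an example:

# Sketch: pre-download a model into the local HuggingFace cache
from huggingface_hub import snapshot_download

snapshot_download("mlx-community/Llama-3.2-3B-Instruct-4bit")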

"Out of memory"

  • Use smaller models or more aggressive quantization
  • Close other applications
  • Try 4-bit instead of 8-bit quantization

Slow Performance

  • Ensure no other heavy processes are running
  • Check Activity Monitor for GPU utilization
  • Use --quick flag for faster (less accurate) benchmarks

License

MIT License - Feel free to use these benchmarks in your articles and newsletters!
