Comprehensive benchmarking tools to measure LLM inference performance on Apple Silicon Macs, with metrics directly comparable to HuggingFace model cards.
| Metric | Description | Model Card Equivalent |
|---|---|---|
| TTFT | Time to First Token (ms) | "Time to first token" |
| Generation Speed | Tokens per second during generation | "Tokens/sec", "Generation speed" |
| Prompt Processing | Tokens per second for prompt evaluation | "Prompt processing", "Prefill speed" |
| Peak Memory | Maximum memory usage (GB) | "Memory usage", "VRAM required" |
| Latency Percentiles | P50/P95/P99 per-token latency | "Latency distribution" |
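All of these metrics fall out of a single set of per-token timestamps collected during one generation pass. A rough sketch of the arithmetic (variable names are illustrative, not the benchmark scripts' actual internals):

```python
import statistics

def summarize(start_time: float, token_times: list[float], prompt_tokens: int) -> dict:
    """Derive headline metrics from raw per-token timestamps (all times in seconds)."""
    ttft = token_times[0] - start_time                            # prefill + first decode step
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    cuts = statistics.quantiles(gaps, n=100)                      # percentile cut points
    return {
        "ttft_ms": ttft * 1e3,
        "prompt_tps": prompt_tokens / ttft,                       # prefill speed (approximation)
        "gen_tps": len(gaps) / (token_times[-1] - token_times[0]),
        "avg_latency_ms": statistics.mean(gaps) * 1e3,
        "p50_ms": cuts[49] * 1e3,
        "p95_ms": cuts[94] * 1e3,
        "p99_ms": cuts[98] * 1e3,
    }
```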
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- 16GB+ unified memory (36GB recommended for larger models)
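Before a long run it can be worth confirming the environment programmatically; a small sketch using only the standard library and macOS's sysctl (not part of the benchmark scripts):

```python
import platform, subprocess, sys

assert sys.version_info >= (3, 10), "Python 3.10+ required"
assert platform.system() == "Darwin" and platform.machine() == "arm64", "Apple Silicon Mac required"

# Total unified memory in GB, via the macOS sysctl interface
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
print(f"Unified memory: {mem_bytes / 1e9:.0f} GB")
```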
# Install dependencies
pip install -r requirements.txt
# Run benchmark with default 4 models
python benchmark.py
# Quick benchmark (fewer iterations)
python benchmark.py --quick
# Test specific models
python benchmark.py --models \
mlx-community/Llama-3.2-3B-Instruct-4bit \
mlx-community/Qwen3-4B-4bit \
mlx-community/gemma-2-9b-it-4bit \
mlx-community/Meta-Llama-3-8B-Instruct-4bit
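To sanity-check a model outside the benchmark scripts, you can drive it directly through mlx-lm's Python API. A minimal sketch (the load/generate helpers come from the mlx-lm package; exact keyword arguments can differ between versions):

```python
from mlx_lm import load, generate

# Downloads the model from the HuggingFace Hub on first use
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# verbose=True prints mlx-lm's own prompt/generation tok/s figures
text = generate(model, tokenizer, prompt="Explain unified memory in one sentence.",
                max_tokens=64, verbose=True)
print(text)
```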
# Export results for your blog
python benchmark.py --export markdown --output results.md

To benchmark models served through Ollama instead of MLX:

# Install Ollama
brew install ollama
# Start Ollama service
ollama serve
# Run benchmark
python ollama_benchmark.py
# With custom models
python ollama_benchmark.py --models llama3.2:3b qwen2.5:7b gemma2:9b mistral:7b
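ollama_benchmark.py drives the local Ollama server; if you want the raw counters yourself, Ollama's /api/generate endpoint returns token counts and durations (in nanoseconds) alongside the completion. A rough sketch (model and prompt are just examples):

```python
import requests

# Non-streaming call; the final response carries Ollama's timing counters
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "Write a haiku about benchmarks.", "stream": False},
    timeout=300,
).json()

# Durations are nanoseconds; prompt_eval fields can be omitted on a cache hit
prompt_tps = resp.get("prompt_eval_count", 0) / max(resp.get("prompt_eval_duration", 1), 1) * 1e9
gen_tps = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"prompt: {prompt_tps:.1f} tok/s  generation: {gen_tps:.1f} tok/s")
```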
For the most precise measurements (TTFT, latency percentiles):

python accurate_benchmark.py --models \
mlx-community/Llama-3.2-3B-Instruct-4bit \
mlx-community/Qwen3-4B-4bit
# Export to markdown for blog posts
python accurate_benchmark.py --export results.md

Example output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Benchmarking: mlx-community/Qwen3-4B-4bit
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Target prompt length: ~4096 tokens
Loading model: Qwen3-4B-4bit
Prompt tokens: 4102
Target generation: 128 tokens
Warming up...
Benchmarking with per-token timing...
Results:
• TTFT: 892.3 ms
• Generation: 45.2 tok/s
• Prompt Processing: 4598.1 tok/s
• Avg Latency: 22.1 ms/tok
• Peak Memory: 3.24 GB
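As a sanity check, these numbers are internally consistent: 4102 prompt tokens / 4598.1 tok/s ≈ 0.89 s, which matches the 892.3 ms TTFT, and 1 / 45.2 tok/s ≈ 22.1 ms matches the average per-token latency. Prompt processing speed and TTFT are essentially two views of the same prefill pass.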
Model cards on HuggingFace typically report:

- Generation Speed: our "Gen tok/s" metric
  - Model cards often test on specific hardware (usually NVIDIA GPUs)
  - Apple Silicon typically achieves 60-80% of high-end GPU throughput
  - Example: if a card says "80 tok/s on A100", expect ~50-65 tok/s on M4 Max
- Time to First Token (TTFT): our "TTFT" metric
  - Depends heavily on prompt length
  - M4 Max should achieve sub-second TTFT for prompts up to 4K tokens on 7B models
- Memory Usage: our "Peak Memory" metric
  - 4-bit quantized models use ~0.5GB per billion parameters
  - Add ~1GB overhead for KV cache at long context (a rough estimator sketch follows below)
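That rule of thumb translates into a quick pre-download estimate; a small sketch of the heuristic (an estimate, not a measurement):

```python
def estimate_memory_gb(params_billion: float, bits: int = 4, kv_overhead_gb: float = 1.0) -> float:
    """Rough footprint: weights at `bits` per parameter plus a flat KV-cache allowance."""
    weights_gb = params_billion * bits / 8   # 4-bit ≈ 0.5 GB per billion parameters
    return weights_gb + kv_overhead_gb

for size in (3, 8, 14, 27):
    print(f"{size}B 4-bit: ~{estimate_memory_gb(size):.1f} GB")
```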
Based on Apple's benchmarks and community reports:
| Model Size | Quantization | Gen Speed | TTFT (4K prompt) | Memory |
|---|---|---|---|---|
| 3B | 4-bit | 60-80 tok/s | 200-400ms | ~2GB |
| 7-8B | 4-bit | 40-55 tok/s | 500-900ms | ~5GB |
| 14B | 4-bit | 25-40 tok/s | 1.5-2.5s | ~9GB |
| 27B | 4-bit | 15-25 tok/s | 3-5s | ~16GB |
Suggested models to benchmark, by size:

- mlx-community/Llama-3.2-3B-Instruct-4bit
- mlx-community/Qwen3-4B-4bit
- mlx-community/Ministral-3-3B-Instruct-2512-4bit
- mlx-community/Meta-Llama-3-8B-Instruct-4bit
- mlx-community/gemma-2-9b-it-4bit
- mlx-community/Mistral-7B-Instruct-v0.3-4bit
- mlx-community/DeepSeek-R1-Distill-Qwen-14B-4bit
- mlx-community/gemma-3-12b-it-qat-4bit
- mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit
- mlx-community/gemma-3-27b-it-qat-4bit (~16GB)
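Because each model is fetched from HuggingFace on first load, it can help to pre-download the larger ones before a benchmarking session; a sketch using the huggingface_hub client (typically already present alongside mlx-lm):

```python
from huggingface_hub import snapshot_download

# Pre-fetch the largest model so the benchmark run isn't dominated by download time
snapshot_download(repo_id="mlx-community/gemma-3-27b-it-qat-4bit")
```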
The benchmark exports results in multiple formats:
# Markdown (great for blog posts)
python benchmark.py --export markdown --output benchmark_results.md
# CSV (for spreadsheets/analysis)
python benchmark.py --export csv --output benchmark_results.csv
# JSON (for programmatic use)
python benchmark.py --export json --output benchmark_results.json

The markdown export produces a summary table like this:

## Performance Summary

| Model | Quant | Prompt | TTFT (ms) | Gen (tok/s) | Memory (GB) |
|-------|-------|--------|-----------|-------------|-------------|
| Llama-3.2-3B-Instruct-4bit | 4-bit | 4102 | 312.5 | 72.3 | 2.14 |
| Qwen3-4B-4bit | 4-bit | 4098 | 445.2 | 58.7 | 2.89 |
| gemma-2-9b-it-4bit | 4-bit | 4105 | 892.1 | 42.5 | 5.67 |
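The JSON export is the easiest format to post-process. The field names below are hypothetical (check the file for the actual schema); the point is just that results can be re-sorted or re-tabulated in a few lines:

```python
import json

# Hypothetical field names -- inspect benchmark_results.json for the real schema
with open("benchmark_results.json") as f:
    results = json.load(f)

for r in sorted(results, key=lambda r: r["gen_tps"], reverse=True):
    print(f'{r["model"]:<40} {r["ttft_ms"]:>8.1f} ms  {r["gen_tps"]:>6.1f} tok/s')
```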
If MLX is missing, install it with:

pip install mlx mlx-lm

Models are downloaded from HuggingFace. Ensure you have internet access and sufficient disk space (~5-20GB per model).

If you run out of memory:

- Use smaller models or more aggressive quantization
- Close other applications
- Try 4-bit instead of 8-bit quantization

If benchmarks run slower than expected:

- Ensure no other heavy processes are running
- Check Activity Monitor for GPU utilization
- Use the --quick flag for faster (less accurate) benchmarks
Further reading:

- MLX Documentation
- mlx-community Models
- Apple's MLX Benchmark Post
- llama.cpp Apple Silicon Performance
MIT License - Feel free to use these benchmarks in your articles and newsletters!