Comprehensive benchmarking tools to measure LLM inference performance on Apple Silicon Macs, with metrics directly comparable to HuggingFace model cards.
| Metric | Description | Model Card Equivalent |
|---|---|---|
| TTFT | Time to First Token (ms) | "Time to first token" |
| Generation Speed | Tokens per second during generation | "Tokens/sec", "Generation speed" |
| Prompt Processing | Tokens per second for prompt evaluation | "Prompt processing", "Prefill speed" |
| Peak Memory | Maximum memory usage (GB) | "Memory usage", "VRAM required" |
| Latency Percentiles | P50/P95/P99 per-token latency | "Latency distribution" |
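All of these metrics fall out of a single set of per-token timestamps collected during one generation pass. A rough sketch of the arithmetic (variable names are illustrative, not the benchmark scripts' actual internals):

```python
import statistics

def summarize(start_time: float, token_times: list[float], prompt_tokens: int) -> dict:
    """Derive headline metrics from raw per-token timestamps (all times in seconds)."""
    ttft = token_times[0] - start_time                            # prefill + first decode step
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    cuts = statistics.quantiles(gaps, n=100)                      # percentile cut points
    return {
        "ttft_ms": ttft * 1e3,
        "prompt_tps": prompt_tokens / ttft,                       # prefill speed (approximation)
        "gen_tps": len(gaps) / (token_times[-1] - token_times[0]),
        "avg_latency_ms": statistics.mean(gaps) * 1e3,
        "p50_ms": cuts[49] * 1e3,
        "p95_ms": cuts[94] * 1e3,
        "p99_ms": cuts[98] * 1e3,
    }
```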
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- 16GB+ unified memory (36GB recommended for larger models)
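Before a long run it can be worth confirming the environment programmatically; a small sketch using only the standard library and macOS's sysctl (not part of the benchmark scripts):

```python
import platform, subprocess, sys

assert sys.version_info >= (3, 10), "Python 3.10+ required"
assert platform.system() == "Darwin" and platform.machine() == "arm64", "Apple Silicon Mac required"

# Total unified memory in GB, via the macOS sysctl interface
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
print(f"Unified memory: {mem_bytes / 1e9:.0f} GB")
```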
# Install dependencies
pip install -r requirements.txt
# Run benchmark with default 4 models
python benchmark.py
# Quick benchmark (fewer iterations)
python benchmark.py --quick
# Test specific models
python benchmark.py --models \
mlx-community/Llama-3.2-3B-Instruct-4bit \
mlx-community/Qwen3-4B-4bit \
mlx-community/gemma-2-9b-it-4bit \
mlx-community/Meta-Llama-3-8B-Instruct-4bit
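To sanity-check a model outside the benchmark scripts, you can drive it directly through mlx-lm's Python API. A minimal sketch (the load/generate helpers come from the mlx-lm package; exact keyword arguments can differ between versions):

```python
from mlx_lm import load, generate

# Downloads the model from the HuggingFace Hub on first use
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# verbose=True prints mlx-lm's own prompt/generation tok/s figures
text = generate(model, tokenizer, prompt="Explain unified memory in one sentence.",
                max_tokens=64, verbose=True)
print(text)
```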
# Export results for your blog
python benchmark.py --export markdown --output results.md

To benchmark models served through Ollama instead of MLX:

# Install Ollama
brew install ollama
# Start Ollama service
ollama serve
# Run benchmark
python ollama_benchmark.py
# With custom models
python ollama_benchmark.py --models llama3.2:3b qwen2.5:7b gemma2:9b mistral:7b
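ollama_benchmark.py drives the local Ollama server; if you want the raw counters yourself, Ollama's /api/generate endpoint returns token counts and durations (in nanoseconds) alongside the completion. A rough sketch (model and prompt are just examples):

```python
import requests

# Non-streaming call; the final response carries Ollama's timing counters
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "Write a haiku about benchmarks.", "stream": False},
    timeout=300,
).json()

# Durations are nanoseconds; prompt_eval fields can be omitted on a cache hit
prompt_tps = resp.get("prompt_eval_count", 0) / max(resp.get("prompt_eval_duration", 1), 1) * 1e9
gen_tps = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"prompt: {prompt_tps:.1f} tok/s  generation: {gen_tps:.1f} tok/s")
```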
For the most precise measurements (TTFT, latency percentiles):

python accurate_benchmark.py --models \
mlx-community/Llama-3.2-3B-Instruct-4bit \
mlx-community/Qwen3-4B-4bit
# Export to markdown for blog posts
python accurate_benchmark.py --export results.md

Example output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Benchmarking: mlx-community/Qwen3-4B-4bit
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Target prompt length: ~4096 tokens
Loading model: Qwen3-4B-4bit
Prompt tokens: 4102
Target generation: 128 tokens
Warming up...
Benchmarking with per-token timing...
Results:
• TTFT: 892.3 ms
• Generation: 45.2 tok/s
• Prompt Processing: 4598.1 tok/s
• Avg Latency: 22.1 ms/tok
• Peak Memory: 3.24 GB
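As a sanity check, these numbers are internally consistent: 4102 prompt tokens / 4598.1 tok/s ≈ 0.89 s, which matches the 892.3 ms TTFT, and 1 / 45.2 tok/s ≈ 22.1 ms matches the average per-token latency. Prompt processing speed and TTFT are essentially two views of the same prefill pass.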
Model cards on HuggingFace typically report:

- Generation Speed: our "Gen tok/s" metric
  - Model cards often test on specific hardware (usually NVIDIA GPUs)
  - Apple Silicon typically achieves 60-80% of high-end GPU throughput
  - Example: if a card says "80 tok/s on A100", expect ~50-65 tok/s on M4 Max
- Time to First Token (TTFT): our "TTFT" metric
  - Depends heavily on prompt length
  - M4 Max should achieve sub-second TTFT for prompts up to 4K tokens on 7B models
- Memory Usage: our "Peak Memory" metric
  - 4-bit quantized models use ~0.5GB per billion parameters
  - Add ~1GB overhead for KV cache at long context (a rough estimator sketch follows below)
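That rule of thumb translates into a quick pre-download estimate; a small sketch of the heuristic (an estimate, not a measurement):

```python
def estimate_memory_gb(params_billion: float, bits: int = 4, kv_overhead_gb: float = 1.0) -> float:
    """Rough footprint: weights at `bits` per parameter plus a flat KV-cache allowance."""
    weights_gb = params_billion * bits / 8   # 4-bit ≈ 0.5 GB per billion parameters
    return weights_gb + kv_overhead_gb

for size in (3, 8, 14, 27):
    print(f"{size}B 4-bit: ~{estimate_memory_gb(size):.1f} GB")
```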
Based on Apple's benchmarks and community reports:
| Model Size | Quantization | Gen Speed | TTFT (4K prompt) | Memory |
|---|---|---|---|---|
| 3B | 4-bit | 60-80 tok/s | 200-400ms | ~2GB |
| 7-8B | 4-bit | 40-55 tok/s | 500-900ms | ~5GB |
| 14B | 4-bit | 25-40 tok/s | 1.5-2.5s | ~9GB |
| 27B | 4-bit | 15-25 tok/s | 3-5s | ~16GB |
Suggested models to benchmark, by size:

- mlx-community/Llama-3.2-3B-Instruct-4bit
- mlx-community/Qwen3-4B-4bit
- mlx-community/Ministral-3-3B-Instruct-2512-4bit
- mlx-community/Meta-Llama-3-8B-Instruct-4bit
- mlx-community/gemma-2-9b-it-4bit
- mlx-community/Mistral-7B-Instruct-v0.3-4bit
- mlx-community/DeepSeek-R1-Distill-Qwen-14B-4bit
- mlx-community/gemma-3-12b-it-qat-4bit
- mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit
- mlx-community/gemma-3-27b-it-qat-4bit (~16GB)
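Because each model is fetched from HuggingFace on first load, it can help to pre-download the larger ones before a benchmarking session; a sketch using the huggingface_hub client (typically already present alongside mlx-lm):

```python
from huggingface_hub import snapshot_download

# Pre-fetch the largest model so the benchmark run isn't dominated by download time
snapshot_download(repo_id="mlx-community/gemma-3-27b-it-qat-4bit")
```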
The benchmark exports results in multiple formats:
# Markdown (great for blog posts)
python benchmark.py --export markdown --output benchmark_results.md
# CSV (for spreadsheets/analysis)
python benchmark.py --export csv --output benchmark_results.csv
# JSON (for programmatic use)
python benchmark.py --export json --output benchmark_results.json

The markdown export produces a summary table like this:

## Performance Summary

| Model | Quant | Prompt | TTFT (ms) | Gen (tok/s) | Memory (GB) |
|-------|-------|--------|-----------|-------------|-------------|
| Llama-3.2-3B-Instruct-4bit | 4-bit | 4102 | 312.5 | 72.3 | 2.14 |
| Qwen3-4B-4bit | 4-bit | 4098 | 445.2 | 58.7 | 2.89 |
| gemma-2-9b-it-4bit | 4-bit | 4105 | 892.1 | 42.5 | 5.67 |
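The JSON export is the easiest format to post-process. The field names below are hypothetical (check the file for the actual schema); the point is just that results can be re-sorted or re-tabulated in a few lines:

```python
import json

# Hypothetical field names -- inspect benchmark_results.json for the real schema
with open("benchmark_results.json") as f:
    results = json.load(f)

for r in sorted(results, key=lambda r: r["gen_tps"], reverse=True):
    print(f'{r["model"]:<40} {r["ttft_ms"]:>8.1f} ms  {r["gen_tps"]:>6.1f} tok/s')
```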
If MLX is missing, install it with:

pip install mlx mlx-lm

Models are downloaded from HuggingFace. Ensure you have internet access and sufficient disk space (~5-20GB per model).

If you run out of memory:

- Use smaller models or more aggressive quantization
- Close other applications
- Try 4-bit instead of 8-bit quantization

If benchmarks run slower than expected:

- Ensure no other heavy processes are running
- Check Activity Monitor for GPU utilization
- Use the --quick flag for faster (less accurate) benchmarks
Further reading:

- MLX Documentation
- mlx-community Models
- Apple's MLX Benchmark Post
- llama.cpp Apple Silicon Performance
MIT License - Feel free to use these benchmarks in your articles and newsletters!