Personalize Gemma 4. Make it real-time. Run it locally.
Quick Start · Guides · Benchmarks · Architecture · Website
Fine-tune Google's Gemma 4 on your own conversations (iMessage, Facebook Messenger, etc.) and serve it at real-time voice speeds on Apple Silicon. No cloud. No API keys. Your data never leaves your machine.
Measured on M4 Max (128GB unified memory), Gemma 4 E4B 4-bit quantization:
Head-to-head benchmark results: Gemma 4 E4B on Apple Silicon

| Backend | TTFT P50 | TTFT P95 | TPS P50 | TPS Mean | Verdict |
|---|---|---|---|---|---|
| MLX Server (mlx_lm) | 154 ms | 889 ms | 111.6 | 111.4 | REAL-TIME |
| Ollama (Go + llama.cpp) | 141 ms | 150 ms | 107.9 | 102.7 | REAL-TIME |
| llama.cpp Metal | 136 ms | 141 ms | 94.0 | 94.1 | REAL-TIME |

All backends generate text 20-30x faster than human speech (RTF < 0.05).
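The RTF < 0.05 claim follows from simple speech-rate arithmetic. A quick sanity check, where the speech rate (~2.5 words/sec) and tokens-per-word ratio (~1.3) are assumed averages, not measured values:

```python
def rtf(tokens_per_sec, words_per_sec=2.5, tokens_per_word=1.3):
    """Real-time factor: time to generate text / time to speak it.
    RTF < 1 means generation outpaces playback."""
    spoken_tps = words_per_sec * tokens_per_word  # ~3.25 tokens/s spoken
    return spoken_tps / tokens_per_sec

print(round(rtf(111.6), 3))  # MLX server's P50 throughput
```

At 111.6 tok/s this gives an RTF around 0.029, i.e. roughly 34x faster than speech, consistent with the table above.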
111 tokens/sec with a personalized fine-tuned model. That's Gemini Live territory, running locally on your Mac.
You want an AI that sounds like you, not a generic chatbot. And you want it fast enough for real-time voice conversation. Until now, that required cloud APIs and sending your data to someone else's servers.
```
Your Messages ──▶ Fine-Tune ──▶ Serve Locally ──▶ Real-Time Voice
 (iMessage)        (5 min)      (111 tok/s)       (your style)
 (Facebook)
 (WhatsApp)
```
gemma-realtime gives you the complete pipeline:
- Extract your conversations from iMessage, Facebook, or any messaging platform
- Fine-tune Gemma 4 with LoRA in minutes on Apple Silicon
- Serve at real-time speeds through your choice of optimized backend
- Benchmark to prove it meets voice latency targets
```bash
# Install
pip install mlx mlx-lm

# Extract your data (pick one or both)
python3 scripts/extract_imessage_pairs.py
python3 scripts/extract-facebook.py --export ~/Downloads/facebook-export

# Prepare training data
python3 scripts/prepare-training-data.py --voice

# Fine-tune (5-15 min on M4 Max)
python3 scripts/finetune-gemma.py --target e4b --data data/finetune

# Serve with real-time optimizations
python3 scripts/mlx-server.py \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --adapter-path ~/.human/adapters/persona \
  --realtime

# Prove it works
python3 scripts/voice-bench.py
```

The MLX server is the default inference backend for h-uman. When `~/.human/config.json` exists, the server auto-reads model, adapter, port, and TurboQuant+ settings; no flags needed:
```bash
# Start via h-uman (auto-detects gemma-realtime)
~/.human/bin/human-serve.sh start

# Or run mlx-server.py directly (reads ~/.human/config.json)
python3 scripts/mlx-server.py
```

See the Real-Time Serving Guide for config details.
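For orientation, a `~/.human/config.json` might look like the sketch below. The field names and the port value are illustrative guesses inferred from the CLI flags, not the actual schema; the Real-Time Serving Guide documents the real keys:

```json
{
  "model": "mlx-community/gemma-4-e4b-it-4bit",
  "adapter_path": "~/.human/adapters/persona",
  "port": 8080,
  "realtime": true,
  "kv_bits": 4
}
```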
```
┌────────────────────────────┐
│     Your Conversations     │
│  iMessage · Facebook · …   │
└─────────────┬──────────────┘
              │
      extract & prepare
              │
              ▼
┌────────────────────────────┐
│       Training Data        │
│    train.jsonl (JSONL)     │
└─────────────┬──────────────┘
              │
  LoRA fine-tune (SFT + DPO)
              │
      ┌───────┼────────┐
      ▼       ▼        ▼
┌───────────┐ ┌───────────┐ ┌──────────────┐
│ E4B (4B)  │ │ E2B (2B)  │ │ 31B (dense)  │
│ 110 tok/s │ │ 180 tok/s │ │  20 tok/s    │
│  Voice    │ │  Draft    │ │  Quality     │
└─────┬─────┘ └─────┬─────┘ └──────┬───────┘
      │             │              │
      └──────┬──────┘              │
             ▼                     ▼
     ┌──────────────┐      ┌──────────────┐
     │  MLX Server  │      │  MLX Server  │
     │  Ollama      │      │ (high qual)  │
     │  llama.cpp   │      └──────────────┘
     │  vLLM Metal  │
     └──────────────┘
     Real-Time Voice
```
| Target | Params | Speed | Use Case |
|---|---|---|---|
| E4B | 4B | ~110 tok/s | Real-time voice, daily driver |
| E2B | 2B | ~180 tok/s | Speculative decode draft model |
| 31B | 31B | ~20 tok/s | Highest quality, complex reasoning |
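The E2B row exists as a draft model for speculative decoding: a small fast model proposes tokens, and the larger model verifies a whole batch in one forward pass. A toy simulation of the accept/verify loop (pure illustration with random "samplers", not the mlx_lm API):

```python
import random

def speculative_step(draft_sample, target_sample, k=4):
    """One toy round of speculative decoding: the fast draft model
    proposes k tokens; the target accepts the longest prefix it agrees
    with, then always contributes one token of its own."""
    proposals = [draft_sample() for _ in range(k)]
    accepted = []
    for tok in proposals:
        if target_sample() == tok:  # stand-in for the real accept test
            accepted.append(tok)
        else:
            break
    accepted.append(target_sample())  # target always emits >= 1 token
    return accepted

random.seed(0)
vocab = list(range(8))
step = speculative_step(lambda: random.choice(vocab),
                        lambda: random.choice(vocab))
print(f"emitted {len(step)} token(s) from one target verification pass")
```

The win: between 1 and k+1 tokens come out per expensive target pass, which is why a ~180 tok/s E2B draft can speed up the E4B without changing its outputs in the real (rejection-sampling) algorithm.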
| Backend | Language | Key Advantage | Best For |
|---|---|---|---|
| MLX Server | Python | Highest throughput, LoRA hot-swap | Development |
| Ollama | Go | Zero GIL, most consistent latency | Production |
| llama.cpp | C++ | Lowest TTFT, fused Metal kernels | First-word speed |
| vLLM Metal | Python | Paged attention, continuous batching | Multi-user |
| ANE+GPU | Python | Dual-compute speculative decode | Experimental |
Things we learned making Gemma 4 real-time on Apple Silicon:
Switching from the multimodal mlx_vlm import to text-only mlx_lm gave a 10x speedup (13 → 110+ tok/s). The VLM path adds numpy synchronization overhead even for text-only inference.
Gemma 4 uses ScaledLinear layers that most community quantizations corrupt. Only PLE-safe quants (which skip these layers) produce correct output. The scripts detect and warn about broken models automatically.
Ollama wraps llama.cpp in a Go server, so there is no GIL serialization on token dispatch. The result: the most consistent P50-to-P95 latency spread of any backend.
llama.cpp's fused RoPE+attention shaders give the fastest time-to-first-token. For voice UX, the first word is what the user feels.
TurboQuant+ compresses the KV cache 3.8-6.4x using PolarQuant + Walsh-Hadamard rotation. It is integrated via the MLX port: TurboKVCache is a drop-in replacement for mlx-lm's KVCache with zero framework changes. The --realtime flag auto-enables 4-bit TurboQuant+; add --kv-asymmetric for FP16 keys (best quality, less compression).
| Config | Compression | Quality (vs FP16) | Decode Speed |
|---|---|---|---|
| turbo4 (4-bit) | 3.8x | +0.23% PPL | 97-100% baseline |
| turbo3 (3-bit) | 4.6x | +1.06% PPL | 90-93% baseline |
| asymmetric (K=FP16, V=turbo4) | ~2x | +0.51% PPL | 99% baseline |
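The 3.8x figure for 4-bit lines up with per-group quantization overhead against an FP16 baseline. A back-of-envelope check, where the group size of 64 and one FP16 scale per group are assumptions about the scheme, not documented internals:

```python
def kv_compression(bits, group=64, scale_bits=16, baseline_bits=16):
    """Effective KV-cache compression vs FP16, assuming one FP16 scale
    stored per quantization group of `group` elements."""
    effective_bits = bits + scale_bits / group  # payload + metadata
    return baseline_bits / effective_bits

print(round(kv_compression(4), 2))  # close to the reported 3.8x
```

The same arithmetic explains why the asymmetric mode lands near 2x: keys stay at a full 16 bits while only values get the ~3.8x treatment.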
```bash
# Install TurboQuant+ MLX fork
pip install git+https://github.com/TheTom/mlx.git@feature/turboquant-plus

# Real-time mode auto-enables TurboQuant+ 4-bit
python3 scripts/mlx-server.py --model mlx-community/gemma-4-e4b-it-4bit --realtime

# Asymmetric mode (best for Q4_K_M models; preserves K precision)
python3 scripts/mlx-server.py --model mlx-community/gemma-4-e4b-it-4bit --kv-bits 4 --kv-asymmetric
```

Apple Silicon has an undocumented matrix coprocessor (AMX on M1-M3, SME2 on M4+) that Accelerate's BLAS uses internally. Direct benchmarking shows 2.5 TFLOPS FP32, which puts every matmul in every transformer layer at GPU-like speeds on the CPU.
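You can observe the coprocessor indirectly by timing a BLAS matmul from NumPy. This only exercises AMX/SME if your NumPy wheel is linked against Accelerate (common on recent macOS arm64 builds; elsewhere it measures whatever BLAS is installed):

```python
import time
import numpy as np

def gemm_tflops(n=2048, runs=10):
    """Rough FP32 matmul throughput via the installed BLAS.
    A matmul does 2*n^3 floating-point operations."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up, also triggers BLAS thread spin-up
    t0 = time.perf_counter()
    for _ in range(runs):
        a @ b
    dt = time.perf_counter() - t0
    return 2 * n**3 * runs / dt / 1e12

print(f"{gemm_tflops():.2f} TFLOPS FP32")
```

Numbers in the low single-digit TFLOPS on an M-series CPU are the AMX/SME units at work; a NEON-only scalar path would be well under 0.1 TFLOPS.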
Apple's IOSurface lets CPU, GPU, and ANE access the same physical memory with no memcpy. The hybrid pipeline uses this for zero-copy KV cache: GPU writes during prefill, ANE reads during decode. Measured 5+ TB/s effective bandwidth.
Secret APIs: The Hidden Performance Stack
We reverse-engineered, built, and benchmarked the undocumented hardware features that make real-time LLM inference possible on Apple Silicon:
| Layer | What It Is | Result |
|---|---|---|
| AMX/SME2 | Undocumented CPU matrix coprocessor | 77x over NEON, 2.5 TFLOPS FP32 |
| Neural Engine | Private _ANEClient API (67 classes discovered) | 15.8 TFLOPS FP16, 6.6 TFLOPS/W |
| Direct ANE | Bypass CoreML via _ANEInMemoryModelDescriptor | In-memory MIL compilation, training proven |
| IOSurface | Zero-copy shared memory across CPU/GPU/ANE | 5+ TB/s effective bandwidth |
| Metal 4 Tensor | MTLTensor + Shader ML + ML Command Encoder | Full CoreML on GPU timeline |
| M5 Neural Accel | Per-GPU-core Neural Accelerators (10-40 units) | 4x peak AI compute vs M4 |
| Metal Dynamic | MTLFunctionConstant kernel specialization | Fused attention for all Gemma configs |
| Hybrid Pipeline | GPU prefill + ANE decode + zero-copy KV cache | 1,333 tok/s, 53x real-time margin |
```bash
# Build and run all 8 secret API benchmarks
cd secret-apis && make all && make bench
```

See Guide 06: Secret APIs for the full deep dive, including the maderix/ANE reverse-engineering work that proved training on the Neural Engine is possible.
```
scripts/
├── extract_imessage_pairs.py   # Extract iMessage conversations (macOS)
├── extract-facebook.py         # Extract Facebook Messenger data
├── prepare-training-data.py    # Combine sources → train/valid splits
├── finetune-gemma.py           # LoRA pipeline (SFT + DPO + quantize)
├── mlx-server.py               # MLX inference server (OpenAI-compatible)
├── ollama-serve.sh             # Ollama serving script
├── llamacpp-serve.sh           # llama.cpp Metal server
├── vllm-metal-serve.sh         # vLLM Metal server
├── ane-gpu-bridge.py           # ANE+GPU dual-compute bridge
├── voice-bench.py              # Single-backend voice benchmark
└── bench-all-backends.py       # Head-to-head comparison

secret-apis/
├── amx_matmul.c                # AMX/SME2 coprocessor benchmark
├── amx.h                       # Reverse-engineered AMX instruction encodings
├── sme2_matmul.c               # ARM SME2 detection and benchmark
├── ane_probe.m                 # Neural Engine private API discovery
├── ane_direct.m                # Direct ANE access (maderix/ANE findings, 67 classes)
├── iosurface_bridge.m          # IOSurface zero-copy bridge + Metal compute
├── metal_dynamic.m             # Dynamic kernel compilation + fused attention
├── metal4_tensor.m             # Metal 4 Tensor APIs + M5 Neural Accelerator probe
├── hybrid_pipeline.m           # Full GPU+ANE hybrid inference pipeline
├── bench_all_secrets.sh        # Run all benchmarks with report generation
└── Makefile                    # Build system

guides/
├── 01-quickstart.md            # Running in 10 minutes
├── 02-data-preparation.md      # iMessage, Facebook, WhatsApp, custom
├── 03-fine-tuning.md           # LoRA deep dive
├── 04-real-time-serving.md     # All 5 backends explained
├── 05-benchmarking.md          # Measuring and interpreting results
└── 06-secret-apis.md           # Apple Silicon secret performance stack
```
| Hardware | E2B (2B) | E4B (4B) | 31B |
|---|---|---|---|
| Minimum | M1, 8GB | M1 Pro, 16GB | M2 Max, 64GB |
| Recommended | M2+, 16GB | M3 Pro+, 36GB | M4 Max, 128GB |
| Expected TPS | 150-200 | 80-120 | 15-25 |
Apple Silicon's unified memory is the key enabler: the GPU reads model weights directly from main memory with no copying.
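The memory floors in the table above follow from weight size arithmetic. A rough estimate, where the 20% overhead factor for KV cache and runtime buffers is an assumption, not a measurement:

```python
def model_gb(params_billions, bits, overhead=1.2):
    """Approximate resident memory for a quantized model:
    params x bits/8 bytes of weights, plus ~20% runtime overhead."""
    weight_bytes = params_billions * 1e9 * bits / 8
    return weight_bytes / 2**30 * overhead

for name, p in [("E2B", 2), ("E4B", 4), ("31B", 31)]:
    print(f"{name}: ~{model_gb(p, 4):.1f} GB at 4-bit")
```

This is why E4B at 4-bit (~2 GB of weights) fits comfortably on a 16 GB machine while the 31B model pushes into 64 GB territory once context and OS headroom are included.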
| Guide | Description |
|---|---|
| Quick Start | Get running in 10 minutes |
| Data Preparation | Extract from iMessage, Facebook, WhatsApp |
| Fine-Tuning | LoRA hyperparameters, DPO, quantization |
| Real-Time Serving | Backend setup and optimization |
| Benchmarking | Measure TTFT, TPS, RTF |
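The benchmarking guide's core metrics reduce to simple functions of per-token arrival times. A minimal sketch, assuming timestamps are measured in seconds from request start:

```python
def voice_metrics(token_times):
    """TTFT and steady-state TPS from per-token arrival timestamps.
    TTFT is the first token's latency; TPS is the decode-phase rate,
    excluding the prefill wait before token 1."""
    ttft = token_times[0]
    tps = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    return ttft, tps

# Simulated stream: 100 tokens at 110 tok/s after a 150 ms prefill
ts = [0.15 + i / 110 for i in range(100)]
ttft, tps = voice_metrics(ts)
print(f"TTFT={ttft * 1000:.0f}ms  TPS={tps:.1f}")  # TTFT=150ms  TPS=110.0
```

Run this against a real backend by recording `time.perf_counter()` on each streamed chunk; RTF then follows by dividing generation time by the audio duration of the spoken text.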
Everything runs locally. Your conversation data, training process, and inference all happen on your Mac:
- Extracted JSONL files contain your messages; treat them as sensitive
- The LoRA adapter encodes your communication style; keep it private
- No network calls during extraction, training, or inference
- The `.gitignore` excludes all data and model files by default
Contributions welcome. See CONTRIBUTING.md for guidelines.
Areas where help is most needed:
- More data extractors (Telegram, Discord, Signal, WhatsApp native)
- CoreML/ANE optimization for the draft model
- Windows/Linux support (currently macOS-focused)
- Voice pipeline integration (STT → inference → TTS)
MIT. See LICENSE.
Gemma models are licensed under Google's Gemma Terms of Use.