# gemma-realtime

Personalize Gemma 4. Make it real-time. Run it locally.

Quick Start · Guides · Benchmarks · Architecture · Website



Fine-tune Google's Gemma 4 on your own conversations (iMessage, Facebook Messenger, etc.) and serve it at real-time voice speeds on Apple Silicon. No cloud. No API keys. Your data never leaves your machine.

## Benchmark Results

Measured on an M4 Max (128GB unified memory) with Gemma 4 E4B at 4-bit quantization:

```text
==========================================================================================
  HEAD-TO-HEAD BENCHMARK RESULTS — Gemma 4 E4B on Apple Silicon
==========================================================================================

  Backend                      TTFT P50   TTFT P95   TPS P50   TPS Mean   Verdict
  ---------------------------  ---------  ---------  --------  ---------  ---------
  MLX Server (mlx_lm)          154ms      889ms      111.6     111.4      REAL-TIME
  Ollama (Go + llama.cpp)      141ms      150ms      107.9     102.7      REAL-TIME
  llama.cpp Metal              136ms      141ms      94.0      94.1       REAL-TIME

  All backends generate text 20-30x faster than human speech (RTF < 0.05).
==========================================================================================
```

111 tokens/sec with a personalized, fine-tuned model: that's Gemini Live territory, running locally on your Mac.
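The real-time-factor claim is easy to sanity-check. A back-of-envelope sketch, assuming roughly 150 spoken words per minute and ~1.3 tokens per word (both rough figures, not taken from the benchmark itself):

```python
# Back-of-envelope real-time factor (RTF) for conversational speech.
# RTF = tokens of speech per second / tokens generated per second;
# anything well under 1.0 can keep up with a voice conversation.

def real_time_factor(tokens_per_sec: float,
                     words_per_min: float = 150.0,
                     tokens_per_word: float = 1.3) -> float:
    """Seconds of generation needed per second of speech (lower is better)."""
    speech_tokens_per_sec = words_per_min / 60.0 * tokens_per_word
    return speech_tokens_per_sec / tokens_per_sec

for backend, tps in [("mlx", 111.6), ("ollama", 107.9), ("llama.cpp", 94.0)]:
    rtf = real_time_factor(tps)
    print(f"{backend:10s} RTF={rtf:.3f}  margin={1 / rtf:.0f}x real-time")
```

With these assumptions, every backend in the table lands around RTF 0.03, i.e. a 25-35x speed margin over speech.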

## The Problem

You want an AI that sounds like you, not a generic chatbot. And you want it fast enough for real-time voice conversation. Until now, that required cloud APIs and sending your data to someone else's servers.

## The Solution

```text
Your Messages --> Fine-Tune --> Serve Locally --> Real-Time Voice
  (iMessage)       (5 min)      (111 tok/s)       (your style)
  (Facebook)
  (WhatsApp)
```

gemma-realtime gives you the complete pipeline:

1. **Extract** your conversations from iMessage, Facebook, or any messaging platform
2. **Fine-tune** Gemma 4 with LoRA in minutes on Apple Silicon
3. **Serve** at real-time speeds through your choice of optimized backend
4. **Benchmark** to prove it meets voice latency targets
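For illustration, here is one plausible shape for a training pair in the JSONL data produced by the extraction and preparation steps (the actual schema emitted by prepare-training-data.py may differ):

```python
import json

# Hypothetical chat pair in train.jsonl: your side of the conversation
# becomes the assistant turn the model learns to imitate.
pair = {
    "messages": [
        {"role": "user", "content": "you around this weekend?"},
        {"role": "assistant", "content": "yeah! sat works, coffee?"},
    ]
}

line = json.dumps(pair)  # one JSON object per line (JSONL)
print(line)
```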

## Quick Start

```bash
# Install
pip install mlx mlx-lm

# Extract your data (pick one or both)
python3 scripts/extract_imessage_pairs.py
python3 scripts/extract-facebook.py --export ~/Downloads/facebook-export

# Prepare training data
python3 scripts/prepare-training-data.py --voice

# Fine-tune (5-15 min on M4 Max)
python3 scripts/finetune-gemma.py --target e4b --data data/finetune

# Serve with real-time optimizations
python3 scripts/mlx-server.py \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --adapter-path ~/.human/adapters/persona \
  --realtime

# Prove it works
python3 scripts/voice-bench.py
```
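The two numbers voice-bench.py reports, TTFT and TPS, can be computed from any token stream. A minimal sketch of the idea (not the script's actual internals):

```python
import time

def measure(stream):
    """Return (ttft_seconds, tokens_per_second) for an iterable of tokens.

    TTFT is the wall-clock delay until the first token arrives; TPS is
    the total token count divided by total elapsed time.
    """
    t0 = time.perf_counter()
    ttft = None
    n = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0
        n += 1
    total = time.perf_counter() - t0
    tps = n / total if total > 0 else float("inf")
    return ttft, tps
```

In practice the stream would be the server's streamed completion chunks; P50/P95 come from repeating this over many prompts.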

## h-uman Integration

The MLX server is the default inference backend for h-uman. When ~/.human/config.json exists, the server auto-reads model, adapter, port, and TurboQuant+ settings; no flags needed:

```bash
# Start via h-uman (auto-detects gemma-realtime)
~/.human/bin/human-serve.sh start

# Or run mlx-server.py directly (reads ~/.human/config.json)
python3 scripts/mlx-server.py
```

See the [Real-Time Serving guide](guides/04-real-time-serving.md) for config details.
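The auto-detection amounts to overlaying defaults with whatever ~/.human/config.json provides. A sketch of that pattern (the key names here are illustrative, not the real schema):

```python
import json
import os

# Illustrative defaults; the real mlx-server.py keys and flags may differ.
DEFAULTS = {
    "model": "mlx-community/gemma-4-e4b-it-4bit",
    "adapter_path": None,
    "port": 8080,
    "kv_bits": None,
}

def load_human_config(path: str = "~/.human/config.json") -> dict:
    """Return defaults overlaid with the config file when it exists."""
    cfg = dict(DEFAULTS)
    expanded = os.path.expanduser(path)
    if os.path.exists(expanded):
        with open(expanded) as f:
            cfg.update(json.load(f))  # file values win over defaults
    return cfg
```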

## Architecture

```text
                 ┌──────────────────────────┐
                 │    Your Conversations    │
                 │ iMessage · Facebook · …  │
                 └────────────┬─────────────┘
                              │
                      extract & prepare
                              │
                              ▼
                 ┌──────────────────────────┐
                 │      Training Data       │
                 │   train.jsonl (JSONL)    │
                 └────────────┬─────────────┘
                              │
                 LoRA fine-tune (SFT + DPO)
                              │
            ┌─────────────────┼──────────────────┐
            ▼                 ▼                  ▼
     ┌─────────────┐   ┌─────────────┐    ┌──────────────┐
     │  E4B (4B)   │   │  E2B (2B)   │    │  31B (dense) │
     │  110 tok/s  │   │  180 tok/s  │    │  20 tok/s    │
     │  Voice      │   │  Draft      │    │  Quality     │
     └──────┬──────┘   └──────┬──────┘    └──────┬───────┘
            │                 │                  │
            └────────┬────────┘                  │
                     ▼                           ▼
              ┌──────────────┐            ┌──────────────┐
              │  MLX Server  │            │  MLX Server  │
              │  Ollama      │            │  (high qual) │
              │  llama.cpp   │            └──────────────┘
              │  vLLM Metal  │
              └──────────────┘
               Real-Time Voice
```

## Model Targets

| Target | Params | Speed | Use Case |
|--------|--------|-------|----------|
| E4B | 4B | ~110 tok/s | Real-time voice, daily driver |
| E2B | 2B | ~180 tok/s | Speculative-decode draft model |
| 31B | 31B | ~20 tok/s | Highest quality, complex reasoning |
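The E2B row exists to serve as the draft model for speculative decoding. A toy, greedy-acceptance sketch of the idea (real implementations verify a whole draft batch against the target model's distribution in one forward pass; `draft` and `target` here are hypothetical next-token callables):

```python
def speculative_step(draft, target, ctx, k=4):
    """One speculative-decoding step: draft k tokens cheaply with the
    small model, keep them while the big model agrees, and take the big
    model's token at the first disagreement."""
    proposed = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))

    accepted = []
    for tok in proposed:
        expected = target(ctx + accepted)
        if expected == tok:
            accepted.append(tok)       # free token, verified by the target
        else:
            accepted.append(expected)  # target's correction; stop here
            break
    return accepted

# With a draft that always agrees, all k tokens are accepted per step:
same = lambda ctx: len(ctx) % 7
print(speculative_step(same, same, [1, 2, 3], k=4))  # → [3, 4, 5, 6]
```

When draft and target usually agree, each target-model pass yields several tokens instead of one, which is why a fast E2B draft can speed up the larger models.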

## Serving Backends

| Backend | Language | Key Advantage | Best For |
|---------|----------|---------------|----------|
| MLX Server | Python | Highest throughput, LoRA hot-swap | Development |
| Ollama | Go | No GIL, most consistent latency | Production |
| llama.cpp | C++ | Lowest TTFT, fused Metal kernels | First-word speed |
| vLLM Metal | Python | Paged attention, continuous batching | Multi-user |
| ANE+GPU | Python | Dual-compute speculative decode | Experimental |

## Key Discoveries

Things we learned making Gemma 4 real-time on Apple Silicon:

### 1. mlx_lm vs mlx_vlm: a 10x speedup

Switching from the multimodal mlx_vlm import to text-only mlx_lm gave a 10x speedup (13 → 110+ tok/s). The VLM path adds NumPy synchronization overhead even for text-only inference.

### 2. PLE-safe quantization is critical

Gemma 4 uses ScaledLinear layers that most community quantizations corrupt. Only PLE-safe quants, which skip these layers, produce correct output. The scripts detect and warn about broken models automatically.

### 3. Go eliminates the Python GIL bottleneck

Ollama wraps llama.cpp in a Go server, so there is no GIL serialization on token dispatch. The result: the most consistent P50-to-P95 latency spread of any backend.

### 4. Fused Metal kernels matter for TTFT

llama.cpp's fused RoPE+attention shaders give the fastest time-to-first-token. For voice UX, the first word is what the user feels.
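Why TTFT dominates: in a voice loop it sits between the user finishing speaking and the first audible word. An illustrative latency budget (the STT and TTS figures are assumptions for illustration; only the TTFT number comes from the benchmark above):

```python
# Illustrative end-to-end voice latency budget, in milliseconds.
# A commonly cited target is keeping the total under ~500 ms so the
# reply feels conversational.
budget_ms = {
    "speech-to-text (final chunk)": 150,  # assumed
    "LLM time-to-first-token":      136,  # llama.cpp P50 from the benchmark
    "TTS first audio chunk":        120,  # assumed
}
total = sum(budget_ms.values())
print(f"first audible word after ~{total} ms")  # → ~406 ms
```

Generation speed (TPS) barely appears in this budget; once the first word is out, all three backends stream faster than speech playback.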

### 5. TurboQuant+ KV cache compression

TurboQuant+ compresses the KV cache 3.8-6.4x using PolarQuant plus a Walsh-Hadamard rotation. It is integrated via the MLX port: `TurboKVCache` is a drop-in replacement for mlx-lm's `KVCache` with zero framework changes. The `--realtime` flag auto-enables 4-bit TurboQuant+; add `--kv-asymmetric` for FP16 keys (best quality, less compression).

| Config | Compression | Quality (vs FP16) | Decode Speed |
|--------|-------------|-------------------|--------------|
| turbo4 (4-bit) | 3.8x | +0.23% PPL | 97-100% baseline |
| turbo3 (3-bit) | 4.6x | +1.06% PPL | 90-93% baseline |
| asymmetric (K=FP16, V=turbo4) | ~2x | +0.51% PPL | 99% baseline |

```bash
# Install TurboQuant+ MLX fork
pip install git+https://github.com/TheTom/mlx.git@feature/turboquant-plus

# Real-time mode auto-enables TurboQuant+ 4-bit
python3 scripts/mlx-server.py --model mlx-community/gemma-4-e4b-it-4bit --realtime

# Asymmetric mode (best for Q4_K_M models — preserves K precision)
python3 scripts/mlx-server.py --model mlx-community/gemma-4-e4b-it-4bit --kv-bits 4 --kv-asymmetric
```
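The 3.8x figure is consistent with a simple grouped-quantization model, assuming 4-bit values with one FP16 scale per group of 64 (the actual TurboQuant+ layout, including its rotation metadata, may differ):

```python
def kv_compression_ratio(bits: int, group_size: int = 64,
                         scale_bits: int = 16) -> float:
    """Compression vs an FP16 KV cache for grouped quantization:
    each value costs `bits` plus an amortized share of the group scale."""
    bits_per_value = bits + scale_bits / group_size
    return 16 / bits_per_value

print(f"4-bit: {kv_compression_ratio(4):.2f}x")  # ~3.76x, close to the 3.8x above
```

The asymmetric config keeps keys at full FP16, which is why its overall ratio drops to roughly 2x while preserving attention-score precision.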

### 6. The AMX/SME2 coprocessor is 77x faster than NEON

Apple Silicon has an undocumented matrix coprocessor (AMX on M1-M3, SME2 on M4+) that Accelerate's BLAS uses internally. Direct benchmarking shows 2.5 TFLOPS FP32: every matmul in every transformer layer can run at GPU-like speeds on the CPU.
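You can observe the coprocessor indirectly from Python: on macOS, NumPy's matmul dispatches to Accelerate's BLAS, which uses AMX/SME internally. A quick timing sketch (numbers vary by machine, and this measures whatever BLAS your NumPy is linked against, not the coprocessor directly):

```python
import time
import numpy as np

n = 512
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    c = a @ b  # dispatches to BLAS sgemm (Accelerate on macOS)
dt = time.perf_counter() - t0

# A dense n x n matmul costs ~2*n^3 floating-point operations.
gflops = reps * 2 * n**3 / dt / 1e9
print(f"{gflops:.1f} GFLOPS FP32")
```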

### 7. IOSurface enables zero-copy KV cache sharing

Apple's IOSurface lets the CPU, GPU, and ANE access the same physical memory with no memcpy. The hybrid pipeline uses this for a zero-copy KV cache: the GPU writes during prefill and the ANE reads during decode. Measured effective bandwidth: 5+ TB/s.

## Secret APIs: The Hidden Performance Stack

We reverse-engineered, built, and benchmarked the undocumented hardware features that make real-time LLM inference possible on Apple Silicon:

| Layer | What It Is | Result |
|-------|------------|--------|
| AMX/SME2 | Undocumented CPU matrix coprocessor | 77x over NEON, 2.5 TFLOPS FP32 |
| Neural Engine | Private `_ANEClient` API (67 classes discovered) | 15.8 TFLOPS FP16, 6.6 TFLOPS/W |
| Direct ANE | Bypass CoreML via `_ANEInMemoryModelDescriptor` | In-memory MIL compilation, training proven |
| IOSurface | Zero-copy shared memory across CPU/GPU/ANE | 5+ TB/s effective bandwidth |
| Metal 4 Tensor | MTLTensor + Shader ML + ML Command Encoder | Full CoreML on the GPU timeline |
| M5 Neural Accel | Per-GPU-core Neural Accelerators (10-40 units) | 4x peak AI compute vs M4 |
| Metal Dynamic | MTLFunctionConstant kernel specialization | Fused attention for all Gemma configs |
| Hybrid Pipeline | GPU prefill + ANE decode + zero-copy KV cache | 1,333 tok/s, 53x real-time margin |

```bash
# Build and run all 8 secret API benchmarks
cd secret-apis && make all && make bench
```

See [Guide 06: Secret APIs](guides/06-secret-apis.md) for the full deep dive, including the maderix/ANE reverse-engineering work that proved training on the Neural Engine is possible.

## Project Structure

```text
scripts/
├── extract_imessage_pairs.py    # Extract iMessage conversations (macOS)
├── extract-facebook.py          # Extract Facebook Messenger data
├── prepare-training-data.py     # Combine sources → train/valid splits
├── finetune-gemma.py            # LoRA pipeline (SFT + DPO + quantize)
├── mlx-server.py                # MLX inference server (OpenAI-compatible)
├── ollama-serve.sh              # Ollama serving script
├── llamacpp-serve.sh            # llama.cpp Metal server
├── vllm-metal-serve.sh          # vLLM Metal server
├── ane-gpu-bridge.py            # ANE+GPU dual-compute bridge
├── voice-bench.py               # Single-backend voice benchmark
└── bench-all-backends.py        # Head-to-head comparison

secret-apis/
├── amx_matmul.c                 # AMX/SME2 coprocessor benchmark
├── amx.h                        # Reverse-engineered AMX instruction encodings
├── sme2_matmul.c                # ARM SME2 detection and benchmark
├── ane_probe.m                  # Neural Engine private API discovery
├── ane_direct.m                 # Direct ANE access (maderix/ANE findings, 67 classes)
├── iosurface_bridge.m           # IOSurface zero-copy bridge + Metal compute
├── metal_dynamic.m              # Dynamic kernel compilation + fused attention
├── metal4_tensor.m              # Metal 4 Tensor APIs + M5 Neural Accelerator probe
├── hybrid_pipeline.m            # Full GPU+ANE hybrid inference pipeline
├── bench_all_secrets.sh         # Run all benchmarks with report generation
└── Makefile                     # Build system

guides/
├── 01-quickstart.md             # Running in 10 minutes
├── 02-data-preparation.md       # iMessage, Facebook, WhatsApp, custom
├── 03-fine-tuning.md            # LoRA deep dive
├── 04-real-time-serving.md      # All 5 backends explained
├── 05-benchmarking.md           # Measuring and interpreting results
└── 06-secret-apis.md            # Apple Silicon secret performance stack
```

## Hardware Requirements

| Hardware | E2B (2B) | E4B (4B) | 31B |
|----------|----------|----------|-----|
| Minimum | M1, 8GB | M1 Pro, 16GB | M2 Max, 64GB |
| Recommended | M2+, 16GB | M3 Pro+, 36GB | M4 Max, 128GB |
| Expected TPS | 150-200 | 80-120 | 15-25 |

Apple Silicon's unified memory is the key enabler: the GPU reads model weights directly from main memory with no copying.
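A rough way to see why these tiers work out, assuming ~4.5 bits per parameter for 4-bit weights including quantization scales (a common approximation; exact sizes depend on the quant layout, and KV cache plus activations add more on top):

```python
def model_weights_gib(params_billion: float,
                      bits_per_param: float = 4.5) -> float:
    """Approximate in-memory size of quantized weights, in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

for name, p in [("E2B", 2), ("E4B", 4), ("31B", 31)]:
    print(f"{name}: ~{model_weights_gib(p):.1f} GiB of weights")
```

Under these assumptions the E4B weights fit in ~2 GiB, which is why a 16GB M1 Pro clears the minimum bar, while the 31B model's ~16 GiB of weights pushes the requirement to a 64GB machine once cache and OS overhead are included.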

## Guides

| Guide | Description |
|-------|-------------|
| [Quick Start](guides/01-quickstart.md) | Get running in 10 minutes |
| [Data Preparation](guides/02-data-preparation.md) | Extract from iMessage, Facebook, WhatsApp |
| [Fine-Tuning](guides/03-fine-tuning.md) | LoRA hyperparameters, DPO, quantization |
| [Real-Time Serving](guides/04-real-time-serving.md) | Backend setup and optimization |
| [Benchmarking](guides/05-benchmarking.md) | Measure TTFT, TPS, RTF |

## Privacy

Everything runs locally. Your conversation data, training process, and inference all happen on your Mac:

- Extracted JSONL files contain your messages; treat them as sensitive
- The LoRA adapter encodes your communication style; keep it private
- No network calls during extraction, training, or inference
- The .gitignore excludes all data and model files by default

## Contributing

Contributions welcome. See CONTRIBUTING.md for guidelines.

Areas where help is most needed:

- More data extractors (Telegram, Discord, Signal, WhatsApp native)
- CoreML/ANE optimization for the draft model
- Windows/Linux support (currently macOS-focused)
- Voice pipeline integration (STT → inference → TTS)

## License

MIT. See LICENSE.

Gemma models are licensed under Google's Gemma Terms of Use.
