Skip to content

scasella/gemma4-m4-pro

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gemma 4 26B A4B on a 24 GB M4 Pro

28.50 → 57.01 tok/s on tuned Hypura · 19.11 tok/s on the low-memory Flash-MoE resident server · 4.32 GB warm memory footprint for the lightweight path

Release Readiness

This repo started as a local bring-up effort for running gemma-4-26B-A4B-it on a 24 GB M4 Pro MacBook Pro. The main result is two validated winners from that research: a tuned Hypura path for raw speed and a resident Flash-MoE path for lower memory pressure.

This public repo keeps the research outcome, the curated measurements, and the front-door commands without bundling the local-only model files and sidecars from the private workspace.

Performance at a glance

All numbers below were measured during the original 24 GB M4 Pro research runs with the local gemma-4-26B-A4B-it-Q4_K_M.gguf file.

Runtime How it is used Generation speed Prompt speed First answer Load time Lowest free memory Swap growth Warm resident memory
Hypura tuned resident server 57.01 tok/s 67.44 tok/s 1025.7 ms 1.97 s 9.33 GB 0.00 GB about 12.51 GB
Flash-MoE one-shot CLI fallback 13.90 tok/s 10.90 tok/s 71.9 ms 18.69 s 8.81 GB 0.00 GB n/a
Flash-MoE resident server 19.11 tok/s 22.36 tok/s 3048.2 ms 9.19 s 10.70 GB 0.00 GB about 4.32 GB

What improved during the research

The charts below separate adopted winners from exploratory-only spikes. The fast Hypura path has one one-pass peak above the adopted winner, but the saved export stays on the three-pass validated result because the follow-up reruns were less stable.

Overall frontier

flowchart LR
  A["Usable llama.cpp baseline<br/>28.50 tok/s"] --> B["Hypura server path unlocked<br/>52.06 tok/s"]
  B --> C["Early safe Hypura run<br/>54.53 tok/s"]
  C --> D["Validated Hypura winner<br/>57.01 tok/s<br/>0.00 GB swap"]
  A --> E["Validated Flash-MoE CLI alternate<br/>13.90 tok/s"]
  E --> F["Flash-MoE resident server<br/>19.11 tok/s<br/>4.32 GB warm RSS"]
Loading

Hypura optimization ladder

flowchart LR
  A["llama.cpp anchor<br/>28.50 tok/s"] --> B["Hypura server bench smoke<br/>52.06 tok/s"]
  B --> C["Memory reserve respected<br/>54.53 tok/s"]
  C --> D["Context 4096 adopted<br/>55.44 tok/s"]
  D --> E["Prompt-side worker tuning<br/>56.67 tok/s"]
  E --> F["Micro-batch 256<br/>56.95 tok/s"]
  F --> G["Adopted 3-pass winner<br/>57.01 tok/s"]
  F -. "one-pass peak only" .-> H["Exploratory peak<br/>57.94 tok/s"]
Loading
Hypura step What changed Result Why it mattered
Usable anchor Started from the first working llama.cpp baseline on this laptop. 28.50 tok/s Set the “beat this” line for every later optimization.
Hypura server path Moved the fast path onto Hypura’s resident server flow. 52.06 tok/s Unlocked the big jump in throughput immediately.
Memory reserve Made Hypura respect real headroom instead of assuming the whole machine was free. 54.53 tok/s Improved speed without giving up the comfort margin.
Context 4096 Raised context from the earlier strict setup to 4096. 55.44 tok/s Kept the faster path practical for real prompts.
Prompt tuning Settled on the stronger prompt-side worker count. 56.67 tok/s This was a larger gain than just chasing main thread count.
Micro-batch tuning Landed on ubatch 256 inside the stronger combo. 56.95 tok/s Preserved speed while staying inside the memory rules.
Adopted winner Locked the three-pass stable export at threads 10, threads_batch 14, batch 512, ubatch 256. 57.01 tok/s This is the official saved winner because it repeated cleanly.

The exploratory threads_batch=13 probe reached 57.94 tok/s, about 1.6% above the adopted winner, but it did not become the exported best because the repeat and current-state reruns were inconsistent.

Flash-MoE optimization ladder

flowchart LR
  A["First slot-bank smoke<br/>12.20 tok/s"] --> B["Bank and GPU sweep winner<br/>14.60 tok/s"]
  B --> C["Validated 3-pass CLI fallback<br/>13.90 tok/s"]
  C -. "one-pass CLI peak" .-> D["Exploratory CLI peak<br/>15.10 tok/s"]
  C --> E["Resident-server shift<br/>19.11 tok/s"]
Loading
Flash-MoE step What changed Result Why it mattered
First smoke run Proved the Gemma 4 Flash-MoE branch could answer correctly here at all. 12.20 tok/s Turned the low-memory path from an idea into a measurable backend.
Bank and GPU sweep The best one-pass sweep favored slot-bank 16 with the dense and shared path left on the CPU. 14.60 tok/s Showed that CPU dense/shared was the right direction on this Mac.
Validated CLI fallback Locked the three-pass one-shot fallback. 13.90 tok/s Gave the project a repeatable lower-pressure alternate.
Resident-server shift Kept Flash-MoE hot instead of paying full startup every prompt. 19.11 tok/s This was the change that made Flash-MoE practically useful day to day.

The resident-server shift improved the low-memory path from 13.90 to 19.11 tok/s, about a 37.5% gain, while also raising the measured free-memory floor from 8.81 GB to 10.70 GB.

The raw numbers behind these charts are summarized in autoresearch/results/readme_research_summary.json. For the full side-by-side runtime write-up, start with autoresearch/results/runtime_comparison.md.

Why this implementation beats the others

These comparisons are all from the saved measurements on this exact 24 GB M4 Pro setup.

Comparison Measured advantage Why it wins
Tuned Hypura vs usable llama.cpp baseline 2.00x generation speed, 16.38x faster load, +5.18 GB more free memory, and 0.70 GB less swap growth The final fast path is not just quicker. It is also much safer to leave running on this laptop.
Tuned Hypura vs validated Flash-MoE CLI 4.10x generation speed, 6.19x prompt speed, and 9.48x faster load Hypura is the right default when raw speed matters most.
Flash-MoE resident vs Flash-MoE CLI 1.37x generation speed, 2.05x prompt speed, 2.03x faster load, and +1.89 GB more free memory The resident-server design is why Flash-MoE became a practical low-memory alternate instead of just a one-shot fallback.
Hypura resident vs Flash-MoE resident 2.98x generation speed and 3.02x prompt speed for Hypura, but Flash-MoE uses only about 4.32 GB warm RSS versus Hypura’s 12.51 GB The project beats single-path setups because it keeps both winners: one for speed and one for memory pressure.

That is the real advantage of this implementation over the other paths in the repo. It does not force one compromise. The front-door commands can use the fastest validated runtime when the machine has room, and they can fall back to the lighter resident-server path when keeping memory free matters more.

Where to start

If you want to use the model:

If you want to understand the research state:

If you want to prepare a release or update this public repo:

How This Repo Is Organized

If you are just trying to use the model, start in autoresearch/. If you are trying to understand how the fast runtime was tuned, also read hypura-main/GEMMA4_M4_PRO.md.

Quick commands

One prompt from the project root:

./try_gemma4.sh "Tell me a short sentence about Paris."

Interactive chat from the project root:

./chat_gemma4.sh

Those wrappers default to the same automatic runtime choice as the lower-level autoresearch commands, so they pick the fast path when the machine has room and the lighter path when memory is tighter.

One prompt, automatic runtime choice:

cd autoresearch
./gemma4_answer.sh --mode auto "Tell me a short sentence about Paris."

Interactive chat:

cd autoresearch
python3 gemma4_chat.py --mode auto

Release preflight:

cd autoresearch
python3 release_readiness_check.py

Current status

Two runtime styles are available:

The benchmark and status tooling for those live under autoresearch/.

Lean public repo note

This GitHub release repo is intentionally lean.

It includes:

  • the tuned Hypura source tree used by the fast path
  • the benchmark and control scripts
  • curated benchmark summaries and the small set of run artifacts needed to support them

It does not include:

  • local model files
  • the huge Flash-MoE sidecar data
  • the full raw run-log archive
  • the optional Flash-MoE source checkout

If you want the lower-memory Flash-MoE path in this lean repo, follow SETUP_EXTERNALS.md and clone the optional runtime into the expected local path.

Releases

No releases published

Packages

 
 
 

Contributors