Welcome to the Gemma 4 Lightweight Installer repository — a universal tool for running Google DeepMind's most capable open-weight models natively on your device, in a single click.
Gemma 4 represents a massive leap in accessible AI, bridging the gap between server-grade performance and local execution. Built for advanced reasoning and agentic workflows, it delivers an unprecedented level of intelligence-per-parameter.
Why do you need this installer right now?
- Zero-Setup: No terminals, Python dependencies, or environment conflicts. Download the `.exe` and launch the UI.
- 100% Offline & Private: No cloud APIs or rate limits. Your codebase, prompts, and data never leave your machine.
- State-of-the-Art Power: Run the 26B MoE or 31B Dense models directly on your hardware, turning your PC into a local-first AI coding assistant and reasoning engine.
Our installer packages a highly optimized C++ inference engine and a sleek interface so you can focus on building, not configuring.
Gemma 4 moves beyond simple chat to handle complex logic, vision, and agentic tasks. Here is what you get locally:
- Configurable Thinking Mode: A built-in reasoning engine that allows the model to "think" step-by-step before answering complex logical or mathematical problems.
- Massive 256K Context Window: Feed it entire codebases, massive documents, or long chat histories without losing context.
- Extended Multimodality: Natively processes text and images with variable aspect ratio and resolution support, excelling at visual tasks like OCR and chart understanding.
- Agentic Workflows & Tool Use: Features native function-calling support to power highly capable autonomous agents and integrate with developer tools.
- High-Quality Code Generation: Delivers exceptional performance in offline coding and algorithmic optimization, directly on your workstation.
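The exact tool-call format the bundled interface uses is not documented here; as an illustration only, a minimal function-calling round trip often looks like the sketch below (the `get_weather` schema and the JSON shape are hypothetical, not Gemma's confirmed format):

```python
import json

# Hypothetical tool schema advertised to the model (names are illustrative).
tools = [{
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# A model with native function-calling support emits a structured call
# like this instead of free text; the application then runs the real tool.
model_output = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_output)
assert call["tool"] in {t["name"] for t in tools}
print(call["arguments"]["city"])
```

The key property is that the model's output is machine-parseable, so an agent loop can validate the tool name against the advertised schema before executing anything.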
At the core of our local execution is a custom build of llama.cpp. This guarantees:
- Maximum performance and resource utilization of modern hardware (NVIDIA CUDA, AMD ROCm, Apple Metal).
- Intelligent load distribution, automatically offloading heavy model layers to your discrete graphics card.
- Support for quantized GGUF formats for extreme compression, allowing frontier models to fit into consumer VRAM.
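As a rough sketch of how layer offloading is sized (illustrative arithmetic, not the installer's actual logic; layer count and reserve figures are assumptions):

```python
def layers_that_fit(vram_gb: float, n_layers: int, model_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes layers are roughly equal in size and reserves some VRAM
    for the KV cache and scratch buffers. Illustrative only.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# A ~17 GB quantized model with 48 layers on a 12 GB card: the layers
# that do not fit stay in system RAM and run on the CPU.
print(layers_that_fit(12, 48, 17.0))
```

This is why the VRAM figures in the hardware requirements below matter so much: every layer left in system RAM slows generation considerably.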
We provide two separate installers tailored for different workstation capabilities. Both options run 100% locally.
Gemma 4 26B (MoE): exceptionally fast tokens-per-second with advanced reasoning.
- Architecture: A 26-billion parameter Mixture-of-Experts (MoE) model that only activates 4 billion parameters during inference.
- Best for: High-throughput tasks, fast conversational agents, and rapid code generation.
- Hardware Requirements: 16-32 GB of system RAM. A discrete graphics card (e.g., NVIDIA RTX 3080/4070 or AMD RX 6800+) with at least 12-16 GB of VRAM is highly recommended.
Gemma 4 31B (Dense): maximum raw reasoning quality and foundational power.
- Architecture: A powerful 31-billion parameter dense model.
- Best for: Deep logical analysis, complex mathematical reasoning, and heavy multi-step autonomous coding tasks.
- Hardware Requirements: 32 GB of system RAM minimum. A high-end discrete graphics card with 24 GB+ of VRAM (e.g., NVIDIA RTX 3090/4090), or an Apple Silicon machine with comparable unified memory (e.g., Mac Studio), is strictly mandatory for stable operation.
- Go to the Releases section of this repository.
- Download the installer that matches your hardware:
  - `Gemma4-26B-x64.exe` for a balanced, high-speed MoE experience.
  - `Gemma4-31B-x64.exe` for maximum reasoning capabilities.
- Run the downloaded `.exe` file. (Note: the installer will automatically download the required quantized GGUF model weights during setup.)
- Launch the desktop shortcut and start building with Gemma 4!
1. Do I need an internet connection to use these models? Only during the initial installation to download the engine and model weights. Once installed, Gemma 4 runs 100% offline. Your data is completely private.
2. What is the difference between the 26B MoE and 31B Dense versions? The 26B MoE (Mixture of Experts) is optimized for latency; it holds 26 billion parameters in memory but only uses 4 billion for each word it generates, making it incredibly fast. The 31B is a "Dense" model that uses all 31 billion parameters for every word, providing deeper reasoning at the cost of slower generation speeds and higher memory requirements.
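The speed difference comes from sparse activation: a small router network scores the available experts for each token and only the top-scoring few actually run, so most of the 26 billion parameters sit idle on any given step. A toy top-k gating sketch (expert names and scores are invented for illustration; the real Gemma 4 router internals are not documented here):

```python
def route(scores: dict[str, float], k: int = 2) -> list[str]:
    """Pick the top-k experts by router score; only these run for this token."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Eight experts, but only two are activated per token. A dense model,
# by contrast, runs every weight for every token.
expert_scores = {"e0": 0.1, "e1": 0.7, "e2": 0.05, "e3": 0.9,
                 "e4": 0.2, "e5": 0.3, "e6": 0.15, "e7": 0.05}
print(route(expert_scores))
```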
3. Is a powerful graphics card (GPU) mandatory?
Yes, absolutely. Because we are deploying the heavy 26B and 31B enterprise-grade models, a high-end or upper-mid-range discrete GPU is strictly required to offload the llama.cpp computations to VRAM. Integrated graphics (iGPU) will not work.
4. Can I use vision features (uploading images) in this local installer? Yes! Both the 26B and 31B versions support native multimodal processing. You can upload images into the local chat interface for OCR, analysis, and visual reasoning.
5. How much hard drive space do I need? The quantized GGUF weights for the 26B model require approximately 16-18 GB of storage, while the 31B model requires about 19-21 GB. We highly recommend installing them on a fast NVMe SSD.
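These sizes are consistent with simple back-of-the-envelope arithmetic: a typical 4-bit GGUF quantization costs roughly 4.5-5 bits per weight once scales and metadata are included (an assumed figure, not a published spec for these files):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate on-disk size of a quantized GGUF file in GB.

    bits_per_weight is an assumed effective rate for a 4-bit quant,
    including quantization scales and file metadata.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(gguf_size_gb(26), 1))  # roughly in the stated 16-18 GB range
print(round(gguf_size_gb(31), 1))  # roughly in the stated 19-21 GB range
```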
This installer project is distributed under the MIT License. You are free to use, modify, and distribute the installer software.
Note: The Gemma 4 model architectures, weights, and brand names belong to Google DeepMind and are released under the open Apache 2.0 License. By downloading the models via this installer, you agree to Google's usage terms. See the `LICENSE` file for more details.