Python/Mojo interface for Google Gemma 3.
- Embeddings: Dense vector embeddings via a pure Mojo backend.
- Text generation: Synchronous and async streaming with configurable sampling.
- Multimodal vision: Native support for Gemma 3 Vision models with zero-copy image processing.
- Google Cloud Storage: Automatic model download from Google's `gemma-data` bucket.
- OpenTelemetry: Optional tracing instrumentation.
Recommended for most users:

```shell
pip install 'mogemma[llm]'
```

This enables the text generation and embedding examples shown below.

For multimodal generation with automatic image decoding from `str`, `Path`, or raw bytes inputs:

```shell
pip install 'mogemma[vision]'
```

Base package only:

```shell
pip install mogemma
```

Use the base package if you're already preparing tokens or image arrays yourself. The default getting-started path is `mogemma[llm]`.
```python
from mogemma import SyncGemmaModel

model = SyncGemmaModel()
print(model.generate("Write a haiku about a robot discovering coffee:"))
```

MoGemma supports Gemma 3 multimodal vision models.
- Install `mogemma[vision]` to pass image file paths or raw image bytes directly.
```python
from mogemma import SyncGemmaModel

# Initialize a vision-capable model
model = SyncGemmaModel("gemma3-4b-it")
response = model.generate("Describe this image in detail:", images=["input.jpg"])
print(response)
```

```python
import asyncio

from mogemma import AsyncGemmaModel

async def main():
    model = AsyncGemmaModel()
    async for token in model.generate_stream("Once upon a time"):
        print(token, end="", flush=True)

asyncio.run(main())
```

Generate dense vector embeddings natively through Mojo's optimized batched kernel operations. Pass a single string or a list of strings to process them in parallel.
```python
from mogemma import EmbeddingModel

model = EmbeddingModel()
embeddings = model.embed(["Hello, world!", "Mojo runs Gemma inference."])
print(embeddings.shape)  # (2, 768)
```

All model classes default to `gemma3-270m-it`. Pass a model ID to use a different variant:

```python
model = SyncGemmaModel("gemma3-1b-it")
```

For full control over sampling parameters, pass a `GenerationConfig`:
```python
from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(model_path="gemma3-1b-it", temperature=0.7)
model = SyncGemmaModel(config)
```

`GenerationConfig` and `EmbeddingConfig` accept:
- `device="cpu"`
- `device="gpu"`
- `device="gpu:0"` (or another index)
Device handling is deterministic:

- `device="cpu"` always runs on CPU
- explicit GPU requests never silently fall back to CPU
- unavailable GPU requests raise an explicit error
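That contract can be sketched in a few lines. The `resolve_device` helper below is a hypothetical illustration of the rules, not MoGemma's actual implementation:

```python
def resolve_device(spec: str, gpu_available: bool) -> str:
    """Illustrate deterministic device resolution: no silent fallback."""
    if spec == "cpu":
        return "cpu"  # "cpu" is always honored
    if spec == "gpu" or (spec.startswith("gpu:") and spec[4:].isdigit()):
        if not gpu_available:
            # An explicit GPU request fails loudly rather than degrading to CPU.
            raise RuntimeError(f"requested {spec!r}, but no GPU is available")
        return spec
    raise ValueError(f"unrecognized device spec: {spec!r}")
```

For example, `resolve_device("cpu", gpu_available=False)` returns `"cpu"`, while `resolve_device("gpu:0", gpu_available=False)` raises instead of quietly running on CPU.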
Current runtime status:

- `cpu` and `gpu` are executable backends today
- `gpu`/`gpu:N` execute via a mathematically verified runtime polyfill
```python
from mogemma import EmbeddingConfig, EmbeddingModel, GenerationConfig, SyncGemmaModel

generation = SyncGemmaModel(
    GenerationConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)
embeddings = EmbeddingModel(
    EmbeddingConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)
```

MoGemma leverages the latest Mojo features for maximum performance.
- Mojo Nightly: version `0.26.3.0.dev` or later is required for building from source.
- Python: 3.10+
MoGemma automatically optimizes its Mojo core for your specific CPU architecture during the build process.
- x86_64: uses `--target-cpu x86-64-v3` for optimized vector instructions.
- aarch64: uses native ARM optimizations.
To build the Mojo extension locally:

```shell
make build
```

License: MIT