Run Google's Gemma 4 entirely in your browser.
No API keys. No server. No data leaving your machine.
Onyx is a demo website and Python toolkit for running Google's Gemma 4 models directly in your browser using WebGPU. Everything runs locally on your device.
- Multimodal Chat - text, images, and audio, all processed in-browser
- E2B vs E4B Arena - same prompt, two models, side-by-side speed and quality comparison
- Conversion Toolkit - Python scripts to convert, validate, and benchmark Gemma 4 ONNX models
The demo site uses Transformers.js with WebGPU acceleration to run Gemma 4 E2B (2.3B params, ~3.2 GB total with all encoders) and E4B (~5 GB) in a Web Worker. Models are quantized to 4-bit (q4f16) ONNX format and cached locally after first download.
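As a rough sanity check on the sizes quoted above, the weight footprint of a quantized model can be estimated from its parameter count and bits per weight. This is only a back-of-the-envelope sketch: `estimate_weight_gb` is a hypothetical helper, not part of Onyx, and it ignores quantization scales, any tensors kept at higher precision, and the image and audio encoders, so it is expected to land below the ~3.2 GB total.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Lower-bound weight size in GB (10^9 bytes) for a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9


# 2.3B is the E2B parameter count quoted above; bit widths are nominal.
E2B_PARAMS = 2.3e9

for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
    print(f"{name}: ~{estimate_weight_gb(E2B_PARAMS, bits):.2f} GB")
```

The gap between the 4-bit estimate and the real download size is a reminder that parameter count alone does not determine what the browser has to fetch and cache.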
- Chrome 113+ or Edge 113+ with WebGPU enabled
- 4 GB GPU memory for E2B, 8 GB for E4B
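The memory requirements above amount to a simple selection rule. A minimal sketch, assuming you already know roughly how much GPU memory is available (how that figure is obtained is left open, and `pick_model` is a hypothetical helper rather than anything the demo ships):

```python
from typing import Optional

# Minimum GPU memory per model in GB, taken from the requirements above.
MIN_GPU_GB = {"E2B": 4.0, "E4B": 8.0}


def pick_model(gpu_memory_gb: float) -> Optional[str]:
    """Return the largest model whose requirement fits, or None if neither does."""
    fitting = [name for name, need in MIN_GPU_GB.items() if gpu_memory_gb >= need]
    return max(fitting, key=MIN_GPU_GB.get) if fitting else None


for available in (2, 6, 16):
    print(f"{available} GB -> {pick_model(available)}")
```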
cd web
npm install
npm run dev
Open http://localhost:5173.
- / - Landing page with WebGPU compatibility check
- /playground - Multimodal chat with model selection (E2B / E4B)
- /arena - Side-by-side sequential race comparing E2B vs E4B
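The sequential race on the arena page can be pictured as timing each model in turn on the same prompt and reporting tokens per second. The sketch below is an illustration in Python, not the demo's actual code (which runs in the browser): `tokens_per_second`, `race`, and `fake_model` are hypothetical names, and the stand-in generators merely sleep between tokens. Running the models one after another, rather than in parallel, keeps them from competing for the same GPU, which is presumably why the arena is sequential.

```python
import time
from typing import Callable, Dict, Iterable

Generator = Callable[[str], Iterable[str]]


def tokens_per_second(generate: Generator, prompt: str,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Consume a token stream and return throughput in tokens per second."""
    start = clock()
    n_tokens = sum(1 for _ in generate(prompt))
    elapsed = clock() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def race(models: Dict[str, Generator], prompt: str) -> Dict[str, float]:
    """Run each model one after another on the same prompt."""
    return {name: tokens_per_second(gen, prompt) for name, gen in models.items()}


def fake_model(delay_s: float) -> Generator:
    """Stand-in for a real model: yields the prompt's words with a fixed delay."""
    def generate(prompt: str):
        for word in prompt.split():
            time.sleep(delay_s)
            yield word
    return generate


results = race({"E2B": fake_model(0.001), "E4B": fake_model(0.002)},
               "same prompt for both models")
for name, tps in results.items():
    print(f"{name}: {tps:.1f} tok/s")
```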
Python scripts for converting Gemma 4 models to browser-ready ONNX format.
cd toolkit
pip install -r requirements.txt
python convert.py --model google/gemma-4-E2B-it --output output/e2b --quant q4
Options: --quant fp16, --quant q8, --quant q4
python validate.py --converted output/e2b/onnx_q4 --quick
Or compare against the original:
python validate.py --original google/gemma-4-E2B-it --converted output/e2b/onnx_q4
python benchmark.py --model google/gemma-4-E2B-it --quant-levels fp16 q8 q4

| Component | Technology |
|---|---|
| Frontend | React 19, TypeScript, Vite, Tailwind CSS 4 |
| ML Inference | Transformers.js, WebGPU, ONNX |
| Conversion | optimum-onnx, transformers, onnxruntime |

| Model | Params | Size (q4f16) | Speed (M3 Pro) |
|---|---|---|---|
| E2B | 2.3B effective | ~3.2 GB | ~5-20 tok/s |
| E4B | 4B effective | ~5 GB | ~3-15 tok/s |
Models from onnx-community/gemma-4-E2B-it-ONNX and onnx-community/gemma-4-E4B-it-ONNX.
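If you want the prebuilt ONNX files outside the browser (for example, to run them through the toolkit's validation script), one option is to pull them with `huggingface_hub`. This is a sketch under the assumption that the repositories named above are hosted on the Hugging Face Hub; `REPOS` and `fetch_model` are hypothetical names, not part of the toolkit.

```python
# Repository IDs as listed above.
REPOS = {
    "E2B": "onnx-community/gemma-4-E2B-it-ONNX",
    "E4B": "onnx-community/gemma-4-E4B-it-ONNX",
}


def fetch_model(name: str) -> str:
    """Download a snapshot of the model repo and return its local directory."""
    # Imported lazily so the mapping is usable without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPOS[name])


# Example (downloads several GB on first call):
# local_dir = fetch_model("E2B")
```

A full snapshot fetches every file in the repository; `snapshot_download` accepts an `allow_patterns` argument if you only want a subset.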
License: MIT