AI-powered Text-to-Speech desktop application with voice cloning — built on OminiX MLX
Moxin Voice is a modern, GPU-accelerated desktop TTS application built entirely in Rust. It uses the Makepad UI framework for native performance and the OminiX MLX inference stack for high-speed, Python-free speech synthesis on Apple Silicon.
The inference engine behind Moxin Voice is OminiX MLX — a comprehensive Rust-native ML inference ecosystem for Apple Silicon.
OminiX MLX provides:
- Pure Rust inference — no Python runtime required at synthesis time
- Metal GPU acceleration — optimized for M1/M2/M3/M4 chips via Apple's MLX framework
- Unified memory — zero-copy CPU/GPU data sharing
- Qwen3-TTS-MLX — the TTS engine used by Moxin Voice (9 built-in voices, 12 languages, ICL voice cloning, 2.3× real-time on M3 Max)
Moxin Voice uses OminiX MLX's dora-qwen3-tts-mlx node as its sole TTS backend. Source: node-hub/dora-qwen3-tts-mlx/
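To make the "2.3× real-time" figure concrete, here is a small worked calculation. It assumes the usual reading of the term: the engine produces about 2.3 seconds of audio per second of wall-clock time (the figure quoted above for M3 Max). The function name is illustrative and not part of the project.

```rust
/// Estimated wall-clock seconds needed to synthesize `audio_secs` of speech,
/// given a real-time factor (audio seconds produced per wall-clock second).
/// ASSUMPTION: "2.3x real-time" means 2.3 s of audio per 1 s of compute.
fn synthesis_time_secs(audio_secs: f64, realtime_factor: f64) -> f64 {
    assert!(realtime_factor > 0.0, "real-time factor must be positive");
    audio_secs / realtime_factor
}
```

Under that reading, `synthesis_time_secs(23.0, 2.3)` comes out at about 10 seconds. Actual throughput will vary with the chip, the text length, and the voice mode.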
- 🎙️ Zero-Shot Voice Cloning — Clone any voice with 5–30 seconds of audio (ICL Express mode)
- 🎵 Text-to-Speech — 9 preset voices across Chinese, English, Japanese, and Korean
- 🔮 Qwen3-TTS-MLX Backend — 2.3× real-time synthesis via OminiX MLX on Apple Silicon
- 🎤 Audio Recording — Built-in real-time recording with waveform visualization
- 🔍 ASR Integration — Automatic text transcription for cloning reference audio
- 💾 Audio Export — Save generated speech as WAV files
- 🌓 Dark Mode — Native dark theme via Makepad GPU rendering
- 🌐 Bilingual UI — Chinese and English interface
moxin-voice/
├── moxin-voice-shell/          # Application entry point (binary)
├── apps/moxin-voice/           # UI + application logic
│   └── dataflow/tts.yml        # Dora dataflow graph
├── moxin-widgets/              # Shared Makepad UI components
├── moxin-ui/                   # Application infrastructure
├── moxin-dora-bridge/          # Dora dataflow integration bridge
└── node-hub/
    ├── dora-qwen3-tts-mlx/     # ★ OminiX MLX Qwen3-TTS Rust node
    │   └── previews/           # Pre-generated voice preview WAVs
    └── dora-qwen3-asr/         # ★ OminiX MLX Qwen3-ASR Rust node
The TTS pipeline runs as a Dora dataflow: the UI sends text, the qwen-tts-node (built from dora-qwen3-tts-mlx) synthesizes audio using OminiX MLX, and the audio player receives the stream.
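The three-stage shape of that pipeline (UI, then TTS node, then audio player) can be sketched with standard-library channels. This is only an illustration of the message flow: it does not use the Dora API, and `fake_synthesize` and `run_pipeline` are hypothetical names standing in for the real node, which runs Qwen3-TTS through OminiX MLX.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the TTS node (hypothetical): returns one silent sample
/// per character so that each utterance yields a chunk of visible size.
fn fake_synthesize(text: &str) -> Vec<f32> {
    vec![0.0; text.chars().count()]
}

/// Runs UI -> TTS -> player as stages joined by channels and returns
/// the number of samples the "player" received for each utterance.
fn run_pipeline(utterances: Vec<String>) -> Vec<usize> {
    let (text_tx, text_rx) = mpsc::channel::<String>();
    let (audio_tx, audio_rx) = mpsc::channel::<Vec<f32>>();

    // TTS stage: consumes text messages and emits audio chunks.
    let tts = thread::spawn(move || {
        for text in text_rx {
            if audio_tx.send(fake_synthesize(&text)).is_err() {
                break; // the player has gone away
            }
        }
    });

    // Player stage: consumes audio chunks until the TTS stage closes its sender.
    let player = thread::spawn(move || {
        audio_rx.iter().map(|chunk| chunk.len()).collect::<Vec<usize>>()
    });

    // UI stage: sends each utterance, then closes the channel by dropping the sender.
    for utterance in utterances {
        text_tx.send(utterance).expect("TTS stage hung up");
    }
    drop(text_tx);

    tts.join().expect("TTS thread panicked");
    player.join().expect("player thread panicked")
}
```

Dropping the sender is what lets each downstream stage finish cleanly. A real dataflow runtime handles that lifecycle, along with process boundaries and typed messages, on the application's behalf.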
- macOS 14.0+ (Sonoma), Apple Silicon (M1/M2/M3/M4)
- Rust 1.82+
- Dora CLI (cargo install dora-cli)
- Python 3.8+ (for the one-time model download script; not required at runtime)
bash scripts/init_qwen3_models.sh

This downloads all three model snapshots into ~/.OminiX/models/:
| Model | Purpose |
|---|---|
| Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit | Preset voice synthesis |
| Qwen3-TTS-12Hz-1.7B-Base-8bit | ICL zero-shot voice cloning |
| Qwen3-ASR-1.7B-8bit | Voice cloning reference audio transcription |
huggingface_hub is installed automatically if not present.
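After the download, it can be useful to confirm that the snapshots are in place before launching. The sketch below lists whichever expected model directories are missing. It assumes each snapshot is stored as a subdirectory named after the model under the models root. That layout is a guess, not something the download script is confirmed to do, and `missing_models` is a hypothetical helper.

```rust
use std::path::{Path, PathBuf};

/// Model snapshot names listed in the table above.
const MODEL_NAMES: [&str; 3] = [
    "Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "Qwen3-TTS-12Hz-1.7B-Base-8bit",
    "Qwen3-ASR-1.7B-8bit",
];

/// Returns the expected snapshot directories that do not exist under `models_root`.
/// ASSUMPTION: each snapshot is a subdirectory named exactly after the model.
fn missing_models(models_root: &Path) -> Vec<PathBuf> {
    MODEL_NAMES
        .iter()
        .map(|name| models_root.join(name))
        .filter(|dir| !dir.is_dir())
        .collect()
}
```

A caller would pass the home directory joined with `.OminiX/models`. An empty result means all three directories were found under the assumed layout.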
cargo build --release

This builds all binaries, including dora-qwen3-asr (the ASR Dora node) and qwen-tts-node.
dora up
cargo run -p moxin-voice-shell

For end users receiving the distributed .app, model download and initialization happen automatically via the in-app bootstrap wizard on first launch.
9 built-in preset voices, UI names localized to Chinese or English:
| ID | Language | Character |
|---|---|---|
| vivian | zh | 薇薇安: bright, slightly edgy young female |
| serena | zh | 赛琳娜: warm, gentle young female |
| uncle_fu | zh | 傅叔: low, mellow seasoned male |
| dylan | zh | 迪伦: clear Beijing young male |
| eric | zh | 埃里克: lively Chengdu young male |
| ryan | en | Ryan: dynamic male with rhythmic drive |
| aiden | en | Aiden: sunny American male |
| ono_anna | ja | 小野安奈: playful Japanese female |
| sohee | ko | 素熙: warm Korean female |
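For tooling that needs to pick a voice by language, the table above can be mirrored as a small lookup. The IDs and language codes are copied from the table. The helper names are illustrative and are not taken from the Moxin Voice codebase.

```rust
/// Preset voices as (id, language code), copied from the table above.
static PRESET_VOICES: [(&str, &str); 9] = [
    ("vivian", "zh"),
    ("serena", "zh"),
    ("uncle_fu", "zh"),
    ("dylan", "zh"),
    ("eric", "zh"),
    ("ryan", "en"),
    ("aiden", "en"),
    ("ono_anna", "ja"),
    ("sohee", "ko"),
];

/// All preset voice IDs for a language code, in table order.
fn voices_for_language(lang: &str) -> Vec<&'static str> {
    PRESET_VOICES
        .iter()
        .filter(|voice| voice.1 == lang)
        .map(|voice| voice.0)
        .collect()
}

/// Language code for a voice ID, or `None` if the ID is not a preset.
fn language_of(voice_id: &str) -> Option<&'static str> {
    PRESET_VOICES
        .iter()
        .find(|voice| voice.0 == voice_id)
        .map(|voice| voice.1)
}
```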
Upload or record 5–30 seconds of reference audio. Moxin Voice uses Qwen3-TTS's In-Context Learning (ICL) to clone the voice in real time — no training required. ASR auto-transcription is optional; if ASR is unavailable, users can enter reference text manually.
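The 5–30 second window is easy to check before a clip is handed to the cloning step. The following sketch computes the duration of a mono clip from its sample count and sample rate, then rejects clips outside the stated range. The function names and error messages are hypothetical, and the app's own validation may differ.

```rust
/// Bounds stated above for cloning reference audio, in seconds.
const MIN_REF_SECS: f64 = 5.0;
const MAX_REF_SECS: f64 = 30.0;

/// Duration in seconds of a mono clip holding `num_samples` samples.
fn clip_duration_secs(num_samples: usize, sample_rate_hz: u32) -> f64 {
    num_samples as f64 / sample_rate_hz as f64
}

/// Returns the clip duration if it lies within the 5 to 30 second window,
/// or a human-readable error otherwise.
fn check_reference_clip(num_samples: usize, sample_rate_hz: u32) -> Result<f64, String> {
    if sample_rate_hz == 0 {
        return Err("sample rate must be non-zero".to_string());
    }
    let secs = clip_duration_secs(num_samples, sample_rate_hz);
    if secs < MIN_REF_SECS {
        Err(format!("clip is {secs:.1} s; at least {MIN_REF_SECS} s is needed"))
    } else if secs > MAX_REF_SECS {
        Err(format!("clip is {secs:.1} s; at most {MAX_REF_SECS} s is allowed"))
    } else {
        Ok(secs)
    }
}
```

For stereo input, divide the total sample count by the channel count before calling the check.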
cargo build -p moxin-voice-shell
bash scripts/build_macos_app.sh --version 0.1.0
bash scripts/build_macos_dmg.sh
bash scripts/macos_bootstrap.sh

The bootstrap script downloads the Qwen3-TTS and Qwen3-ASR models and sets up the app-private conda env (needed only for the TTS download script).
| Component | Technology |
|---|---|
| UI framework | Makepad — GPU-accelerated, pure Rust |
| TTS inference | OminiX MLX · Qwen3-TTS-MLX |
| TTS model | Qwen3-TTS (Alibaba) |
| ML runtime | Apple MLX via mlx-sys / mlx-rs (OminiX MLX) |
| Dataflow | Dora |
| Audio I/O | CPAL |
| ASR | OminiX MLX · Qwen3-ASR-MLX (Rust, Metal GPU) |
| Language | Rust 2021 edition |
Apache License 2.0 — see LICENSE.
- OminiX MLX — the core ML inference engine powering all synthesis in this project
- Qwen3-TTS — the TTS model (Alibaba)
- Makepad — GPU-accelerated UI framework
- Dora — dataflow architecture
- Apple MLX — foundation for OminiX MLX
Repository: https://github.com/moxin-org/Moxin-Voice