moxin-org/Moxin-Voice

Moxin Voice

AI-powered Text-to-Speech desktop application with voice cloning — built on OminiX MLX


Moxin Voice is a modern, GPU-accelerated desktop TTS application built entirely in Rust. It uses the Makepad UI framework for native performance and the OminiX MLX inference stack for high-speed, Python-free speech synthesis on Apple Silicon.


⚡ Powered by OminiX MLX

The inference engine behind Moxin Voice is OminiX MLX — a comprehensive Rust-native ML inference ecosystem for Apple Silicon.

OminiX MLX provides:

  • Pure Rust inference — no Python runtime required at synthesis time
  • Metal GPU acceleration — optimized for M1/M2/M3/M4 chips via Apple's MLX framework
  • Unified memory — zero-copy CPU/GPU data sharing
  • Qwen3-TTS-MLX — the TTS engine used by Moxin Voice (9 built-in voices, 12 languages, ICL voice cloning, 2.3× real-time on M3 Max)

Moxin Voice uses OminiX MLX's dora-qwen3-tts-mlx node as its sole TTS backend. Source: node-hub/dora-qwen3-tts-mlx/
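The "2.3× real-time" figure is most naturally read as seconds of audio produced per second of wall-clock synthesis time. The README does not spell out the definition, so the sketch below is an assumption about what is being measured, and `realtime_speedup` is a hypothetical helper rather than anything from OminiX MLX.

```rust
use std::time::Duration;

/// Speed relative to real time: seconds of audio produced per second of
/// wall-clock synthesis time. Returns `None` when `elapsed` is zero.
/// Hypothetical helper for illustration; not part of OminiX MLX.
fn realtime_speedup(audio: Duration, elapsed: Duration) -> Option<f64> {
    if elapsed.is_zero() {
        return None;
    }
    Some(audio.as_secs_f64() / elapsed.as_secs_f64())
}
```

Under that reading, 23 seconds of audio synthesized in 10 seconds of wall-clock time corresponds to a speedup of 2.3.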


✨ Features

  • 🎙️ Zero-Shot Voice Cloning — Clone any voice with 5–30 seconds of audio (ICL Express mode)
  • 🎵 Text-to-Speech — 9 preset voices across Chinese, English, Japanese, and Korean
  • 🔮 Qwen3-TTS-MLX Backend — 2.3× real-time synthesis via OminiX MLX on Apple Silicon
  • 🎤 Audio Recording — Built-in real-time recording with waveform visualization
  • 🔍 ASR Integration — Automatic text transcription for cloning reference audio
  • 💾 Audio Export — Save generated speech as WAV files
  • 🌓 Dark Mode — Native dark theme via Makepad GPU rendering
  • 🌐 Bilingual UI — Chinese and English interface
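
For readers curious what the WAV export feature involves at the byte level, here is a minimal sketch of writing mono 16-bit PCM with the canonical 44-byte header, using only the standard library. It is an illustration of the file format, not the app's actual export code; the function name, the mono layout, and the 16-bit sample format are assumptions.

```rust
use std::io::{self, Write};

/// Writes mono 16-bit PCM samples as a WAV stream with the canonical
/// 44-byte header. Illustrative helper; not taken from the Moxin Voice source.
fn write_wav_pcm16<W: Write>(mut out: W, sample_rate: u32, samples: &[i16]) -> io::Result<()> {
    let invalid = |msg: &'static str| io::Error::new(io::ErrorKind::InvalidInput, msg);

    let channels: u16 = 1; // assumption: mono output
    let bits_per_sample: u16 = 16;
    let block_align: u16 = channels * (bits_per_sample / 8);
    let byte_rate = sample_rate
        .checked_mul(u32::from(block_align))
        .ok_or_else(|| invalid("sample rate too large"))?;
    // WAV chunk sizes are 32-bit, so reject audio that would not fit.
    let data_len = samples
        .len()
        .checked_mul(2)
        .filter(|&n| n <= (u32::MAX - 36) as usize)
        .ok_or_else(|| invalid("audio too long for a WAV file"))? as u32;

    // RIFF container header.
    out.write_all(b"RIFF")?;
    out.write_all(&(36 + data_len).to_le_bytes())?;
    out.write_all(b"WAVE")?;
    // "fmt " chunk: 16 bytes long, format tag 1 means integer PCM.
    out.write_all(b"fmt ")?;
    out.write_all(&16u32.to_le_bytes())?;
    out.write_all(&1u16.to_le_bytes())?;
    out.write_all(&channels.to_le_bytes())?;
    out.write_all(&sample_rate.to_le_bytes())?;
    out.write_all(&byte_rate.to_le_bytes())?;
    out.write_all(&block_align.to_le_bytes())?;
    out.write_all(&bits_per_sample.to_le_bytes())?;
    // "data" chunk: samples in little-endian order.
    out.write_all(b"data")?;
    out.write_all(&data_len.to_le_bytes())?;
    for s in samples {
        out.write_all(&s.to_le_bytes())?;
    }
    Ok(())
}
```

Any `Write` target works, for example a `std::fs::File` or an in-memory `Vec<u8>`. The sample rate to pass depends on the synthesis backend and is not stated here.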

🏗️ Architecture

moxin-voice/
├── moxin-voice-shell/          # Application entry point (binary)
├── apps/moxin-voice/           # UI + application logic
│   └── dataflow/tts.yml        # Dora dataflow graph
├── moxin-widgets/              # Shared Makepad UI components
├── moxin-ui/                   # Application infrastructure
├── moxin-dora-bridge/          # Dora dataflow integration bridge
└── node-hub/
    ├── dora-qwen3-tts-mlx/     # ★ OminiX MLX Qwen3-TTS Rust node
    │   └── previews/           # Pre-generated voice preview WAVs
    └── dora-qwen3-asr/         # ★ OminiX MLX Qwen3-ASR Rust node

The TTS pipeline runs as a Dora dataflow: the UI sends text, the qwen-tts-node (built from dora-qwen3-tts-mlx) synthesizes audio using OminiX MLX, and the audio player receives the stream.
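
The three-stage flow described above (UI sends text, TTS node synthesizes, player receives audio) can be sketched with standard-library threads and channels. This is a stand-in for the shape of the dataflow only: it does not use the Dora API, and `fake_synthesize` and `run_pipeline` are hypothetical names that return silence instead of real speech.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for the TTS node: returns a buffer of silent samples whose length
/// depends on the text. Real synthesis happens in dora-qwen3-tts-mlx.
fn fake_synthesize(text: &str) -> Vec<f32> {
    vec![0.0; text.chars().count() * 100]
}

/// Runs UI -> TTS -> player as three stages and returns how many samples the
/// "player" received for each input text, in order.
fn run_pipeline(texts: Vec<String>) -> Vec<usize> {
    let (text_tx, text_rx) = mpsc::channel::<String>();
    let (audio_tx, audio_rx) = mpsc::channel::<Vec<f32>>();

    // TTS stage: consume text messages, emit audio buffers.
    let tts = thread::spawn(move || {
        for text in text_rx {
            if audio_tx.send(fake_synthesize(&text)).is_err() {
                break; // the player went away
            }
        }
        // `audio_tx` is dropped here, which closes the audio channel.
    });

    // UI stage: send every text, then close the channel by dropping the sender.
    for t in texts {
        text_tx.send(t).expect("TTS stage hung up");
    }
    drop(text_tx);

    // Player stage: receive until the TTS stage closes its output.
    let received: Vec<usize> = audio_rx.iter().map(|chunk| chunk.len()).collect();
    tts.join().expect("TTS stage panicked");
    received
}
```

In the real application the stages are separate Dora nodes wired together by dataflow/tts.yml rather than threads in one process.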


🚀 Quick Start (macOS)

Prerequisites

  • macOS 14.0+ (Sonoma), Apple Silicon (M1/M2/M3/M4)
  • Rust 1.82+
  • Dora CLI (cargo install dora-cli)
  • Python 3.8+ (for the one-time model download script; not required at runtime)

1. Download Models

bash scripts/init_qwen3_models.sh

This downloads all three model snapshots into ~/.OminiX/models/:

Model                                   Purpose
Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit    Preset voice synthesis
Qwen3-TTS-12Hz-1.7B-Base-8bit           ICL zero-shot voice cloning
Qwen3-ASR-1.7B-8bit                     Voice cloning reference audio transcription

huggingface_hub is installed automatically if not present.
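
A quick way to confirm the download succeeded is to check that each snapshot is present. The sketch below assumes each model lives in a subdirectory of ~/.OminiX/models/ named exactly after the model; the README does not state the on-disk layout, so adjust the names if the script uses a different one. `missing_models` and `default_models_root` are hypothetical helpers.

```rust
use std::path::{Path, PathBuf};

/// Model names as listed in this README.
const MODELS: [&str; 3] = [
    "Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "Qwen3-TTS-12Hz-1.7B-Base-8bit",
    "Qwen3-ASR-1.7B-8bit",
];

/// Returns the models that have no directory under `models_root`.
/// ASSUMPTION: one subdirectory per model, named after the model.
fn missing_models(models_root: &Path) -> Vec<&'static str> {
    MODELS
        .iter()
        .copied()
        .filter(|name| !models_root.join(name).is_dir())
        .collect()
}

/// Default location from this README: $HOME/.OminiX/models.
/// Returns `None` if HOME is not set.
fn default_models_root() -> Option<PathBuf> {
    std::env::var_os("HOME").map(|home| Path::new(&home).join(".OminiX").join("models"))
}
```

An empty result from `missing_models` means all three directories were found; it does not verify that the files inside them are complete.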

2. Build

cargo build --release

This builds all binaries, including dora-qwen3-asr (the ASR Dora node) and qwen-tts-node (the TTS Dora node).

3. Run

dora up
cargo run -p moxin-voice-shell

First-Time Distribution (macOS .app)

For end-users receiving the distributed .app, model download and initialization happen automatically via the in-app bootstrap wizard on first launch.


🔮 Qwen3-TTS Voice Library

Nine built-in preset voices; UI names are localized to Chinese or English:

ID        Language  Name      Character
vivian    zh        薇薇安    bright, slightly edgy young female
serena    zh        赛琳娜    warm, gentle young female
uncle_fu  zh        傅叔      low, mellow seasoned male
dylan     zh        迪伦      clear Beijing young male
eric      zh        埃里克    lively Chengdu young male
ryan      en        Ryan      dynamic male with rhythmic drive
aiden     en        Aiden     sunny American male
ono_anna  ja        小野安奈  playful Japanese female
sohee     ko        素熙      warm Korean female
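
The preset list maps naturally onto a small static lookup table. The sketch below copies the IDs, language codes, and display names from this README; the `Voice` struct and both functions are hypothetical, and whether the TTS node accepts these IDs verbatim as a speaker parameter is an assumption.

```rust
/// One preset voice, as listed in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Voice {
    id: &'static str,
    lang: &'static str,
    display_name: &'static str,
}

static VOICES: [Voice; 9] = [
    Voice { id: "vivian", lang: "zh", display_name: "薇薇安" },
    Voice { id: "serena", lang: "zh", display_name: "赛琳娜" },
    Voice { id: "uncle_fu", lang: "zh", display_name: "傅叔" },
    Voice { id: "dylan", lang: "zh", display_name: "迪伦" },
    Voice { id: "eric", lang: "zh", display_name: "埃里克" },
    Voice { id: "ryan", lang: "en", display_name: "Ryan" },
    Voice { id: "aiden", lang: "en", display_name: "Aiden" },
    Voice { id: "ono_anna", lang: "ja", display_name: "小野安奈" },
    Voice { id: "sohee", lang: "ko", display_name: "素熙" },
];

/// Looks up a preset voice by its ID.
fn find_voice(id: &str) -> Option<&'static Voice> {
    VOICES.iter().find(|v| v.id == id)
}

/// Lists the IDs of all preset voices for a language code such as "zh".
fn voices_for_lang(lang: &str) -> Vec<&'static str> {
    VOICES.iter().filter(|v| v.lang == lang).map(|v| v.id).collect()
}
```

A UI voice picker could, for instance, call `voices_for_lang("zh")` to populate the Chinese section and fall back to showing every entry when no language filter is set.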

Voice Cloning (Express Mode)

Upload or record 5–30 seconds of reference audio. Moxin Voice uses Qwen3-TTS's In-Context Learning (ICL) to clone the voice in real time — no training required. ASR auto-transcription is optional; if ASR is unavailable, users can enter reference text manually.
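
The 5 to 30 second window is easy to check before a clip is sent for cloning. Here is a minimal sketch, assuming the clip is available as a per-channel sample count plus a sample rate; `check_reference_length` and `RefAudioError` are hypothetical names, and the app's own validation may differ.

```rust
/// Bounds stated in this README: 5 to 30 seconds of reference audio.
const MIN_REF_SECS: f64 = 5.0;
const MAX_REF_SECS: f64 = 30.0;

#[derive(Debug, PartialEq)]
enum RefAudioError {
    ZeroSampleRate,
    TooShort(f64),
    TooLong(f64),
}

/// Returns the clip duration in seconds if it falls inside the allowed window.
/// `num_samples` counts samples per channel (frames), not interleaved values.
fn check_reference_length(num_samples: usize, sample_rate: u32) -> Result<f64, RefAudioError> {
    if sample_rate == 0 {
        return Err(RefAudioError::ZeroSampleRate);
    }
    let secs = num_samples as f64 / f64::from(sample_rate);
    if secs < MIN_REF_SECS {
        Err(RefAudioError::TooShort(secs))
    } else if secs > MAX_REF_SECS {
        Err(RefAudioError::TooLong(secs))
    } else {
        Ok(secs)
    }
}
```

Returning the measured duration in the error lets the UI tell the user how far off the clip is instead of only rejecting it.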


📦 Build

Development

cargo build -p moxin-voice-shell

macOS App Bundle

bash scripts/build_macos_app.sh --version 0.1.0
bash scripts/build_macos_dmg.sh

Distribution Bootstrap (user machine)

bash scripts/macos_bootstrap.sh

This downloads the Qwen3-TTS and Qwen3-ASR models and sets up the app-private conda env (needed only for the TTS download script).


🔧 Technology Stack

Component        Technology
UI framework     Makepad (GPU-accelerated, pure Rust)
TTS inference    OminiX MLX · Qwen3-TTS-MLX
TTS model        Qwen3-TTS (Alibaba)
ML runtime       Apple MLX via mlx-sys / mlx-rs (OminiX MLX)
Dataflow         Dora
Audio I/O        CPAL
ASR              OminiX MLX · Qwen3-ASR-MLX (Rust, Metal GPU)
Language         Rust 2021 edition

📝 License

Apache License 2.0 — see LICENSE.


🙏 Acknowledgments

  • OminiX MLX — the core ML inference engine powering all synthesis in this project
  • Qwen3-TTS — the TTS model (Alibaba)
  • Makepad — GPU-accelerated UI framework
  • Dora — dataflow architecture
  • Apple MLX — foundation for OminiX MLX

Repository: https://github.com/moxin-org/Moxin-Voice
