Open-source AI audiobook studio — a free, local alternative to ElevenLabs.
Powered by Qwen3-TTS models. Three voice creation modes, per-sentence granular control, LLM-powered smart character analysis, and multi-character mixed-voice generation. Runs 100% locally on your GPU — zero API costs, your data never leaves your device.
| Capability | VibeVoice | ElevenLabs |
|---|---|---|
| Per-sentence emotion/voice control | ★★★★★ | ★★ |
| Voice creation modes | ★★★★★ Preset+Clone+Design | ★★★★ Clone+Preset |
| Smart character analysis | ★★★★★ LLM auto-analysis | ★★★ Manual tagging |
| Editing tools | ★★★★ Sentence editor+Preview | ★★ Basic editing |
| Voice quality | ★★★ | ★★★★★ |
| Real-time/streaming | ★★★ | ★★★★★ |
| Language coverage | ★★★★ 10 languages | ★★★★★ 32 languages |
| Cost | ★★★★★ Completely free | ★★ $5–$99/mo |
| Privacy | ★★★★★ 100% local | ★★ Cloud processing |
Key advantages:
- Per-sentence granular control — Set voice and emotion independently for each sentence. ElevenLabs only supports global settings. Critical for audiobooks where dialogue and narration need different tones
- LLM smart analysis — Qwen3-4B automatically identifies characters and emotions in text, one-click voice assignment. ElevenLabs requires manual paragraph-by-paragraph tagging
- Voice design — Create voices from natural language descriptions (e.g., "deep husky middle-aged male voice"), no reference audio needed. ElevenLabs doesn't offer this
- Free + private — Runs on your local GPU, no per-character billing, no usage limits, data never leaves your machine
| Mode | Description | Use Case |
|---|---|---|
| Preset Speakers | 9 built-in voices + emotion instruction control | Quick generation, no assets needed |
| Voice Cloning | Clone from 3s of reference audio | Replicate a specific voice |
| Voice Design | Create voice from natural language description | Design a new voice from scratch |
- Sentence preview — Preview sentence splits before generation, edit text, adjust emotions, insert/delete, then generate
- Per-sentence voice — Each sentence can use a different voice (preset/library), interleave narration and character dialogue
- Per-sentence emotion — Each sentence gets its own emotion instruction (happy/sad/angry...)
- Per-sentence editing — Double-click to edit text, regenerate, delete, undo, single-sentence preview playback
- Inter-sentence pause — 0x–2x slider for real-time silence duration adjustment between sentences
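The inter-sentence pause slider above can be sketched as a simple merge step. This is a hypothetical illustration, not the project's actual implementation: `SAMPLE_RATE`, `BASE_PAUSE_SEC`, and `merge_sentences` are assumed names, and the real base pause length is not documented here.

```python
import numpy as np

SAMPLE_RATE = 24000   # assumed output sample rate
BASE_PAUSE_SEC = 0.3  # assumed base inter-sentence silence (hypothetical value)

def merge_sentences(clips, pause_multiplier=1.0):
    """Concatenate sentence clips with scaled silence in between.

    pause_multiplier maps to the 0x-2x UI slider: 0 removes pauses
    entirely, 2 doubles the base pause.
    """
    gap = np.zeros(int(SAMPLE_RATE * BASE_PAUSE_SEC * pause_multiplier),
                   dtype=np.float32)
    pieces = []
    for i, clip in enumerate(clips):
        if i:
            pieces.append(gap)
        pieces.append(clip)
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)

a = np.ones(1000, dtype=np.float32)
b = np.ones(2000, dtype=np.float32)
print(len(merge_sentences([a, b], 1.0)))  # 1000 + 7200 + 2000 = 10200
```

Because the gap is regenerated from the multiplier on each merge, the slider can re-render the pause length without re-synthesizing any sentence audio.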
Powered by Qwen3-4B, automatically identifies characters and emotions in text. Character panel with one-click voice assignment, analysis results auto-fill per-sentence emotion instructions.
Sidebar project tree navigation, multi-project multi-chapter organization. Project-level character-voice mapping shared across chapters. IndexedDB persistence — refresh without losing work.
Reference audio uploaded for voice cloning is automatically cleaned in three stages:
| Stage | Tool | Effect |
|---|---|---|
| Vocal extraction | Demucs (Meta) | Remove background music and noise, leaving clean vocals |
| Silence trimming | Silero VAD | Remove leading/trailing silence, compress long internal pauses |
| Text normalization | wetext | Convert numbers/dates/currency to spoken form for TTS |
All are optional dependencies — if not installed, the pipeline gracefully skips that stage.
Loudness normalization: Every generated sentence is automatically normalized to -16 LUFS (audiobook/streaming standard) via pyloudnorm. Ensures consistent volume across different speakers and voice modes — no more jarring volume jumps when mixing voices.
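To make the -16 LUFS target concrete, here is a simplified sketch of gain-based normalization. It uses plain RMS rather than the gated, K-weighted ITU-R BS.1770 measurement that pyloudnorm actually implements, and the function names are illustrative only.

```python
import numpy as np

def rms_db(samples):
    """Current level of a signal in dBFS, using plain RMS."""
    rms = np.sqrt(np.mean(samples ** 2))
    return 20 * np.log10(max(rms, 1e-9))

def normalize_to(samples, target_db=-16.0):
    """Apply a constant gain so the RMS level hits target_db.

    Simplified stand-in for pyloudnorm's LUFS normalization: real LUFS
    uses K-weighting and gating (ITU-R BS.1770), not plain RMS.
    """
    gain_db = target_db - rms_db(samples)
    return samples * (10 ** (gain_db / 20))

# Two "sentences" at very different levels end up at the same level.
quiet = 0.01 * np.sin(np.linspace(0, 100, 16000))
loud = 0.8 * np.sin(np.linspace(0, 100, 16000))
print(rms_db(normalize_to(quiet)), rms_db(normalize_to(loud)))
```

The same constant-gain idea is why normalization can run per sentence without distorting the audio: only the overall level changes, not the waveform shape.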
- Waveform visualizer — WaveSurfer.js waveform player, highlights current sentence during playback
- Subtitle export — Auto-generates SRT/VTT subtitle files for video production
- MP3 export — In-browser conversion and download
- Auto language detection — Automatically sets language based on input text
- Generation timer — Real-time elapsed time display during generation
- Keyboard shortcuts — Space=play, arrows=navigate, Enter=regenerate, Ctrl+Z=undo
- REST API — Full HTTP API for integration
- Voice prompt caching — Disk-cached voice prompts for faster subsequent generation
- Batch inference — Multiple sentences generated per model call (BATCH_SIZE=16), ~2x faster
- Edit/Result view toggle — Switch between text editing and sentence results without losing state
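The subtitle export above boils down to laying per-sentence durations end to end in SRT's `HH:MM:SS,mmm` format. A minimal sketch, with hypothetical function names (not the project's actual code):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def build_srt(sentences):
    """Build SRT text from (text, duration_seconds) pairs laid end to end."""
    out, t = [], 0.0
    for i, (text, dur) in enumerate(sentences, 1):
        out.append(f"{i}\n{srt_timestamp(t)} --> {srt_timestamp(t + dur)}\n{text}\n")
        t += dur
    return "\n".join(out)

print(build_srt([("Hello there.", 1.5), ("Welcome back.", 2.25)]))
```

VTT differs mainly in the header line and using `.` instead of `,` as the millisecond separator, so both formats can share the same duration bookkeeping.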
Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
| Speaker | Language | Gender |
|---|---|---|
| vivian | Chinese | Female |
| uncle_fu | Chinese | Male |
| aiden | English | Male |
| serena | English | Female |
| ono_anna | Japanese | Female |
| sohee | Korean | Female |
| dylan | - | Male |
| eric | - | Male |
| ryan | - | Male |
- Python 3.8+
- CUDA-compatible GPU (recommended 8GB+ VRAM), or macOS Apple Silicon (M1/M2/M3)
- PyTorch with CUDA or MPS support
macOS Users: Apple Silicon (M1/M2/M3) is supported via the MPS backend, auto-detected and running at float16 precision. Actual compatibility depends on the qwen-tts library's MPS support.
pip install -U qwen-tts fastapi uvicorn python-multipart soundfile numpy torch
Optional (recommended):
pip install wetext # Chinese text normalization (numbers/dates → spoken form)
pip install silero-vad # Silence trimming for clone reference audio
pip install demucs # Vocal extraction from noisy/music-mixed reference audio
pip install pyloudnorm # Loudness normalization (-16 LUFS) for consistent volume
pip install -U modelscope
# CustomVoice model (preset speakers)
modelscope download --model Qwen/Qwen3-TTS-1.7B-CustomVoice --local_dir ./models/Qwen3-TTS-1.7B-CustomVoice
# VoiceDesign model (voice design)
modelscope download --model Qwen/Qwen3-TTS-1.7B-VoiceDesign --local_dir ./models/Qwen3-TTS-1.7B-VoiceDesign
# Base model (voice cloning)
modelscope download --model Qwen/Qwen3-TTS-0.6B --local_dir ./models/Qwen3-TTS-0.6B
python api_server.py
Server runs at http://localhost:8001
Open http://localhost:8001 in your browser to access the web UI.
Installing Flash Attention can improve inference speed by approximately 50%.
Linux:
pip install flash-attn --no-build-isolation
Windows:
Building from source is not supported on Windows. Use pre-built wheels from kingbri1/flash-attention.
Example (Python 3.10 + PyTorch 2.9 + CUDA 12.8):
# Upgrade PyTorch first
pip install torch==2.9.0 torchaudio --index-url https://download.pytorch.org/whl/cu128
# Install pre-built flash-attn
pip install https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.9.0cxx11abiFALSE-cp310-cp310-win_amd64.whl
Verify installation: Server should no longer show Warning: flash-attn is not installed on startup.
To achieve higher performance (e.g., the official benchmark of 97ms/char):
| Solution | Expected Improvement | Notes |
|---|---|---|
| Better GPU | 2-5x | A100/H100 vs consumer GPUs |
| vLLM Deployment | 2-3x | PagedAttention + continuous batching |
| TensorRT-LLM | 2-5x | NVIDIA official inference optimization |
| FP8 Quantization | 1.5-2x | Requires H100 |
On consumer GPUs (RTX 40 series) with Flash Attention, ~1.4s/char is a reasonable expectation.
# GET request
curl "http://localhost:8001/tts?text=Hello&speaker=aiden&language=English" -o output.wav
# With emotion instruction
curl "http://localhost:8001/tts?text=Hello&speaker=aiden&language=English&instruct=say it happily" -o output.wav
Parameters:
| Parameter | Description | Default |
|---|---|---|
| text | Text to synthesize | required |
| speaker | Speaker name | vivian |
| language | Language | Chinese |
| instruct | Emotion instruction | optional |
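When calling `GET /tts` from code, values like `instruct=say it happily` contain spaces and must be URL-encoded. A small client-side sketch using the standard library (the `tts_url` helper is hypothetical; the endpoint and defaults mirror the parameter table above):

```python
from urllib.parse import urlencode

BASE = "http://localhost:8001"  # default server address from this README

def tts_url(text, speaker="vivian", language="Chinese", instruct=""):
    """Build the GET /tts request URL; defaults mirror the parameter table."""
    params = {"text": text, "speaker": speaker, "language": language}
    if instruct:
        # e.g. "say it happily" -- spaces must be encoded in the query string
        params["instruct"] = instruct
    return f"{BASE}/tts?{urlencode(params)}"

print(tts_url("Hello", speaker="aiden", language="English",
              instruct="say it happily"))
```

`urlencode` encodes spaces as `+`, which servers accept in query strings; the resulting URL can then be fetched with any HTTP client and the response body written to `output.wav`.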
curl -X POST "http://localhost:8001/clone" \
-F "audio=@reference.wav" \
-F "text=Hello world" \
-F "language=English" \
-F "ref_text=optional transcript of reference audio" \
-o output.wav
Parameters:
| Parameter | Description |
|---|---|
| audio | Reference audio file (3-10s) |
| text | Text to synthesize |
| language | Language |
| ref_text | Transcript of reference audio (optional, improves quality) |
# List saved voices
curl http://localhost:8001/voices
# Save a voice
curl -X POST "http://localhost:8001/voices/save" \
-F "name=MyVoice" \
-F "language=English" \
-F "audio=@reference.wav"
# Use saved voice
curl -X POST "http://localhost:8001/voices/{voice_id}/tts" \
-F "text=Hello world" \
-o output.wav
# Delete voice
curl -X DELETE "http://localhost:8001/voices/{voice_id}"
# Get available speakers
curl http://localhost:8001/speakers
# Get supported languages
curl http://localhost:8001/languages
├── api_server.py # FastAPI server
├── index.html # Web UI (HTML shell)
├── static/
│ ├── style.css # Styles
│ └── js/
│ ├── i18n.js # Chinese/English translations
│ ├── state.js # Global state + IndexedDB project/chapter persistence
│ ├── audio.js # Waveform player, WAV/MP3 encode/decode, audio merging
│ ├── editor.js # Sentence editor, sentence preview, mode switching
│ ├── voice.js # Voice library UI, recording, voice design
│ ├── generation.js # Generation dispatch, SSE progress, regeneration
│ ├── shortcuts.js # Keyboard shortcuts
│ └── main.js # Entry point initialization
├── test_qwen_tts.py # Test script
├── models/ # Model files (not in repo)
└── saved_voices/ # Saved cloned voices
Performance Optimization
- Batch inference: multiple sentences per model call (BATCH_SIZE=16), ~2x speedup for preset mode
- SDPA (Scaled Dot-Product Attention) enabled via attn_implementation="sdpa" for fused attention kernels
- GPU warmup: dummy forward pass after model loading prevents first-generation slowdown
- torch.inference_mode() wrapper for all inference entry points
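The batch-inference optimization amounts to grouping sentences into fixed-size chunks so each model call synthesizes several at once. A minimal chunking sketch (the `batched` helper is illustrative, not the project's actual code; only the BATCH_SIZE=16 value comes from this README):

```python
BATCH_SIZE = 16  # batch size stated in this README

def batched(sentences, size=BATCH_SIZE):
    """Yield fixed-size chunks so each model call covers several sentences."""
    for i in range(0, len(sentences), size):
        yield sentences[i:i + size]

chunks = list(batched([f"Sentence {n}." for n in range(35)]))
print([len(c) for c in chunks])  # [16, 16, 3]
```

The speedup comes from amortizing per-call overhead (kernel launches, prompt processing) across the whole chunk, which is why it mainly helps preset mode, where all sentences share one voice prompt.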
Audio Preprocessing Pipeline
- Demucs (Meta) vocal extraction: automatically removes background music/noise from clone reference audio (80MB model, GPU ~1s/10s audio)
- Silero VAD silence trimming: removes leading/trailing silence and compresses excessive internal pauses from reference audio
- wetext Chinese text normalization: converts numbers, dates, currency, percentages to spoken form before TTS (e.g., "100元" → "一百元")
- pyloudnorm loudness normalization: every generated sentence normalized to -16 LUFS, consistent volume across all speakers/modes
- All four are optional dependencies — graceful fallback if not installed
UI Improvements
- Edit/Result view toggle: switch between text editing and sentence editor without losing any state
- Progress simulation: smooth per-sentence highlighting during batch generation (asymptotic ease-out curve)
Sentence Preview Mode
- Preview sentence splits before generation, edit text, adjust emotion instructions, insert/delete sentences
- "Preview" → Edit → "Generate" three-step workflow, reducing wasted generations
Per-Sentence Emotion & Voice Configuration
- Each sentence can have its own emotion instruction (preset mode), independent of global emotion
- Each sentence can use a different voice (preset speaker / voice library), enabling mixed-voice generation
- Generation timer: real-time elapsed time display during generation (100ms refresh)
LLM Smart Character/Emotion Analysis
- Integrated Qwen3-4B model for automatic character and emotion recognition in text
- Character panel at the top of sentence editor with one-click voice assignment per character
- Analysis results auto-fill per-sentence emotion instructions
Project/Chapter Management
- Sidebar project tree navigation, multi-project multi-chapter organization
- Project-level character-voice mapping shared across chapters
- IndexedDB persistence (migrated from single-session to multi-project architecture), auto-migrates old data
Voice Design Independent Mode
- Voice design restored as independent 3rd tab (Preset | Library | Design)
- Preserves natural language expressiveness of voice descriptions, no longer downgraded to clone prompt
Other Improvements
- Paragraph boundary preservation: multi-paragraph text retains line break structure after sentence splitting
- Generation stats displayed in top-right status bar
- Keyboard shortcuts: Space=play, arrows=navigate, Enter=regenerate, P=preview, Delete=delete, Ctrl+Z=undo
Waveform Visualizer
- Integrated WaveSurfer.js, replacing the simple progress bar with an audio waveform display
- Highlights the current sentence during playback, dims already-played sentences
Sentence Editor
- Enter sentence editor view after generation, supporting per-sentence operations
- Click to select, double-click to edit text
- Per-sentence regeneration (with spinner feedback), undo support to revert to previous version
- Per-sentence deletion (with confirmation)
- Insert new sentences between existing ones (with placeholder row, spinner, disabled actions, auto-play on completion)
- Inter-sentence pause control (0x–2x slider, adjusts silence duration between sentences in real-time)
- Per-sentence preview playback (play button for individual sentence audio)
Session Persistence (IndexedDB)
- Generation results (per-sentence audio, text, subtitles, parameters) automatically saved to IndexedDB
- Auto-restores previous session on page refresh, no need to regenerate
- Preserves editing state including inter-sentence pause multiplier
Voice Design Cross-Sentence Timbre Consistency
- Multi-sentence generation uses the design model for the first sentence, then automatically switches to clone model + first-sentence prompt for subsequent sentences, ensuring consistent timbre
- Regeneration and sentence insertion also reuse the cached voice prompt for consistency
- Single-sentence text has no extra overhead, still uses pure design model
- Automatic fallback to design model when clone model is not loaded
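The model-selection strategy described above can be condensed into a small dispatch function. This is a hypothetical sketch of the logic, not the project's actual code; the function and return labels are invented for illustration:

```python
def pick_model(index, total, clone_model_loaded):
    """Choose which model synthesizes sentence `index` (0-based) of `total`.

    Per the notes above: the design model renders the first sentence;
    later sentences reuse its cached voice prompt through the clone
    model so the timbre stays consistent. Single sentences, or a
    missing clone model, fall back to the pure design model.
    """
    if total <= 1 or index == 0 or not clone_model_loaded:
        return "design"
    return "clone+cached_prompt"

print([pick_model(i, 3, True) for i in range(3)])
```

The same dispatch covers regeneration and insertion: any sentence other than the first reuses the cached prompt, so edits do not drift away from the designed timbre.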
Backend
- All 4 progress endpoints (tts/clone/design/saved_voice) return a per-sentence base64 audio array sentence_audios
- New POST /regenerate endpoint for single-sentence regeneration (preset/clone/design/saved_voice modes)
- Clone session prompt caching mechanism (clone_session_prompts) with 1-hour auto-expiry
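On the client side, a per-sentence base64 audio array decodes back to individual audio files with the standard library. A round-trip sketch (the `decode_sentence_audios` helper is hypothetical, and the assumption that each entry is WAV bytes is illustrative):

```python
import base64
import io
import wave

def decode_sentence_audios(sentence_audios):
    """Decode each base64 entry back to raw audio file bytes."""
    return [base64.b64decode(b64) for b64 in sentence_audios]

# Round-trip demo: encode a tiny silent WAV, then decode it back.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 240)  # 10 ms of silence
wav_bytes = buf.getvalue()

payload = [base64.b64encode(wav_bytes).decode("ascii")]
decoded = decode_sentence_audios(payload)
print(decoded[0] == wav_bytes)  # True
```

Keeping each sentence as a separate decoded clip is what makes per-sentence preview, regeneration, and re-merging with adjustable pauses possible without round-tripping the whole chapter.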
- Sentence-by-sentence progress display, stop generation, subtitle generation
- Voice prompt disk caching, MP3 export
- Voice design mode, auto language detection
- Preset speaker synthesis (9 voices + emotion control)
- Voice cloning (record/upload reference audio)
- Voice library management
- Multilingual support (10 languages)
- REST API, bilingual UI (Chinese/English)
| Status | Feature | Description |
|---|---|---|
| ✅ | Per-sentence voice/emotion control | Independent voice and emotion per sentence |
| ✅ | LLM smart character analysis | Qwen3-4B auto character and emotion recognition |
| ✅ | Project/chapter management | Multi-project, multi-chapter, sidebar tree |
| ✅ | Voice design independent mode | Create voices from natural language descriptions |
| 🔲 | Long text import | TXT/EPUB/Word file import |
| 🔲 | Whisper auto-transcription | Auto-transcribe reference audio text |
| 🔲 | Speed/pitch control | Per-sentence speed and pitch adjustment |
| ✅ | Audio preprocessing | Demucs vocal extraction + VAD silence trimming + text normalization |
| 🔲 | Pronunciation dictionary | Custom pronunciation for names/terms |
| 🔲 | Real-time streaming | Pending upstream SDK support |
VibeVoice uses Qwen3-TTS models. Please refer to Qwen3-TTS for model license terms.
