A production-ready streaming API for Microsoft's VibeVoice model, enabling high-quality multi-speaker text-to-speech with real-time WebSocket streaming and GPU acceleration.
- 🔄 Real-Time Streaming: WebSocket-based audio streaming with chunked delivery
- ⚡ GPU Accelerated: Optimized for NVIDIA GPUs with TF32 and cuDNN optimizations
- 🎯 Multi-Speaker TTS: Generate conversations with up to 4 distinct speakers
- 📊 Live Progress Updates: Real-time status updates during generation
- 🔄 Dual API Modes: WebSocket streaming + REST batch endpoints
- 🎵 Voice Presets: Pre-configured voice samples for immediate use
- 🐳 Docker Ready: Easy deployment with containerization support
- 💾 Optional File Saving: Save complete audio files alongside streaming
- Python 3.8+
- NVIDIA GPU with CUDA support (highly recommended)
- At least 8GB GPU memory for optimal performance
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-Streaming-API
cd VibeVoice-Streaming-API
# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .
# Install WebSocket support
uv pip install websockets

Or, using standard pip and venv:

# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-Streaming-API
cd VibeVoice-Streaming-API
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e .
pip install websockets

# Using NVIDIA PyTorch Container
sudo docker run --privileged --net=host --ipc=host \
--ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all \
--rm -it nvcr.io/nvidia/pytorch:24.07-py3
# Inside container
git clone <your-repo-url>
cd VibeVoice-Streaming-API
pip install -r requirements.txt
pip install -e .
pip install websockets

# Basic usage (defaults to port 8000)
python main.py
# Custom host and port
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
# With custom number of workers (not recommended for GPU)
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

The API will be available at:
- REST API: http://localhost:8000
- WebSocket: ws://localhost:8000/ws/generate
- Interactive Docs: http://localhost:8000/docs
| Model | Context Length | Generation Length | Hugging Face |
|---|---|---|---|
| VibeVoice-1.5B | 64K | ~90 min | microsoft/VibeVoice-1.5B |
| VibeVoice-7B-Preview | 32K | ~45 min | WestZhang/VibeVoice-Large-pt |
Real-time audio streaming with live progress updates.
Connection Flow:
- Connect to WebSocket endpoint
- Send generation request as JSON
- Receive metadata and status updates
- Stream audio chunks in real-time
- Receive completion notification
Request Message:
{
"script": "Speaker 1: Hello, how are you today?\nSpeaker 2: I'm doing great!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": false
}

Response Message Types:
- Status Update
{
"type": "status",
"message": "Processing input..."
}
- Metadata (sent once before audio)
{
"type": "metadata",
"sample_rate": 16000,
"total_samples": 160000,
"channels": 1,
"dtype": "float32"
}
- Audio Chunk (JSON + binary data)
{
"type": "audio_chunk",
"chunk_num": 0,
"total_chunks": 10,
"samples": 16000
}
Followed by binary audio data as float32 bytes.
- Completion
{
"type": "complete",
"message": "Audio generation complete",
"total_chunks": 10
}
- File Saved (if save_file=true)
{
"type": "file_saved",
"path": "api_outputs/session-id.wav",
"session_id": "123e4567-e89b-12d3-a456-426614174000"
}
- Error
{
"type": "error",
"message": "Error description"
}

POST /generate/batch
Non-streaming batch generation endpoint (backward compatibility).
Request Body:
{
"script": "Speaker 1: Hello world!\nSpeaker 2: How are you?",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": true
}

Response:
{
"file_id": "123e4567-e89b-12d3-a456-426614174000",
"status": "completed",
"download_url": "/download/123e4567-e89b-12d3-a456-426614174000"
}

Rate Limit: 5 requests per minute per IP
GET /download/{file_id}
Download a previously generated audio file.
Response: WAV audio file
GET /voices
Get all available voice presets.
Response:
{
"voices": [
"en-Alice_woman",
"en-Carter_man",
"en-Frank_man",
"en-Mary_woman_bgm",
"zh-Bowen_man",
"zh-Xinran_woman"
],
"count": 6
}

GET /health
Check API status and GPU information.
Response:
{
"status": "healthy",
"model_loaded": true,
"active_connections": 2,
"gpu_available": true,
"gpu_name": "NVIDIA GeForce RTX 4090",
"gpu_memory_allocated": "3.45 GB",
"gpu_memory_reserved": "4.12 GB",
"gpu_memory_total": "24.00 GB"
}

Your script should follow this format:
Speaker 1: First person's dialogue here.
Speaker 2: Second person's response.
Speaker 1: More dialogue from first person.
Important Notes:
- Each speaker line must start with "Speaker" followed by a number
- Speaker numbers should be consistent throughout the script
- Provide voice names in the speaker_names array, matching the speaker order (see the validation sketch after this list)
- Multi-line dialogue is automatically combined
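As a quick sanity check before sending a request, the sketch below verifies that a script follows the "Speaker N:" convention and that enough voices were supplied. The helper name, regex, and error handling are illustrative and not part of the API:

```python
# Illustrative sanity check for the script format; not part of the API itself.
import re

SPEAKER_LINE = re.compile(r"^Speaker (\d+):\s*(.+)$")

def validate_script(script, speaker_names):
    """Raise ValueError if the script uses more speakers than voices provided."""
    speakers_seen = set()
    for line in script.strip().splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            speakers_seen.add(int(match.group(1)))
        # Lines without a "Speaker N:" prefix are continuations; the API
        # combines multi-line dialogue automatically.
    if len(speakers_seen) > len(speaker_names):
        raise ValueError(
            f"Script uses {len(speakers_seen)} speakers but only "
            f"{len(speaker_names)} voices were provided"
        )

validate_script(
    "Speaker 1: Hello!\nSpeaker 2: Hi there!",
    ["en-Alice_woman", "en-Carter_man"],
)
```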
import asyncio
import websockets
import json
import numpy as np
import scipy.io.wavfile as wavfile
async def stream_audio():
uri = "ws://localhost:8000/ws/generate"
async with websockets.connect(uri) as websocket:
# Send generation request
request = {
"script": "Speaker 1: Welcome to our podcast!\nSpeaker 2: Thanks for having me!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": False
}
await websocket.send(json.dumps(request))
print("Request sent, waiting for response...")
# Receive streaming audio
audio_chunks = []
sample_rate = 16000
while True:
message = await websocket.recv()
if isinstance(message, str):
# JSON message
data = json.loads(message)
if data["type"] == "status":
print(f"Status: {data['message']}")
elif data["type"] == "metadata":
sample_rate = data["sample_rate"]
print(f"Audio format: {sample_rate}Hz, {data['channels']} channel(s)")
print(f"Total samples: {data['total_samples']}")
elif data["type"] == "audio_chunk":
print(f"Chunk {data['chunk_num'] + 1}/{data['total_chunks']}")
elif data["type"] == "complete":
print("Generation complete!")
break
elif data["type"] == "error":
print(f"Error: {data['message']}")
break
else:
# Binary audio data
chunk = np.frombuffer(message, dtype=np.float32)
audio_chunks.append(chunk)
# Combine all chunks
if audio_chunks:
full_audio = np.concatenate(audio_chunks)
# Save to file
wavfile.write("output.wav", sample_rate, full_audio)
print(f"Saved audio to output.wav ({len(full_audio)} samples)")
# Run the async function
asyncio.run(stream_audio())

const ws = new WebSocket('ws://localhost:8000/ws/generate');
const audioChunks = [];
let sampleRate = 16000;
ws.onopen = () => {
// Send generation request
ws.send(JSON.stringify({
script: "Speaker 1: Hello world!\nSpeaker 2: How are you?",
speaker_names: ["en-Alice_woman", "en-Carter_man"],
cfg_scale: 1.3,
save_file: false
}));
};
ws.onmessage = async (event) => {
if (typeof event.data === 'string') {
// JSON message
const data = JSON.parse(event.data);
if (data.type === 'status') {
console.log('Status:', data.message);
} else if (data.type === 'metadata') {
sampleRate = data.sample_rate;
console.log(`Audio: ${sampleRate}Hz, ${data.channels} channel(s)`);
} else if (data.type === 'audio_chunk') {
console.log(`Chunk ${data.chunk_num + 1}/${data.total_chunks}`);
} else if (data.type === 'complete') {
console.log('Generation complete!');
// Process audio chunks here
playAudio(audioChunks, sampleRate);
}
} else {
// Binary audio data
const arrayBuffer = await event.data.arrayBuffer();
const float32Array = new Float32Array(arrayBuffer);
audioChunks.push(float32Array);
}
};
function playAudio(chunks, sampleRate) {
// Combine chunks and play using Web Audio API
const audioContext = new AudioContext({sampleRate: sampleRate});
// Calculate total length
const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
const combinedAudio = new Float32Array(totalLength);
// Combine all chunks
let offset = 0;
for (const chunk of chunks) {
combinedAudio.set(chunk, offset);
offset += chunk.length;
}
// Create audio buffer and play
const audioBuffer = audioContext.createBuffer(1, totalLength, sampleRate);
audioBuffer.copyToChannel(combinedAudio, 0);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start();
}

import requests
import time
# Submit batch generation request
response = requests.post("http://localhost:8000/generate/batch", json={
"script": "Speaker 1: Welcome!\nSpeaker 2: Thanks!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": True
})
result = response.json()
file_id = result["file_id"]
print(f"File ID: {file_id}")
# Download the result
audio_response = requests.get(f"http://localhost:8000/download/{file_id}")
with open("output.wav", "wb") as f:
f.write(audio_response.content)
print("Audio saved as output.wav")# Check health and GPU status
curl "http://localhost:8000/health"
# Batch generation
curl -X POST "http://localhost:8000/generate/batch" \
-H "Content-Type: application/json" \
-d '{
"script": "Speaker 1: Hello!\nSpeaker 2: Hi there!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3
}'
# Download generated file
curl -o output.wav "http://localhost:8000/download/YOUR_FILE_ID"
# List available voices
curl "http://localhost:8000/voices"By default, the API uses microsoft/VibeVoice-1.5B. To use the 7B model, modify the startup function:
model_path = "WestZhang/VibeVoice-Large-pt"Adjust chunk size for different latency/bandwidth tradeoffs:
generator = StreamingAudioGenerator(model, processor, chunk_size=16000) # 1 second chunks
# chunk_size=8000 # 0.5 seconds (lower latency)
# chunk_size=32000 # 2 seconds (lower overhead)

Place your custom voice samples in the demo/voices/ directory. Supported format: WAV files.
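If your recordings are not already in a suitable shape, here is a minimal conversion sketch. It assumes 16 kHz mono WAV matches the bundled presets (inspect a file in demo/voices/ and adjust target_rate if needed); the output filename simply mimics the preset naming shown by /voices and is hypothetical:

```python
# Illustrative conversion to 16 kHz mono WAV; assumes the input is a signed-PCM
# or float WAV file. Check the bundled presets and adjust target_rate as needed.
from math import gcd

import numpy as np
import scipy.io.wavfile as wavfile
from scipy.signal import resample_poly

def prepare_voice_sample(src_path, dst_path, target_rate=16000):
    rate, data = wavfile.read(src_path)

    # Normalize integer PCM to float32 in [-1, 1]
    if np.issubdtype(data.dtype, np.integer):
        data = data.astype(np.float32) / np.iinfo(data.dtype).max
    else:
        data = data.astype(np.float32)

    # Down-mix stereo to mono by averaging channels
    if data.ndim == 2:
        data = data.mean(axis=1)

    # Resample to the target rate if necessary
    if rate != target_rate:
        g = gcd(rate, target_rate)
        data = resample_poly(data, target_rate // g, rate // g)

    wavfile.write(dst_path, target_rate, data.astype(np.float32))

prepare_voice_sample("my_recording.wav", "demo/voices/en-MyVoice_man.wav")
```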
The API automatically enables:
- TF32: Faster computation on Ampere+ GPUs (RTX 3000/4000 series)
- cuDNN benchmark: Auto-tunes for optimal performance
- bfloat16 precision: Memory-efficient on GPU
To disable these optimizations, comment out the corresponding lines in the startup function:
# torch.backends.cudnn.benchmark = True
# torch.backends.cuda.matmul.allow_tf32 = True
# torch.backends.cudnn.allow_tf32 = True

GPU memory requirements:
- Minimum: 8GB VRAM for the 1.5B model
- Recommended: 16GB+ VRAM for 7B model
- CPU inference is supported but 10-50x slower
For optimal Chinese speech generation:
- Use English punctuation (commas and periods only); a normalization sketch follows this list
- Consider using the 7B model for better stability
- Avoid special Chinese quotation marks
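Following the advice above, a minimal punctuation-normalization sketch; the replacement table is illustrative, not exhaustive:

```python
# Illustrative normalization of Chinese punctuation to plain commas and periods.
# The mapping below is a starting point, not an exhaustive table.
CN_PUNCT = {
    "，": ",", "。": ".", "！": ".", "？": ".",
    "：": ",", "；": ",", "、": ",",
    "“": "", "”": "", "‘": "", "’": "", "《": "", "》": "",
}

def normalize_chinese_punctuation(text):
    for cn, en in CN_PUNCT.items():
        text = text.replace(cn, en)
    return text

print(normalize_chinese_punctuation("Speaker 1: 你好！今天我们来聊聊新模型。"))
# Speaker 1: 你好.今天我们来聊聊新模型.
```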
The model may spontaneously generate background music:
- Voice samples with BGM increase the likelihood
- Introductory phrases ("Welcome to", "Hello") may trigger BGM
- Using "Alice" voice preset has higher BGM probability
- This is an intentional feature, not a bug
- Handle disconnections gracefully with reconnection logic (see the sketch after this list)
- Process audio chunks incrementally for real-time playback
- Consider buffering for smoother playback
- Monitor memory usage when accumulating chunks
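A minimal reconnection sketch using the websockets library with exponential backoff. Because the streaming protocol above has no resume mechanism, each retry re-submits the whole request; the retry limits and helper name are illustrative:

```python
# Illustrative reconnection loop with exponential backoff; re-submits the whole
# request on each attempt because the streaming protocol has no resume support.
import asyncio
import json

import websockets

async def generate_with_retry(request, max_attempts=5):
    uri = "ws://localhost:8000/ws/generate"
    for attempt in range(max_attempts):
        try:
            async with websockets.connect(uri) as ws:
                await ws.send(json.dumps(request))
                chunks = []
                while True:
                    message = await ws.recv()
                    if isinstance(message, bytes):
                        chunks.append(message)  # raw float32 audio bytes
                        continue
                    data = json.loads(message)
                    if data["type"] == "complete":
                        return chunks
                    if data["type"] == "error":
                        raise RuntimeError(data["message"])
        except (websockets.ConnectionClosed, OSError):
            # Back off before reconnecting: 1s, 2s, 4s, ...
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError("Gave up after repeated connection failures")
```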
┌─────────────────────────────────────────────────────────┐
│ FastAPI Application │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ WebSocket │ │ REST Endpoints │ │
│ │ /ws/generate │ │ /generate/batch │ │
│ └────────┬────────┘ └──────────┬───────────┘ │
│ │ │ │
│ └──────────┬───────────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ StreamingAudioGenerator │ │
│ │ - Chunked generation │ │
│ │ - Real-time streaming │ │
│ └────────────┬────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ VibeVoice Model │ │
│ │ (GPU Accelerated) │ │
│ │ - TF32 enabled │ │
│ │ - cuDNN optimized │ │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Key Features:
- Real-time WebSocket streaming with chunked delivery
- GPU-accelerated inference with automatic optimization
- Dual API modes: streaming and batch
- Live progress updates during generation
- Optional file persistence
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# Check GPU info
nvidia-smi

If you hit CUDA out-of-memory errors (a quick memory-check sketch follows this list):
- Reduce cfg_scale (lower values use less memory)
- Use the 1.5B model instead of the 7B model
- Close other GPU applications
- Reduce chunk size for streaming
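A quick way to inspect GPU memory pressure from a Python shell; these standard torch calls complement the /health endpoint:

```python
# Quick GPU memory check from a Python shell; complements the /health endpoint.
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    torch.cuda.empty_cache()  # release cached blocks back to the driver
```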

If WebSocket connections fail (a connectivity smoke test follows this list):
- Check firewall settings
- Ensure port 8000 is accessible
- Verify WebSocket protocol (ws:// not http://)
- Check for proxy/load balancer WebSocket support
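A minimal connectivity smoke test, assuming the same websockets package used in the client example above:

```python
# Minimal smoke test: can we open the WebSocket endpoint at all?
import asyncio

import websockets

async def smoke_test():
    async with websockets.connect("ws://localhost:8000/ws/generate") as ws:
        print("WebSocket connection established")

asyncio.run(smoke_test())
```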
This project is a streaming API wrapper around Microsoft's VibeVoice model. Please refer to the original VibeVoice repository for licensing terms and model details.
Responsible AI Usage:
- Disclose AI-generated content when sharing
- Ensure compliance with local laws and regulations
- Verify content accuracy and avoid misleading applications
- Do not use for deepfakes or disinformation
- Respect voice actor rights and consent
Technical Limitations:
- English and Chinese only
- No overlapping speech generation
- Speech synthesis only (no background noise/music control)
- Streaming latency depends on generation speed
- Not recommended for commercial use without additional testing
The model is intended for research and development purposes. Use responsibly.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.