A production-ready streaming API for Microsoft's VibeVoice model, enabling high-quality multi-speaker text-to-speech with real-time WebSocket streaming and GPU acceleration.
- 🔄 Real-Time Streaming: WebSocket-based audio streaming with chunked delivery
- ⚡ GPU Accelerated: Optimized for NVIDIA GPUs with TF32 and cuDNN optimizations
- 🎯 Multi-Speaker TTS: Generate conversations with up to 4 distinct speakers
- 📊 Live Progress Updates: Real-time status updates during generation
- 🔄 Dual API Modes: WebSocket streaming + REST batch endpoints
- 🎵 Voice Presets: Pre-configured voice samples for immediate use
- 🐳 Docker Ready: Easy deployment with containerization support
- 💾 Optional File Saving: Save complete audio files alongside streaming
- Python 3.8+
- NVIDIA GPU with CUDA support (highly recommended)
- At least 8GB GPU memory for optimal performance
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-Streaming-API
cd VibeVoice-Streaming-API
# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .
# Install WebSocket support
uv pip install websockets

Or, using standard pip and venv:

# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-Streaming-API
cd VibeVoice-Streaming-API
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e .
pip install websockets

# Using NVIDIA PyTorch Container
sudo docker run --privileged --net=host --ipc=host \
--ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all \
--rm -it nvcr.io/nvidia/pytorch:24.07-py3
# Inside container
git clone <your-repo-url>
cd VibeVoice-Streaming-API
pip install -r requirements.txt
pip install -e .
pip install websockets

# Basic usage (defaults to port 8000)
python main.py
# Custom host and port
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
# With custom number of workers (not recommended for GPU)
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

The API will be available at:
- REST API: http://localhost:8000
- WebSocket: ws://localhost:8000/ws/generate
- Interactive Docs: http://localhost:8000/docs
| Model | Context Length | Generation Length | Hugging Face |
|---|---|---|---|
| VibeVoice-1.5B | 64K | ~90 min | microsoft/VibeVoice-1.5B |
| VibeVoice-7B-Preview | 32K | ~45 min | WestZhang/VibeVoice-Large-pt |
Real-time audio streaming with live progress updates.
Connection Flow:
- Connect to WebSocket endpoint
- Send generation request as JSON
- Receive metadata and status updates
- Stream audio chunks in real-time
- Receive completion notification
Request Message:
{
"script": "Speaker 1: Hello, how are you today?\nSpeaker 2: I'm doing great!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": false
}

Response Message Types:
- Status Update
{
"type": "status",
"message": "Processing input..."
}
- Metadata (sent once before audio)
{
"type": "metadata",
"sample_rate": 16000,
"total_samples": 160000,
"channels": 1,
"dtype": "float32"
}
- Audio Chunk (JSON + binary data)
{
"type": "audio_chunk",
"chunk_num": 0,
"total_chunks": 10,
"samples": 16000
}
Followed by binary audio data as float32 bytes.
- Completion
{
"type": "complete",
"message": "Audio generation complete",
"total_chunks": 10
}
- File Saved (if save_file=true)
{
"type": "file_saved",
"path": "api_outputs/session-id.wav",
"session_id": "123e4567-e89b-12d3-a456-426614174000"
}
- Error
{
"type": "error",
"message": "Error description"
}

POST /generate/batch
Non-streaming batch generation endpoint (backward compatibility).
Request Body:
{
"script": "Speaker 1: Hello world!\nSpeaker 2: How are you?",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": true
}

Response:
{
"file_id": "123e4567-e89b-12d3-a456-426614174000",
"status": "completed",
"download_url": "/download/123e4567-e89b-12d3-a456-426614174000"
}

Rate Limit: 5 requests per minute per IP
GET /download/{file_id}
Download a previously generated audio file.
Response: WAV audio file
GET /voices
Get all available voice presets.
Response:
{
"voices": [
"en-Alice_woman",
"en-Carter_man",
"en-Frank_man",
"en-Mary_woman_bgm",
"zh-Bowen_man",
"zh-Xinran_woman"
],
"count": 6
}

GET /health
Check API status and GPU information.
Response:
{
"status": "healthy",
"model_loaded": true,
"active_connections": 2,
"gpu_available": true,
"gpu_name": "NVIDIA GeForce RTX 4090",
"gpu_memory_allocated": "3.45 GB",
"gpu_memory_reserved": "4.12 GB",
"gpu_memory_total": "24.00 GB"
}

Your script should follow this format:
Speaker 1: First person's dialogue here.
Speaker 2: Second person's response.
Speaker 1: More dialogue from first person.
Important Notes:
- Each speaker line must start with "Speaker" followed by a number
- Speaker numbers should be consistent throughout the script
- Provide voice names in the speaker_names array, matching the speaker order (see the validation sketch after this list)
- Multi-line dialogue is automatically combined
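As a quick sanity check before sending a request, the sketch below verifies that a script follows the "Speaker N:" convention and that enough voices were supplied. The helper name, regex, and error handling are illustrative and not part of the API:

```python
# Illustrative sanity check for the script format; not part of the API itself.
import re

SPEAKER_LINE = re.compile(r"^Speaker (\d+):\s*(.+)$")

def validate_script(script, speaker_names):
    """Raise ValueError if the script uses more speakers than voices provided."""
    speakers_seen = set()
    for line in script.strip().splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            speakers_seen.add(int(match.group(1)))
        # Lines without a "Speaker N:" prefix are continuations; the API
        # combines multi-line dialogue automatically.
    if len(speakers_seen) > len(speaker_names):
        raise ValueError(
            f"Script uses {len(speakers_seen)} speakers but only "
            f"{len(speaker_names)} voices were provided"
        )

validate_script(
    "Speaker 1: Hello!\nSpeaker 2: Hi there!",
    ["en-Alice_woman", "en-Carter_man"],
)
```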
import asyncio
import websockets
import json
import numpy as np
import scipy.io.wavfile as wavfile
async def stream_audio():
uri = "ws://localhost:8000/ws/generate"
async with websockets.connect(uri) as websocket:
# Send generation request
request = {
"script": "Speaker 1: Welcome to our podcast!\nSpeaker 2: Thanks for having me!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": False
}
await websocket.send(json.dumps(request))
print("Request sent, waiting for response...")
# Receive streaming audio
audio_chunks = []
sample_rate = 16000
while True:
message = await websocket.recv()
if isinstance(message, str):
# JSON message
data = json.loads(message)
if data["type"] == "status":
print(f"Status: {data['message']}")
elif data["type"] == "metadata":
sample_rate = data["sample_rate"]
print(f"Audio format: {sample_rate}Hz, {data['channels']} channel(s)")
print(f"Total samples: {data['total_samples']}")
elif data["type"] == "audio_chunk":
print(f"Chunk {data['chunk_num'] + 1}/{data['total_chunks']}")
elif data["type"] == "complete":
print("Generation complete!")
break
elif data["type"] == "error":
print(f"Error: {data['message']}")
break
else:
# Binary audio data
chunk = np.frombuffer(message, dtype=np.float32)
audio_chunks.append(chunk)
# Combine all chunks
if audio_chunks:
full_audio = np.concatenate(audio_chunks)
# Save to file
wavfile.write("output.wav", sample_rate, full_audio)
print(f"Saved audio to output.wav ({len(full_audio)} samples)")
# Run the async function
asyncio.run(stream_audio())

const ws = new WebSocket('ws://localhost:8000/ws/generate');
const audioChunks = [];
let sampleRate = 16000;
ws.onopen = () => {
// Send generation request
ws.send(JSON.stringify({
script: "Speaker 1: Hello world!\nSpeaker 2: How are you?",
speaker_names: ["en-Alice_woman", "en-Carter_man"],
cfg_scale: 1.3,
save_file: false
}));
};
ws.onmessage = async (event) => {
if (typeof event.data === 'string') {
// JSON message
const data = JSON.parse(event.data);
if (data.type === 'status') {
console.log('Status:', data.message);
} else if (data.type === 'metadata') {
sampleRate = data.sample_rate;
console.log(`Audio: ${sampleRate}Hz, ${data.channels} channel(s)`);
} else if (data.type === 'audio_chunk') {
console.log(`Chunk ${data.chunk_num + 1}/${data.total_chunks}`);
} else if (data.type === 'complete') {
console.log('Generation complete!');
// Process audio chunks here
playAudio(audioChunks, sampleRate);
}
} else {
// Binary audio data
const arrayBuffer = await event.data.arrayBuffer();
const float32Array = new Float32Array(arrayBuffer);
audioChunks.push(float32Array);
}
};
function playAudio(chunks, sampleRate) {
// Combine chunks and play using Web Audio API
const audioContext = new AudioContext({sampleRate: sampleRate});
// Calculate total length
const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
const combinedAudio = new Float32Array(totalLength);
// Combine all chunks
let offset = 0;
for (const chunk of chunks) {
combinedAudio.set(chunk, offset);
offset += chunk.length;
}
// Create audio buffer and play
const audioBuffer = audioContext.createBuffer(1, totalLength, sampleRate);
audioBuffer.copyToChannel(combinedAudio, 0);
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
source.connect(audioContext.destination);
source.start();
}

import requests
import time
# Submit batch generation request
response = requests.post("http://localhost:8000/generate/batch", json={
"script": "Speaker 1: Welcome!\nSpeaker 2: Thanks!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3,
"save_file": True
})
result = response.json()
file_id = result["file_id"]
print(f"File ID: {file_id}")
# Download the result
audio_response = requests.get(f"http://localhost:8000/download/{file_id}")
with open("output.wav", "wb") as f:
f.write(audio_response.content)
print("Audio saved as output.wav")# Check health and GPU status
curl "http://localhost:8000/health"
# Batch generation
curl -X POST "http://localhost:8000/generate/batch" \
-H "Content-Type: application/json" \
-d '{
"script": "Speaker 1: Hello!\nSpeaker 2: Hi there!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3
}'
# Download generated file
curl -o output.wav "http://localhost:8000/download/YOUR_FILE_ID"
# List available voices
curl "http://localhost:8000/voices"By default, the API uses microsoft/VibeVoice-1.5B. To use the 7B model, modify the startup function:
model_path = "WestZhang/VibeVoice-Large-pt"Adjust chunk size for different latency/bandwidth tradeoffs:
generator = StreamingAudioGenerator(model, processor, chunk_size=16000) # 1 second chunks
# chunk_size=8000 # 0.5 seconds (lower latency)
# chunk_size=32000 # 2 seconds (lower overhead)

Place your custom voice samples in the demo/voices/ directory. Supported format: WAV files.
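If your recordings are not already in a suitable shape, here is a minimal conversion sketch. It assumes 16 kHz mono WAV matches the bundled presets (inspect a file in demo/voices/ and adjust target_rate if needed); the output filename simply mimics the preset naming shown by /voices and is hypothetical:

```python
# Illustrative conversion to 16 kHz mono WAV; assumes the input is a signed-PCM
# or float WAV file. Check the bundled presets and adjust target_rate as needed.
from math import gcd

import numpy as np
import scipy.io.wavfile as wavfile
from scipy.signal import resample_poly

def prepare_voice_sample(src_path, dst_path, target_rate=16000):
    rate, data = wavfile.read(src_path)

    # Normalize integer PCM to float32 in [-1, 1]
    if np.issubdtype(data.dtype, np.integer):
        data = data.astype(np.float32) / np.iinfo(data.dtype).max
    else:
        data = data.astype(np.float32)

    # Down-mix stereo to mono by averaging channels
    if data.ndim == 2:
        data = data.mean(axis=1)

    # Resample to the target rate if necessary
    if rate != target_rate:
        g = gcd(rate, target_rate)
        data = resample_poly(data, target_rate // g, rate // g)

    wavfile.write(dst_path, target_rate, data.astype(np.float32))

prepare_voice_sample("my_recording.wav", "demo/voices/en-MyVoice_man.wav")
```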
The API automatically enables:
- TF32: Faster computation on Ampere+ GPUs (RTX 3000/4000 series)
- cuDNN benchmark: Auto-tunes for optimal performance
- bfloat16 precision: Memory-efficient on GPU
To disable these optimizations, comment out the corresponding lines in the startup function:
# torch.backends.cudnn.benchmark = True
# torch.backends.cuda.matmul.allow_tf32 = True
# torch.backends.cudnn.allow_tf32 = True

GPU memory requirements:
- Minimum: 8GB VRAM for the 1.5B model
- Recommended: 16GB+ VRAM for 7B model
- CPU inference is supported but 10-50x slower
For optimal Chinese speech generation:
- Use English punctuation (commas and periods only); a normalization sketch follows this list
- Consider using the 7B model for better stability
- Avoid special Chinese quotation marks
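Following the advice above, a minimal punctuation-normalization sketch; the replacement table is illustrative, not exhaustive:

```python
# Illustrative normalization of Chinese punctuation to plain commas and periods.
# The mapping below is a starting point, not an exhaustive table.
CN_PUNCT = {
    "，": ",", "。": ".", "！": ".", "？": ".",
    "：": ",", "；": ",", "、": ",",
    "“": "", "”": "", "‘": "", "’": "", "《": "", "》": "",
}

def normalize_chinese_punctuation(text):
    for cn, en in CN_PUNCT.items():
        text = text.replace(cn, en)
    return text

print(normalize_chinese_punctuation("Speaker 1: 你好！今天我们来聊聊新模型。"))
# Speaker 1: 你好.今天我们来聊聊新模型.
```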
The model may spontaneously generate background music:
- Voice samples with BGM increase the likelihood
- Introductory phrases ("Welcome to", "Hello") may trigger BGM
- Using "Alice" voice preset has higher BGM probability
- This is an intentional feature, not a bug
- Handle disconnections gracefully with reconnection logic (see the sketch after this list)
- Process audio chunks incrementally for real-time playback
- Consider buffering for smoother playback
- Monitor memory usage when accumulating chunks
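A minimal reconnection sketch using the websockets library with exponential backoff. Because the streaming protocol above has no resume mechanism, each retry re-submits the whole request; the retry limits and helper name are illustrative:

```python
# Illustrative reconnection loop with exponential backoff; re-submits the whole
# request on each attempt because the streaming protocol has no resume support.
import asyncio
import json

import websockets

async def generate_with_retry(request, max_attempts=5):
    uri = "ws://localhost:8000/ws/generate"
    for attempt in range(max_attempts):
        try:
            async with websockets.connect(uri) as ws:
                await ws.send(json.dumps(request))
                chunks = []
                while True:
                    message = await ws.recv()
                    if isinstance(message, bytes):
                        chunks.append(message)  # raw float32 audio bytes
                        continue
                    data = json.loads(message)
                    if data["type"] == "complete":
                        return chunks
                    if data["type"] == "error":
                        raise RuntimeError(data["message"])
        except (websockets.ConnectionClosed, OSError):
            # Back off before reconnecting: 1s, 2s, 4s, ...
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError("Gave up after repeated connection failures")
```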
┌─────────────────────────────────────────────────────────┐
│ FastAPI Application │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ WebSocket │ │ REST Endpoints │ │
│ │ /ws/generate │ │ /generate/batch │ │
│ └────────┬────────┘ └──────────┬───────────┘ │
│ │ │ │
│ └──────────┬───────────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ StreamingAudioGenerator │ │
│ │ - Chunked generation │ │
│ │ - Real-time streaming │ │
│ └────────────┬────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ VibeVoice Model │ │
│ │ (GPU Accelerated) │ │
│ │ - TF32 enabled │ │
│ │ - cuDNN optimized │ │
│ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
Key Features:
- Real-time WebSocket streaming with chunked delivery
- GPU-accelerated inference with automatic optimization
- Dual API modes: streaming and batch
- Live progress updates during generation
- Optional file persistence
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# Check GPU info
nvidia-smi

If you hit CUDA out-of-memory errors (a quick memory-check sketch follows this list):
- Reduce cfg_scale (lower values use less memory)
- Use the 1.5B model instead of the 7B model
- Close other GPU applications
- Reduce chunk size for streaming
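A quick way to inspect GPU memory pressure from a Python shell; these standard torch calls complement the /health endpoint:

```python
# Quick GPU memory check from a Python shell; complements the /health endpoint.
import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    torch.cuda.empty_cache()  # release cached blocks back to the driver
```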

If WebSocket connections fail (a connectivity smoke test follows this list):
- Check firewall settings
- Ensure port 8000 is accessible
- Verify WebSocket protocol (ws:// not http://)
- Check for proxy/load balancer WebSocket support
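A minimal connectivity smoke test, assuming the same websockets package used in the client example above:

```python
# Minimal smoke test: can we open the WebSocket endpoint at all?
import asyncio

import websockets

async def smoke_test():
    async with websockets.connect("ws://localhost:8000/ws/generate") as ws:
        print("WebSocket connection established")

asyncio.run(smoke_test())
```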
This project is a streaming API wrapper around Microsoft's VibeVoice model. Please refer to the original VibeVoice repository for licensing terms and model details.
Responsible AI Usage:
- Disclose AI-generated content when sharing
- Ensure compliance with local laws and regulations
- Verify content accuracy and avoid misleading applications
- Do not use for deepfakes or disinformation
- Respect voice actor rights and consent
Technical Limitations:
- English and Chinese only
- No overlapping speech generation
- Speech synthesis only (no background noise/music control)
- Streaming latency depends on generation speed
- Not recommended for commercial use without additional testing
The model is intended for research and development purposes. Use responsibly.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.