🎙️ VibeVoice Streaming API - Real-Time Multi-Speaker TTS


A production-ready streaming API for Microsoft's VibeVoice model, enabling high-quality multi-speaker text-to-speech with real-time WebSocket streaming and GPU acceleration.

✨ Features

  • 🔄 Real-Time Streaming: WebSocket-based audio streaming with chunked delivery
  • ⚡ GPU Accelerated: Optimized for NVIDIA GPUs with TF32 and cuDNN optimizations
  • 🎯 Multi-Speaker TTS: Generate conversations with up to 4 distinct speakers
  • 📊 Live Progress Updates: Real-time status updates during generation
  • 🔄 Dual API Modes: WebSocket streaming + REST batch endpoints
  • 🎵 Voice Presets: Pre-configured voice samples for immediate use
  • 🐳 Docker Ready: Easy deployment with containerization support
  • 💾 Optional File Saving: Save complete audio files alongside streaming

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • NVIDIA GPU with CUDA support (highly recommended)
  • At least 8GB GPU memory for optimal performance

Installation

Option 1: Standard Installation with uv (Recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-Streaming-API
cd VibeVoice-Streaming-API

# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .

# Install WebSocket support
uv pip install websockets

Option 2: Traditional pip Installation

# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-Streaming-API
cd VibeVoice-Streaming-API

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
pip install -e .
pip install websockets

Option 3: Docker Installation with GPU Support

# Using NVIDIA PyTorch Container
sudo docker run --privileged --net=host --ipc=host \
  --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all \
  --rm -it nvcr.io/nvidia/pytorch:24.07-py3

# Inside container
git clone <your-repo-url>
cd VibeVoice-Streaming-API
pip install -r requirements.txt
pip install -e .
pip install websockets

Starting the API Server

# Basic usage (defaults to port 8000)
python main.py

# Custom host and port
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# With custom number of workers (not recommended for GPU)
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

The API will be available at:

  • REST API: http://localhost:8000
  • WebSocket: ws://localhost:8000/ws/generate
  • Interactive Docs: http://localhost:8000/docs

📚 API Documentation

Available Models

Model                  Context Length   Generation Length   Hugging Face
VibeVoice-1.5B         64K              ~90 min             microsoft/VibeVoice-1.5B
VibeVoice-7B-Preview   32K              ~45 min             WestZhang/VibeVoice-Large-pt

WebSocket Streaming Endpoint

WebSocket /ws/generate

Real-time audio streaming with live progress updates.

Connection Flow:

  1. Connect to WebSocket endpoint
  2. Send generation request as JSON
  3. Receive metadata and status updates
  4. Stream audio chunks in real-time
  5. Receive completion notification

Request Message:

{
  "script": "Speaker 1: Hello, how are you today?\nSpeaker 2: I'm doing great!",
  "speaker_names": ["en-Alice_woman", "en-Carter_man"],
  "cfg_scale": 1.3,
  "save_file": false
}

Response Message Types:

  1. Status Update
{
  "type": "status",
  "message": "Processing input..."
}
  2. Metadata (sent once before audio)
{
  "type": "metadata",
  "sample_rate": 16000,
  "total_samples": 160000,
  "channels": 1,
  "dtype": "float32"
}
  3. Audio Chunk (JSON + binary data)
{
  "type": "audio_chunk",
  "chunk_num": 0,
  "total_chunks": 10,
  "samples": 16000
}

Each audio_chunk message is followed by a binary frame containing the raw float32 audio samples.

  4. Completion
{
  "type": "complete",
  "message": "Audio generation complete",
  "total_chunks": 10
}
  5. File Saved (if save_file=true)
{
  "type": "file_saved",
  "path": "api_outputs/session-id.wav",
  "session_id": "123e4567-e89b-12d3-a456-426614174000"
}
  6. Error
{
  "type": "error",
  "message": "Error description"
}

REST API Endpoints

1. Batch Generation

POST /generate/batch

Non-streaming batch generation endpoint (kept for backward compatibility).

Request Body:

{
  "script": "Speaker 1: Hello world!\nSpeaker 2: How are you?",
  "speaker_names": ["en-Alice_woman", "en-Carter_man"],
  "cfg_scale": 1.3,
  "save_file": true
}

Response:

{
  "file_id": "123e4567-e89b-12d3-a456-426614174000",
  "status": "completed",
  "download_url": "/download/123e4567-e89b-12d3-a456-426614174000"
}

Rate Limit: 5 requests per minute per IP
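
If the limit is exceeded, a simple client-side retry loop helps. The sketch below assumes the server answers throttled requests with HTTP 429; the exact throttling response is not documented here, so adjust the status check for your deployment.

import time
import requests

def generate_with_retry(payload, url="http://localhost:8000/generate/batch", max_retries=3):
    """Submit a batch request, backing off when the rate limit is hit."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload)
        if response.status_code == 429:   # assumed rate-limit status
            time.sleep(2 ** attempt)      # exponential backoff: 1s, 2s, 4s
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("Rate limit still exceeded after retries")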

2. Download Generated File

GET /download/{file_id}

Download a previously generated audio file.

Response: WAV audio file

3. List Available Voices

GET /voices

Get all available voice presets.

Response:

{
  "voices": [
    "en-Alice_woman",
    "en-Carter_man",
    "en-Frank_man",
    "en-Mary_woman_bgm",
    "zh-Bowen_man",
    "zh-Xinran_woman"
  ],
  "count": 6
}

4. Health Check

GET /health

Check API status and GPU information.

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "active_connections": 2,
  "gpu_available": true,
  "gpu_name": "NVIDIA GeForce RTX 4090",
  "gpu_memory_allocated": "3.45 GB",
  "gpu_memory_reserved": "4.12 GB",
  "gpu_memory_total": "24.00 GB"
}

Script Format

Your script should follow this format:

Speaker 1: First person's dialogue here.
Speaker 2: Second person's response.
Speaker 1: More dialogue from first person.

Important Notes:

  • Each speaker line must start with "Speaker" followed by a number
  • Speaker numbers should be consistent throughout the script
  • Provide voice names in the speaker_names array, matching the speaker order
  • Multi-line dialogue is automatically combined
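
Scripts can also be assembled programmatically. The helper below is a minimal sketch (build_script is not part of this API):

def build_script(turns):
    """Assemble a VibeVoice script from (speaker_number, text) pairs."""
    lines = []
    for speaker, text in turns:
        if not isinstance(speaker, int) or speaker < 1:
            raise ValueError("speaker numbers must be positive integers")
        lines.append(f"Speaker {speaker}: {text.strip()}")
    return "\n".join(lines)

script = build_script([
    (1, "Welcome to our podcast!"),
    (2, "Thanks for having me!"),
])
# -> "Speaker 1: Welcome to our podcast!\nSpeaker 2: Thanks for having me!"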

💡 Usage Examples

Python WebSocket Client

import asyncio
import websockets
import json
import numpy as np
import scipy.io.wavfile as wavfile

async def stream_audio():
    uri = "ws://localhost:8000/ws/generate"
    
    async with websockets.connect(uri) as websocket:
        # Send generation request
        request = {
            "script": "Speaker 1: Welcome to our podcast!\nSpeaker 2: Thanks for having me!",
            "speaker_names": ["en-Alice_woman", "en-Carter_man"],
            "cfg_scale": 1.3,
            "save_file": False
        }
        await websocket.send(json.dumps(request))
        print("Request sent, waiting for response...")
        
        # Receive streaming audio
        audio_chunks = []
        sample_rate = 16000
        
        while True:
            message = await websocket.recv()
            
            if isinstance(message, str):
                # JSON message
                data = json.loads(message)
                
                if data["type"] == "status":
                    print(f"Status: {data['message']}")
                    
                elif data["type"] == "metadata":
                    sample_rate = data["sample_rate"]
                    print(f"Audio format: {sample_rate}Hz, {data['channels']} channel(s)")
                    print(f"Total samples: {data['total_samples']}")
                    
                elif data["type"] == "audio_chunk":
                    print(f"Chunk {data['chunk_num'] + 1}/{data['total_chunks']}")
                    
                elif data["type"] == "complete":
                    print("Generation complete!")
                    break
                    
                elif data["type"] == "error":
                    print(f"Error: {data['message']}")
                    break
            else:
                # Binary audio data
                chunk = np.frombuffer(message, dtype=np.float32)
                audio_chunks.append(chunk)
        
        # Combine all chunks
        if audio_chunks:
            full_audio = np.concatenate(audio_chunks)
            
            # Save to file
            wavfile.write("output.wav", sample_rate, full_audio)
            print(f"Saved audio to output.wav ({len(full_audio)} samples)")

# Run the async function
asyncio.run(stream_audio())

JavaScript WebSocket Client

const ws = new WebSocket('ws://localhost:8000/ws/generate');
const audioChunks = [];
let sampleRate = 16000;

ws.onopen = () => {
    // Send generation request
    ws.send(JSON.stringify({
        script: "Speaker 1: Hello world!\nSpeaker 2: How are you?",
        speaker_names: ["en-Alice_woman", "en-Carter_man"],
        cfg_scale: 1.3,
        save_file: false
    }));
};

ws.onmessage = async (event) => {
    if (typeof event.data === 'string') {
        // JSON message
        const data = JSON.parse(event.data);
        
        if (data.type === 'status') {
            console.log('Status:', data.message);
        } else if (data.type === 'metadata') {
            sampleRate = data.sample_rate;
            console.log(`Audio: ${sampleRate}Hz, ${data.channels} channel(s)`);
        } else if (data.type === 'audio_chunk') {
            console.log(`Chunk ${data.chunk_num + 1}/${data.total_chunks}`);
        } else if (data.type === 'complete') {
            console.log('Generation complete!');
            // Process audio chunks here
            playAudio(audioChunks, sampleRate);
        }
    } else {
        // Binary audio data
        const arrayBuffer = await event.data.arrayBuffer();
        const float32Array = new Float32Array(arrayBuffer);
        audioChunks.push(float32Array);
    }
};

function playAudio(chunks, sampleRate) {
    // Combine chunks and play using Web Audio API
    const audioContext = new AudioContext({sampleRate: sampleRate});
    
    // Calculate total length
    const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
    const combinedAudio = new Float32Array(totalLength);
    
    // Combine all chunks
    let offset = 0;
    for (const chunk of chunks) {
        combinedAudio.set(chunk, offset);
        offset += chunk.length;
    }
    
    // Create audio buffer and play
    const audioBuffer = audioContext.createBuffer(1, totalLength, sampleRate);
    audioBuffer.copyToChannel(combinedAudio, 0);
    
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);
    source.start();
}

Python REST Client (Batch Mode)

import requests
import time

# Submit batch generation request
response = requests.post("http://localhost:8000/generate/batch", json={
    "script": "Speaker 1: Welcome!\nSpeaker 2: Thanks!",
    "speaker_names": ["en-Alice_woman", "en-Carter_man"],
    "cfg_scale": 1.3,
    "save_file": True
})

result = response.json()
file_id = result["file_id"]
print(f"File ID: {file_id}")

# Download the result
audio_response = requests.get(f"http://localhost:8000/download/{file_id}")
with open("output.wav", "wb") as f:
    f.write(audio_response.content)
print("Audio saved as output.wav")

cURL Examples

# Check health and GPU status
curl "http://localhost:8000/health"

# Batch generation
curl -X POST "http://localhost:8000/generate/batch" \
     -H "Content-Type: application/json" \
     -d '{
       "script": "Speaker 1: Hello!\nSpeaker 2: Hi there!",
       "speaker_names": ["en-Alice_woman", "en-Carter_man"],
       "cfg_scale": 1.3
     }'

# Download generated file
curl -o output.wav "http://localhost:8000/download/YOUR_FILE_ID"

# List available voices
curl "http://localhost:8000/voices"

⚙️ Configuration

Model Selection

By default, the API uses microsoft/VibeVoice-1.5B. To use the 7B model, modify the startup function:

model_path = "WestZhang/VibeVoice-Large-pt"
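
A convenient pattern is to read the path from an environment variable instead of editing the code. VIBEVOICE_MODEL below is a hypothetical variable name, not something the API reads out of the box:

import os

# Falls back to the default 1.5B checkpoint when the variable is unset
model_path = os.environ.get("VIBEVOICE_MODEL", "microsoft/VibeVoice-1.5B")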

Streaming Chunk Size

Adjust chunk size for different latency/bandwidth tradeoffs:

generator = StreamingAudioGenerator(model, processor, chunk_size=16000)  # 1 second chunks
# chunk_size=8000   # 0.5 seconds (lower latency)
# chunk_size=32000  # 2 seconds (lower overhead)

Voice Directory

Place your custom voice samples in the demo/voices/ directory. Supported format: WAV files.
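
The streaming metadata reports 16 kHz mono float32 audio, so converting custom samples to that format is a reasonable precaution. The sketch below uses scipy and assumes the voice loader accepts 16 kHz mono WAV files (the loader's exact requirements are not documented here):

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def prepare_voice(src_path, dst_path, target_rate=16000):
    """Convert a WAV file to 16 kHz mono float32 for use as a voice preset."""
    rate, audio = wavfile.read(src_path)
    audio = audio.astype(np.float32)
    if audio.ndim == 2:                # stereo -> mono
        audio = audio.mean(axis=1)
    peak = np.abs(audio).max()
    if peak > 1.0:                     # normalize integer PCM to [-1, 1]
        audio /= peak
    if rate != target_rate:
        audio = resample_poly(audio, target_rate, rate)
    wavfile.write(dst_path, target_rate, audio.astype(np.float32))

# Example filename following the preset naming convention
prepare_voice("my_voice.wav", "demo/voices/en-Custom_woman.wav")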

GPU Optimizations

The API automatically enables:

  • TF32: Faster computation on Ampere+ GPUs (RTX 3000/4000 series)
  • cuDNN benchmark: Auto-tunes for optimal performance
  • bfloat16 precision: Memory-efficient on GPU

To disable these optimizations, comment out the corresponding lines in the startup function:

# torch.backends.cudnn.benchmark = True
# torch.backends.cuda.matmul.allow_tf32 = True
# torch.backends.cudnn.allow_tf32 = True

🚨 Important Notes

GPU Requirements

  • Minimum: 8GB VRAM for 1.5B model
  • Recommended: 16GB+ VRAM for 7B model
  • CPU inference is supported but 10-50x slower

Chinese Speech Stability

For optimal Chinese speech generation:

  • Use English punctuation (commas and periods only); see the sketch below this list
  • Consider using the 7B model for better stability
  • Avoid special Chinese quotation marks
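
A small normalization pass can enforce these rules automatically. The mapping below is illustrative, not exhaustive:

# Illustrative mapping from common Chinese punctuation to English equivalents
CN_PUNCT = {
    "，": ",", "。": ".", "！": ".", "？": ".",
    "；": ",", "：": ",", "、": ",",
    "“": "", "”": "", "‘": "", "’": "",
}

def normalize_punct(text: str) -> str:
    """Replace Chinese punctuation with plain commas and periods."""
    for cn, en in CN_PUNCT.items():
        text = text.replace(cn, en)
    return text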

Background Music

The model may spontaneously generate background music:

  • Voice samples with BGM increase the likelihood
  • Introductory phrases ("Welcome to", "Hello") may trigger BGM
  • Using "Alice" voice preset has higher BGM probability
  • This is an intentional feature, not a bug

WebSocket Best Practices

  • Handle disconnections gracefully with reconnection logic (see the sketch below)
  • Process audio chunks incrementally for real-time playback
  • Consider buffering for smoother playback
  • Monitor memory usage when accumulating chunks
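
A minimal reconnection wrapper for the streaming client shown earlier, sketched with the websockets library (retry counts and delays are arbitrary):

import asyncio
import websockets

async def stream_with_retries(handler, uri="ws://localhost:8000/ws/generate", max_retries=5):
    """Run handler(websocket) and reconnect with exponential backoff on drops."""
    for attempt in range(max_retries):
        try:
            async with websockets.connect(uri) as ws:
                await handler(ws)   # your send/receive loop, e.g. the body of stream_audio
                return              # handler finished normally
        except (websockets.ConnectionClosed, OSError) as exc:
            delay = min(2 ** attempt, 30)   # 1s, 2s, 4s, ... capped at 30s
            print(f"Connection lost ({exc}); retrying in {delay}s")
            await asyncio.sleep(delay)
    raise RuntimeError("Gave up after repeated connection failures")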

📋 System Architecture

┌─────────────────────────────────────────────────────────┐
│                    FastAPI Application                   │
├─────────────────────────────────────────────────────────┤
│                                                           │
│  ┌─────────────────┐         ┌──────────────────────┐  │
│  │   WebSocket     │         │    REST Endpoints    │  │
│  │   /ws/generate  │         │  /generate/batch     │  │
│  └────────┬────────┘         └──────────┬───────────┘  │
│           │                              │               │
│           └──────────┬───────────────────┘               │
│                      ▼                                   │
│         ┌─────────────────────────┐                     │
│         │ StreamingAudioGenerator │                     │
│         │  - Chunked generation   │                     │
│         │  - Real-time streaming  │                     │
│         └────────────┬────────────┘                     │
│                      ▼                                   │
│         ┌─────────────────────────┐                     │
│         │   VibeVoice Model       │                     │
│         │   (GPU Accelerated)     │                     │
│         │   - TF32 enabled        │                     │
│         │   - cuDNN optimized     │                     │
│         └─────────────────────────┘                     │
│                                                           │
└─────────────────────────────────────────────────────────┘

Key Features:

  • Real-time WebSocket streaming with chunked delivery
  • GPU-accelerated inference with automatic optimization
  • Dual API modes: streaming and batch
  • Live progress updates during generation
  • Optional file persistence

🔧 Troubleshooting

GPU Not Detected

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Check GPU info
nvidia-smi

Out of Memory Errors

  • Reduce cfg_scale (lower values use less memory)
  • Use 1.5B model instead of 7B
  • Close other GPU applications
  • Reduce chunk size for streaming
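
PyTorch's built-in counters are a quick way to see where GPU memory is going:

import torch

if torch.cuda.is_available():
    gib = 1024 ** 3
    print(f"Allocated: {torch.cuda.memory_allocated() / gib:.2f} GB")
    print(f"Reserved:  {torch.cuda.memory_reserved() / gib:.2f} GB")
    print(f"Total:     {torch.cuda.get_device_properties(0).total_memory / gib:.2f} GB")

    # Releases cached blocks back to the driver (does not free live tensors)
    torch.cuda.empty_cache()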

WebSocket Connection Issues

  • Check firewall settings
  • Ensure port 8000 is accessible
  • Verify WebSocket protocol (ws:// not http://)
  • Check for proxy/load balancer WebSocket support
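
A quick handshake test from Python (assumes the websockets package is installed):

import asyncio
import websockets

async def check():
    # A successful connect confirms the endpoint accepts WebSocket upgrades
    async with websockets.connect("ws://localhost:8000/ws/generate") as ws:
        print("WebSocket handshake succeeded")

asyncio.run(check())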

📄 License & Attribution

This project is a streaming API wrapper around Microsoft's VibeVoice model. Please refer to the original VibeVoice repository for licensing terms and model details.

⚠️ Ethical Use & Limitations

Responsible AI Usage:

  • Disclose AI-generated content when sharing
  • Ensure compliance with local laws and regulations
  • Verify content accuracy and avoid misleading applications
  • Do not use for deepfakes or disinformation
  • Respect voice actor rights and consent

Technical Limitations:

  • English and Chinese only
  • No overlapping speech generation
  • Speech synthesis only (no background noise/music control)
  • Streaming latency depends on generation speed
  • Not recommended for commercial use without additional testing

The model is intended for research and development purposes; use it responsibly.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
