ffaerber/voxtral-cuda
Voxtral CUDA TTS

Text-to-speech service using Mistral's Voxtral-4B-TTS with CUDA acceleration via vLLM-Omni.

Requirements

  • NVIDIA GPU with >= 28GB VRAM (e.g. RTX 5090, A100)
  • Docker with NVIDIA Container Toolkit

Note: Voxtral 4B is a large model that uses ~28GB VRAM at runtime (two vLLM EngineCore processes). Make sure no other GPU-heavy processes (Ollama, other models) are running. The GPU_MEMORY_UTILIZATION env var controls the fraction of VRAM allocated (default 0.95).
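
If the GPU is shared with other workloads, the VRAM fraction can be lowered via that env var. A minimal sketch, assuming the image is run directly rather than via compose (image name and port mapping taken from this repository; adjust to your setup):

```shell
# Cap vLLM's VRAM allocation at 90% instead of the default 0.95.
docker run --gpus all \
  -e GPU_MEMORY_UTILIZATION=0.90 \
  -p 8880:8880 \
  ffaerber/voxtral-cuda
```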

Quick Start

docker compose up

The model downloads automatically from Hugging Face on first run (~8GB) and is cached in a Docker volume; subsequent starts load from the cache. Once the service is ready, it can be called via the API:

API Usage

curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is Voxtral speaking!",
    "voice": "neutral_female",
    "response_format": "wav",
    "language": "English"
  }' --output output.wav

Parameters

Parameter        Default          Description
input            (required)       Text to synthesize
voice            neutral_female   Voice name
speed            1.0              Playback speed (0.25–4.0)
response_format  wav              One of wav, mp3, opus, flac, pcm
bitrate          192k             Bitrate for mp3/opus output
language         Auto             Auto, English, French, German, Spanish, Italian, Portuguese, Dutch, Hindi, Arabic
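
These parameters can be combined in a single request. A sketch asking for MP3 output at 1.25x speed (endpoint and parameter names as documented above; assumes the service is running locally):

```shell
# Build the request body, then POST it to the speech endpoint.
PAYLOAD='{
  "input": "Testing the speed and format options.",
  "voice": "neutral_male",
  "speed": 1.25,
  "response_format": "mp3",
  "bitrate": "192k"
}'
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" --output output.mp3
```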

Available Voices

ar_male, casual_female, casual_male, cheerful_female, de_female, de_male, es_female, es_male, fr_female, fr_male, hi_female, hi_male, it_female, it_male, neutral_female, neutral_male, nl_female, nl_male, pt_female, pt_male
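
A quick way to audition voices is to loop over a few names from the list and save one sample per voice (assumes the service is up on the default port; the sample text and output filenames are illustrative):

```shell
# A few voice names from the list above; extend as needed.
VOICES="neutral_female neutral_male fr_female de_male"
for VOICE in $VOICES; do
  curl -s -X POST http://localhost:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{\"input\": \"Voice check.\", \"voice\": \"$VOICE\"}" \
    --output "sample_$VOICE.wav"
done
```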

Architecture

[Client] → :8880 FastAPI (web UI + audio encoding)
                    ↓
            :8091 vLLM-Omni (model inference, internal)

The entrypoint starts vLLM-Omni as a backend on port 8091, then launches the FastAPI wrapper on port 8880 which adds the web UI and handles audio format conversion.

Notes

The model barely fits on an RTX 5090 (32GB). Startup is slow (~2 minutes) due to model loading and vLLM initialization. Once running, latency is good (~400ms for short text, ~1.3s for medium) and voice quality is on par with ElevenLabs.
