Text-to-speech service using Mistral's Voxtral-4B-TTS with CUDA acceleration via vLLM-Omni.
- NVIDIA GPU with >= 28GB VRAM (e.g. RTX 5090, A100)
- Docker with NVIDIA Container Toolkit
Note: Voxtral 4B is a large model that uses ~28GB of VRAM at runtime (two vLLM EngineCore processes). Make sure no other GPU-heavy processes (Ollama, other models) are running. The `GPU_MEMORY_UTILIZATION` env var controls the fraction of VRAM allocated (default `0.95`).
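If the GPU is shared, the fraction can be lowered at startup. This is a sketch assuming the compose file forwards `GPU_MEMORY_UTILIZATION` into the container; check `docker-compose.yml` for the exact wiring:

```shell
# Assumption: docker-compose.yml passes GPU_MEMORY_UTILIZATION through
# to the service environment; lower it if other processes need VRAM.
GPU_MEMORY_UTILIZATION=0.90 docker compose up
```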
Run:

```shell
docker compose up
```

The model (~8GB) downloads automatically from HuggingFace on first run and is cached in a Docker volume; subsequent starts load from the cache. Once ready:
- Web UI: http://localhost:8880
- API Docs: http://localhost:8880/docs
- API: POST http://localhost:8880/v1/audio/speech
- Voices: GET http://localhost:8880/v1/voices
- Health: GET http://localhost:8880/health
```shell
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is Voxtral speaking!",
    "voice": "neutral_female",
    "response_format": "wav",
    "language": "English"
  }' --output output.wav
```

| Parameter | Default | Description |
|---|---|---|
| `input` | (required) | Text to synthesize |
| `voice` | `neutral_female` | Voice name |
| `speed` | `1.0` | Speed (0.25–4.0) |
| `response_format` | `wav` | `wav`, `mp3`, `opus`, `flac`, `pcm` |
| `bitrate` | `192k` | Bitrate for mp3/opus |
| `language` | `Auto` | `Auto`, `English`, `French`, `German`, `Spanish`, `Italian`, `Portuguese`, `Dutch`, `Hindi`, `Arabic` |
ar_male, casual_female, casual_male, cheerful_female, de_female, de_male, es_female, es_male, fr_female, fr_male, hi_female, hi_male, it_female, it_male, neutral_female, neutral_male, nl_female, nl_male, pt_female, pt_male
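The endpoint and parameters above can be wrapped in a small Python client. This is a sketch using only the stdlib; the endpoint, parameter names, and voice list come from this README, while the helper names (`build_speech_request`, `synthesize`) are illustrative, not part of the service:

```python
import json
import urllib.request

# Voice names as listed above.
VOICES = {
    "ar_male", "casual_female", "casual_male", "cheerful_female",
    "de_female", "de_male", "es_female", "es_male", "fr_female", "fr_male",
    "hi_female", "hi_male", "it_female", "it_male", "neutral_female",
    "neutral_male", "nl_female", "nl_male", "pt_female", "pt_male",
}

def build_speech_request(text, voice="neutral_female", speed=1.0,
                         response_format="wav", language="Auto"):
    """Build and validate the JSON payload for POST /v1/audio/speech."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "input": text,
        "voice": voice,
        "speed": speed,
        "response_format": response_format,
        "language": language,
    }

def synthesize(text, url="http://localhost:8880/v1/audio/speech", **kwargs):
    """POST the payload and return the raw audio bytes."""
    payload = json.dumps(build_speech_request(text, **kwargs)).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (with the service running):
#   audio = synthesize("Hello, this is Voxtral speaking!", voice="fr_female")
#   open("output.wav", "wb").write(audio)
```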
```
[Client] → :8880 FastAPI (web UI + audio encoding)
               ↓
           :8091 vLLM-Omni (model inference, internal)
```
The entrypoint starts vLLM-Omni as a backend on port 8091, then launches the FastAPI wrapper on port 8880 which adds the web UI and handles audio format conversion.
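The format conversion the wrapper performs can be illustrated with a minimal sketch: wrapping raw PCM from the inference backend in a WAV (RIFF) container. The 24 kHz mono 16-bit parameters here are assumptions for illustration, not the service's documented internal format:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV (RIFF) container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# Example: 0.1 s of 16-bit silence gains a 44-byte WAV header.
silence = b"\x00\x00" * 2400
wav_bytes = pcm_to_wav(silence)
```

Formats that need real transcoding (mp3, opus, flac) require an encoder such as ffmpeg rather than a simple container wrap.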
The model barely fits on an RTX 5090 (32GB). Boot is slow (~2 min) due to model loading and vLLM initialization. Once running, latency is good (~400ms for short text, ~1.3s for medium) and voice quality is on par with ElevenLabs.