Text-to-speech service using Mistral's Voxtral-4B-TTS with CUDA acceleration via vLLM-Omni.
- NVIDIA GPU with >= 28GB VRAM (e.g. RTX 5090, A100)
- Docker with NVIDIA Container Toolkit
Note: Voxtral 4B is a large model that uses ~28GB of VRAM at runtime (two vLLM EngineCore processes). Make sure no other GPU-heavy processes (Ollama, other models) are running. The `GPU_MEMORY_UTILIZATION` env var controls the fraction of VRAM allocated (default `0.95`).
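If the GPU is shared, the fraction can be lowered at startup. This is a sketch assuming the compose file forwards `GPU_MEMORY_UTILIZATION` into the container; check `docker-compose.yml` for the exact wiring:

```shell
# Assumption: docker-compose.yml passes GPU_MEMORY_UTILIZATION through
# to the service environment; lower it if other processes need VRAM.
GPU_MEMORY_UTILIZATION=0.90 docker compose up
```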
Run:

```shell
docker compose up
```

The model (~8GB) downloads automatically from HuggingFace on first run and is cached in a Docker volume; subsequent starts load from the cache. Once ready:
- Web UI: http://localhost:8880
- API Docs: http://localhost:8880/docs
- API: POST http://localhost:8880/v1/audio/speech
- Voices: GET http://localhost:8880/v1/voices
- Health: GET http://localhost:8880/health
```shell
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is Voxtral speaking!",
    "voice": "neutral_female",
    "response_format": "wav",
    "language": "English"
  }' --output output.wav
```

| Parameter | Default | Description |
|---|---|---|
| `input` | (required) | Text to synthesize |
| `voice` | `neutral_female` | Voice name |
| `speed` | `1.0` | Speed (0.25–4.0) |
| `response_format` | `wav` | `wav`, `mp3`, `opus`, `flac`, `pcm` |
| `bitrate` | `192k` | Bitrate for mp3/opus |
| `language` | `Auto` | `Auto`, `English`, `French`, `German`, `Spanish`, `Italian`, `Portuguese`, `Dutch`, `Hindi`, `Arabic` |
ar_male, casual_female, casual_male, cheerful_female, de_female, de_male, es_female, es_male, fr_female, fr_male, hi_female, hi_male, it_female, it_male, neutral_female, neutral_male, nl_female, nl_male, pt_female, pt_male
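The endpoint and parameters above can be wrapped in a small Python client. This is a sketch using only the stdlib; the endpoint, parameter names, and voice list come from this README, while the helper names (`build_speech_request`, `synthesize`) are illustrative, not part of the service:

```python
import json
import urllib.request

# Voice names as listed above.
VOICES = {
    "ar_male", "casual_female", "casual_male", "cheerful_female",
    "de_female", "de_male", "es_female", "es_male", "fr_female", "fr_male",
    "hi_female", "hi_male", "it_female", "it_male", "neutral_female",
    "neutral_male", "nl_female", "nl_male", "pt_female", "pt_male",
}

def build_speech_request(text, voice="neutral_female", speed=1.0,
                         response_format="wav", language="Auto"):
    """Build and validate the JSON payload for POST /v1/audio/speech."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "input": text,
        "voice": voice,
        "speed": speed,
        "response_format": response_format,
        "language": language,
    }

def synthesize(text, url="http://localhost:8880/v1/audio/speech", **kwargs):
    """POST the payload and return the raw audio bytes."""
    payload = json.dumps(build_speech_request(text, **kwargs)).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (with the service running):
#   audio = synthesize("Hello, this is Voxtral speaking!", voice="fr_female")
#   open("output.wav", "wb").write(audio)
```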
```
[Client] → :8880 FastAPI (web UI + audio encoding)
               ↓
           :8091 vLLM-Omni (model inference, internal)
```
The entrypoint starts vLLM-Omni as a backend on port 8091, then launches the FastAPI wrapper on port 8880 which adds the web UI and handles audio format conversion.
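The format conversion the wrapper performs can be illustrated with a minimal sketch: wrapping raw PCM from the inference backend in a WAV (RIFF) container. The 24 kHz mono 16-bit parameters here are assumptions for illustration, not the service's documented internal format:

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV (RIFF) container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# Example: 0.1 s of 16-bit silence gains a 44-byte WAV header.
silence = b"\x00\x00" * 2400
wav_bytes = pcm_to_wav(silence)
```

Formats that need real transcoding (mp3, opus, flac) require an encoder such as ffmpeg rather than a simple container wrap.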
The model barely fits on an RTX 5090 (32GB). Boot is slow (~2 min) due to model loading and vLLM initialization. Once running, latency is good (~400ms for short text, ~1.3s for medium) and voice quality is on par with ElevenLabs.