An open-source, sophisticated multi-model AI audio generation platform
Integrates state-of-the-art voice conversion, sound effects (SFX) generation, and text-to-audio models in a single platform.
VOX is a modular open-source AI audio platform that brings together state-of-the-art models for:
- Voice conversion & cloning
- Multilingual text-to-speech
- Text-to-audio & sound effects generation
One command sets up everything — environments, model weights, dependencies, and database:
chmod +x init.sh
./init.sh
- Next.js 15 (App Router)
- TypeScript
- Tailwind CSS
- Zustand
- TanStack Query
- Node.js 20+
- Drizzle ORM
- p-queue
- Seed-VC — Zero-shot voice conversion & cloning
- Make-An-Audio — Text-to-audio generation
- XTTS-v2 — High-quality multilingual TTS
- Bash orchestration
- Python-based environment & model manager
├── packages/
│ ├── app/ # Next.js frontend
│ └── server/ # Backend API & database
├── models/
│ ├── seed-vc/ # Voice conversion
│ ├── make-an-audio/ # Audio generation
│ └── xtts-v2/ # Text-to-speech
├── data/ # Audio assets & outputs
└── init.sh # One-command setup
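The layout above can be checked programmatically after cloning. The following is a minimal sketch and not part of VOX itself: the entries in `EXPECTED_PATHS` are copied from the tree, while the helper name `missing_paths` and the assumption that it runs from the repository root are purely illustrative.

```python
from pathlib import Path

# Paths taken from the directory tree above; everything else is illustrative.
EXPECTED_PATHS = [
    "packages/app",
    "packages/server",
    "models/seed-vc",
    "models/make-an-audio",
    "models/xtts-v2",
    "data",
    "init.sh",
]


def missing_paths(root: str = ".") -> list[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing from checkout:", ", ".join(missing))
    else:
        print("Repository layout matches the expected tree.")
```

Some directories, such as model weights or `data/` outputs, may only appear after `init.sh` has run, so a non-empty result before setup is not necessarily an error.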
- OS: macOS (MPS) or Linux (CUDA)
- Python: 3.10+
- Node.js: 20+
- GPU: Recommended (CPU supported with reduced performance)
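As a rough illustration, the requirements above could be verified before running the setup script. This is a hedged sketch using only the Python standard library; the function names `parse_node_major` and `check_prerequisites` are hypothetical and do not describe what `init.sh` actually does. It checks the OS, Python, and Node.js requirements only and does not probe for a GPU.

```python
import platform
import re
import shutil
import subprocess
import sys
from typing import Optional

MIN_PYTHON = (3, 10)   # from the requirements list
MIN_NODE_MAJOR = 20    # from the requirements list


def parse_node_major(version_text: str) -> Optional[int]:
    """Extract the major version from output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    return int(match.group(1)) if match else None


def check_prerequisites() -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    # platform.system() reports 'Darwin' on macOS.
    if platform.system() not in ("Darwin", "Linux"):
        problems.append(f"unsupported OS: {platform.system()}")
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.10+ is required")
    node = shutil.which("node")
    if node is None:
        problems.append("Node.js not found on PATH")
    else:
        result = subprocess.run([node, "--version"], capture_output=True, text=True)
        major = parse_node_major(result.stdout)
        if major is None or major < MIN_NODE_MAJOR:
            found = result.stdout.strip() or "unknown"
            problems.append(f"Node.js 20+ is required (found {found})")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("All checks passed." if not issues else "\n".join(issues))
```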
MIT — free to use, modify, and distribute.