A web-based Text-to-Speech application supporting multiple TTS engines including Kokoro-82M and Chatterbox, with both local GPU inference and Replicate cloud API options for generating multi-voice audiobooks and stories.
- Multi-Engine Support: Choose from four TTS engine options:
- Kokoro · Local GPU - Run Kokoro-82M locally on your NVIDIA GPU
- Kokoro · Replicate - Use Kokoro via Replicate cloud API
- Chatterbox · Local GPU - Run Chatterbox locally with voice cloning (~8GB VRAM required)
- Chatterbox · Replicate - Use Chatterbox via Replicate cloud API (`resemble-ai/chatterbox-turbo`)
- Unified Replicate API: Single API token works for both Kokoro and Chatterbox Replicate engines
- Chatterbox Voice Cloning: Upload your own voice recordings (10-15 seconds recommended) to clone any voice
- Chatterbox Voice Management: Add, rename, delete, and preview custom voice prompts with drag-and-drop bulk upload
- Multi-Voice Support: Use Kokoro-82M voices for any number of characters in your story
- Custom Voice Blending: Mix any combination of Kokoro voices with weighted ratios to create reusable "custom_*" voice codes
- Speaker Tags & Auto Detection: Automatically parse `[speaker1]...[/speaker1]` or `[alice]...[/alice]` tags
- Smart Text Chunking: Automatically splits long texts into manageable chunks
- Seamless Audio Merging: Merges chunks into a single file with configurable crossfade
- Intro & Inter-Segment Silence Controls: Dial in precise empty space before the first line and between chunks
- Gemini Pre-Processing: Automatically decides between whole-text or chapter-based Gemini runs with speaker-memory context
- Speaker Memory Between Chunks: Gemini requests carry forward discovered speaker tags for consistency
- Local GPU Processing: Run entirely on your machine for privacy and speed
- Cloud API Option: Use Replicate API when you don't have local GPU resources
- Job Queue: Submit multiple jobs, track real-time progress with ETA, cancel, and download results
- Job Queue Tab: Dedicated UI to monitor all jobs with progress bars and chunk counts
- Audio Library: Browsable list of all completed outputs with inline players, engine indicator showing which TTS engine was used, and delete/clear controls
- Chapter Collections + Full Audiobook: Toggle per-chapter outputs and optionally create a single combined audiobook
- Available Voices & Previews: Browse all Kokoro voices grouped by language, generate preview samples
- Configurable Settings: Control TTS engine, speed, chunk size, output format, bitrate, crossfade
- Dynamic Gemini Controls: Save your Gemini API key, fetch the latest available Gemini models on demand
- Web Interface: Modern single-page UI built with Flask and vanilla JS
TTS-Story exposes the full Kokoro-82M voice set, grouped by language.
American English
- Female: af_alloy, af_aoede, af_bella, af_heart, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky
- Male: am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa

British English
- Female: bf_alice, bf_emma, bf_isabella, bf_lily
- Male: bm_daniel, bm_fable, bm_george, bm_lewis

Spanish
- ef_dora, em_alex, em_santa

French
- ff_siwis

Hindi
- hf_alpha, hf_beta, hm_omega

Japanese
- jf_alpha, jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo

Mandarin Chinese
- zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi

Brazilian Portuguese
- pf_dora, pm_alex, pm_santa
All of these voices are browsable in the Available Voices tab, where you can generate and play preview samples.
Chatterbox supports voice cloning from audio recordings. To add custom voices:
- Go to the Available Voices tab and scroll to the Chatterbox Available Voices section
- Upload a voice recording (WAV, MP3, M4A, FLAC, or OGG format)
- Recommended duration: 10-15 seconds of clear speech
- Avoid background noise for best results
- Give the voice a descriptive name and click Save Voice
- Your custom voices appear in all Chatterbox voice dropdowns, sorted alphabetically
You can also drag-and-drop multiple audio files for bulk upload. Each voice can be previewed, renamed, or deleted from the management interface.
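For bulk scripting, the same voice-prompt store is reachable through the `/api/chatterbox-voices` endpoint listed under API Endpoints. Below is a minimal sketch; the multipart field names `file` and `name` are assumptions, not confirmed parts of the API:

```python
# Hypothetical sketch: bulk-upload Chatterbox voice prompts via the documented
# POST /api/chatterbox-voices endpoint. The multipart field names "file" and
# "name" are assumptions and may differ from the actual implementation.
import pathlib
import requests

BASE_URL = "http://localhost:5000"

for path in pathlib.Path("my_voices").glob("*.wav"):
    with path.open("rb") as f:
        resp = requests.post(
            f"{BASE_URL}/api/chatterbox-voices",
            files={"file": (path.name, f, "audio/wav")},
            data={"name": path.stem},  # descriptive voice name
        )
    resp.raise_for_status()
    print(f"Uploaded {path.name}")
```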
- Python 3.9 or higher
- NVIDIA GPU with CUDA support (optional, for local GPU inference)
- Internet connection (for downloading dependencies)
- Clone or download the repository
  - `git clone <your-repo-url>`
  - `cd TTS-Story`
- Run the setup script: `setup.bat`

The setup script will automatically:
- ✅ Detect your Python version
- ✅ Create a Python virtual environment
- ✅ Detect your NVIDIA GPU and CUDA version
- ✅ Install PyTorch with appropriate CUDA support (or CPU-only if no GPU)
- ✅ Download and install espeak-ng automatically
- ✅ Install all other required dependencies
- ✅ Download the Rubber Band CLI and wire it up for high-quality pitch/tempo FX
- ✅ Verify the installation
Supported CUDA Versions:
- CUDA 12.9, 12.8, 12.6, 12.4, 12.1
- CUDA 11.8
- CPU-only (automatic fallback if no GPU detected)
- Start the application: `run.bat`
- Open your browser to `http://localhost:5000`
If you prefer to install manually or the automatic setup fails:
- Install espeak-ng
  - Download from the espeak-ng releases page
  - Install the `espeak-ng-X64.msi` file for Windows
- Install Rubber Band CLI (for pitch/tempo FX quality)
  - Download the Windows zip from breakfastquay.com/rubberband
  - Extract it and add the folder containing `rubberband.exe` to your `PATH`
- Create a virtual environment
  - `python -m venv venv`
  - `venv\Scripts\activate`
- Install PyTorch with CUDA support
  - For CUDA 12.1 (most common): `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121`
  - For CPU only: `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu`
- Install other dependencies: `pip install -r requirements.txt`
- Run the application: `python app.py`

TTS-Story supports four TTS engine options. In the Settings tab, choose your preferred default engine:
| Engine | Description | Requirements |
|---|---|---|
| Kokoro · Local GPU | Run Kokoro-82M locally | NVIDIA GPU with CUDA |
| Kokoro · Replicate | Kokoro via cloud API | Replicate API token |
| Chatterbox · Local GPU | Chatterbox with voice cloning | NVIDIA GPU (~8GB VRAM) |
| Chatterbox · Replicate | Chatterbox via cloud API | Replicate API token |
You can also override the engine per-job in the Generate tab.
- Open your browser to `http://localhost:5000`
- In Settings, select your preferred TTS engine
- If using Replicate engines, enter your API token in the Replicate API section
- Paste your text with or without speaker tags in the Generate tab
- Select a Default Voice (used for plain text / unassigned speakers)
- If you use speaker tags, TTS-Story automatically analyzes the text and lets you assign voices per speaker
- Click Generate Audio
- The job is added to the Job Queue, processed in the background, and the result appears in:
- Job Queue tab (with real-time progress, ETA, and player)
- Library tab (all past generations with engine indicator)
When using a Chatterbox engine:
- First, add voice recordings in Available Voices → Chatterbox Available Voices
- In the Generate tab, select your cloned voice from the Reference Voice dropdown
- Each speaker can use a different cloned voice for multi-character stories
- A shared "Quick Test Text" field lives above the Assigned Voices section so you can type once and preview any speaker with matching FX.
- Each speaker row includes an inline Quick Test button beside the tone controls.
Note: Local GPU modes run entirely on your machine and never use cloud APIs, ensuring complete privacy and no API costs.
- Intro Silence (ms): Adds empty space before the very first spoken line
- Silence Between Segments (ms): Inserts a gap after each chunk/line before the next one begins
- Both settings are configurable in the Generation Settings panel (0–2000 ms, 100 ms steps)
Both Kokoro · Replicate and Chatterbox · Replicate use the same API token:
- Get your API key from Replicate (starts with `r8_...`)
- In Settings, enter your token in the Replicate API section
- Click Save Settings
- Select either Replicate engine from the dropdown
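If you prefer to script this instead of using the Settings tab, the token can also be saved through the documented settings endpoint. A minimal sketch, assuming `POST /api/settings` accepts a partial JSON body with the same keys as `config.json`:

```python
# Minimal sketch: save a Replicate token and switch the default engine through
# the settings API. The keys mirror config.json; replace the token with yours.
import requests

BASE_URL = "http://localhost:5000"

requests.post(f"{BASE_URL}/api/settings", json={
    "replicate_api_key": "r8_your_token_here",
    "tts_engine": "kokoro_replicate",  # or "chatterbox_turbo_replicate"
}).raise_for_status()

# Confirm the saved settings.
print(requests.get(f"{BASE_URL}/api/settings").json())
```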
You can use either numbered speakers or named speakers:
Numbered Format:
[speaker1]Hello, my name is Alice.[/speaker1]
[speaker2]Nice to meet you, Alice! I'm Bob.[/speaker2]
[speaker1]It's great to meet you too![/speaker1]
Named Format:
[narrator]Once upon a time, in a land far away...[/narrator]
[alice]Hello, my name is Alice.[/alice]
[bob]Nice to meet you, Alice! I'm Bob.[/bob]
[narrator]And so their adventure began.[/narrator]
You can use any alphanumeric name (letters, numbers, underscores). The system will automatically detect all unique speakers and let you assign voices to each one.
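For illustration, a tag parser along these lines (not necessarily the exact logic in `src/text_processor.py`) is enough to recover the speaker list and per-speaker segments:

```python
# Illustration of speaker-tag parsing (not necessarily the exact logic in
# src/text_processor.py): find [name]...[/name] blocks and collect speakers.
import re

TAG_RE = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)

def parse_speaker_segments(text: str) -> list[tuple[str, str]]:
    """Return (speaker, line) pairs in document order."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]

story = """[narrator]Once upon a time...[/narrator]
[alice]Hello, my name is Alice.[/alice]"""

segments = parse_speaker_segments(story)
print(sorted({speaker for speaker, _ in segments}))  # ['alice', 'narrator']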
Need to tidy a manuscript or add consistent speaker tags before running TTS? Use the Prep Text with Gemini button:
- Enter your Gemini API key and model in Settings, then click Fetch Available Models if you want to load the latest list directly from Google.
- Paste your story in the Generate tab and decide whether "Generate separate audio files for each chapter" should be enabled.
- Select a Prompt Preset (see below) or write your own custom prompt.
- Click Prep Text with Gemini:
- If chapter splitting is enabled, TTS-Story reuses the detected chapter list and sends each one to Gemini separately with your pre-prompt and the running speaker list.
- If chapter splitting is disabled, the whole manuscript (plus pre-prompt) is sent in a single Gemini request to respect the context window.
- A real-time progress bar shows which chapter or full-text step is running.
- When Gemini finishes, the cleaned/expanded narrative replaces the input field. Chapter headings stay inside the narrator tags so audio splitting still works.
- Re-run Analyze Text if needed. Your voice assignments and FX settings remain untouched unless you explicitly reset them.
Because the speaker list is tracked across sections, characters that appear later continue to use the same tag, which keeps the voice assignment UI tidy and prevents duplicate dropdowns.
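The same flow can be scripted against the Gemini endpoints listed under API Endpoints. The sketch below is hypothetical: the endpoints exist, but the request and response field names (`text`, `sections`, `section`, `speakers`, `result`) are assumptions.

```python
# Hypothetical sketch of the sectioned Gemini workflow using the documented
# endpoints. All payload/response field names are assumptions; the actual
# frontend requests may differ.
import requests

BASE_URL = "http://localhost:5000"
manuscript = open("story.txt", encoding="utf-8").read()

# 1. Preview which sections (chapters/chunks) Gemini would process.
sections = requests.post(f"{BASE_URL}/api/gemini/sections",
                         json={"text": manuscript}).json()["sections"]

# 2. Process each section in sequence, carrying forward discovered speakers.
speakers, processed = [], []
for section in sections:
    resp = requests.post(f"{BASE_URL}/api/gemini/process-section",
                         json={"section": section, "speakers": speakers}).json()
    processed.append(resp["result"])
    speakers = resp.get("speakers", speakers)  # speaker memory between chunks

print("\n\n".join(processed))
```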
TTS-Story includes three pre-configured Gemini prompt presets optimized for different use cases:
| Preset | Best For | Description |
|---|---|---|
| Chatterbox Natural Dialogue Conversation | Chatterbox engines | Transforms text into natural-sounding dialogue with paralinguistic tags (laughter, sighs, pauses) and human speech quirks. Ideal for conversational content where you want expressive, lifelike output. |
| Chatterbox Audio Book Conversion | Chatterbox engines | Maintains strict adherence to the original text while converting symbols and abbreviations that TTS engines struggle with into speakable words (e.g., "/" → "slash", "-" → "dash", "Dr." → "Doctor"). |
| Kokoro Audio Book Conversion | Kokoro engines | Preserves the exact text of the book while adding speaker tags and preparing the content for TTS conversion. Focuses on accurate speaker identification and proper text segmentation without modifying the original prose. |
Select a preset from the dropdown in the Gemini section, or create your own custom prompts and save them for reuse.
If no speaker tags are found, the entire text will be processed with a single voice.
- Job Queue tab shows all jobs with:
- Real-time progress bars and chunk counts
- ETA estimates during processing
- Status indicators (`queued`, `processing`, `completed`, `failed`, `cancelled`)
- Per-job controls (cancel, download)
- Library tab lists all completed outputs (sorted newest first) with:
- Engine indicator showing which TTS engine was used (Kokoro, Kokoro Replicate, Chatterbox, Chatterbox Replicate)
- Inline audio players
- Download links
- Delete and "Clear All" controls
- Available Voices tab lists all Kokoro-82M voices grouped by language.
- You can:
- Generate preview samples for all voices
- Regenerate (overwrite) samples if you change text or update voices
- Click any voice to play its preview sample
- Open the Custom Voice Blends panel inside the Available Voices tab to create bespoke voices.
- Click New Custom Voice (or Edit on any card) to open the modal where you can:
- Name the blend and choose its language group (lang_code)
- Add one or more component voices and set their mix weights (e.g., 0.5 narrator + 0.5 af_heart)
- Optionally add notes for future reference
- Saved blends appear in the grid with metadata (code, language, updated time) and can be edited or deleted at any time.
- All custom voices automatically show up in:
- Default voice dropdowns
- Per-speaker assignment selects (grouped by language under “Custom Blends” optgroups)
- `/api/voices` responses (`custom_voices` arrays per language) so automation scripts can use them
- When the generator encounters a `custom_*` voice, the backend blends the component embeddings on the fly and caches the tensor for fast reuse.
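Conceptually, a `custom_*` voice is just a normalized weighted sum of its component embeddings. A sketch of that idea (not the project's actual implementation), assuming torch-style embedding tensors:

```python
# Conceptual sketch of weighted voice blending (not the project's actual code):
# a custom_* voice is a weighted sum of its component voice embeddings,
# normalized so the weights add up to 1, then cached for reuse.
import torch

_blend_cache: dict[str, torch.Tensor] = {}

def blend_voice(code: str, components: dict[str, torch.Tensor],
                weights: dict[str, float]) -> torch.Tensor:
    if code not in _blend_cache:
        total = sum(weights.values())
        _blend_cache[code] = sum(
            (weights[name] / total) * emb for name, emb in components.items()
        )
    return _blend_cache[code]

# e.g. 0.5 * af_heart + 0.5 * af_nicole:
# blended = blend_voice("custom_warm_narrator",
#                       {"af_heart": emb_heart, "af_nicole": emb_nicole},
#                       {"af_heart": 0.5, "af_nicole": 0.5})
```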
Tip: The API exposes the full CRUD workflow under `/api/custom-voices`, so you can script voice creation or keep predefined blends in source control.
Settings are stored in config.json:
{
"replicate_api_key": "",
"chunk_size": 500,
"sample_rate": 24000,
"speed": 1.0,
"output_format": "mp3",
"output_bitrate_kbps": 128,
"crossfade_duration": 0.1,
"intro_silence_ms": 0,
"inter_chunk_silence_ms": 0,
"tts_engine": "kokoro",
"chatterbox_turbo_local_device": "auto",
"chatterbox_turbo_local_temperature": 0.8,
"chatterbox_turbo_replicate_model": "resemble-ai/chatterbox-turbo",
"gemini_api_key": "",
"gemini_model": "gemini-2.0-flash"
}

Supported `tts_engine` values:

| Value | Description |
|---|---|
| `kokoro` | Kokoro-82M local GPU inference |
| `kokoro_replicate` | Kokoro via Replicate cloud API |
| `chatterbox_turbo_local` | Chatterbox local GPU with voice cloning |
| `chatterbox_turbo_replicate` | Chatterbox via Replicate cloud API |
Any settings you override in the Generate tab (format, bitrate, engine) are sent along with the job payload while keeping the saved defaults intact.
TTS-Story/
├── app.py # Flask web server
├── requirements.txt # Python dependencies
├── setup.bat # Windows setup script
├── run.bat # Windows run script
├── config.json # Configuration file
├── src/
│ ├── tts_engine.py # TTS engine registry and factory
│ ├── replicate_api.py # Replicate API integration (Kokoro)
│ ├── text_processor.py # Text chunking and parsing
│ ├── audio_merger.py # Audio file merging
│ ├── voice_manager.py # Voice configuration and preview sample metadata
│ ├── voice_sample_generator.py # Batch generation of voice preview samples
│ └── engines/
│ ├── kokoro_engine.py # Kokoro-82M local engine
│ ├── chatterbox_turbo_local_engine.py # Chatterbox local GPU engine
│ └── chatterbox_turbo_replicate_engine.py # Chatterbox Replicate engine
├── static/
│ ├── css/
│ │ └── style.css
│ ├── js/
│ │ ├── main.js
│ │ ├── queue.js
│ │ ├── library.js
│ │ ├── voice-manager.js
│ │ └── settings.js
│ ├── audio/ # Generated audio files (per-job subdirectories)
│ └── samples/ # Voice preview samples and manifest.json
├── data/
│ └── voice_prompts/ # Chatterbox voice recordings for cloning
└── templates/
└── index.html # Web interface
- `GET /` - Main web interface
- `GET /api/health` - Health check (TTS engine, Kokoro availability, CUDA status)
- `GET /api/voices` - Get available voices and preview sample status
- `POST /api/voices/samples` - Generate or regenerate voice preview samples
- `GET /api/settings` - Get current settings
- `POST /api/settings` - Update settings
- `POST /api/analyze` - Analyze text and return statistics/speakers
- `POST /api/gemini/sections` - Preview the sections (chapters/chunks) Gemini will process for a given input
- `POST /api/gemini/process-section` - Send a single section to Gemini (called in sequence by the frontend for live progress updates)
- `POST /api/gemini/process` - Process the entire text through Gemini in one backend call (used for scripted workflows)
- `POST /api/gemini/models` - Fetch available Gemini models after providing an API key
- `POST /api/generate` - Queue a new audio generation job
- `GET /api/status/<job_id>` - Check status of a specific job
- `POST /api/cancel/<job_id>` - Cancel a queued or running job
- `GET /api/queue` - Get all jobs, their status, and current queue size
- `GET /api/download/<job_id>` - Download generated audio file
- `GET /api/library` - List all completed audio files
- `DELETE /api/library/<job_id>` - Delete a specific library item
- `POST /api/library/clear` - Delete all library items
- `GET /api/custom-voices` - List custom voice blends (includes normalized metadata and component weights)
- `POST /api/custom-voices` - Create a new custom voice blend
- `GET /api/custom-voices/<voice_id>` - Retrieve a specific custom voice (ID or `custom_*` code)
- `PUT /api/custom-voices/<voice_id>` - Update an existing custom voice blend
- `DELETE /api/custom-voices/<voice_id>` - Delete a custom voice and invalidate cached tensors
- `GET /api/chatterbox-voices` - List saved Chatterbox voice prompts
- `POST /api/chatterbox-voices` - Upload a new Chatterbox voice prompt
- `PUT /api/chatterbox-voices/<voice_id>` - Rename a Chatterbox voice
- `DELETE /api/chatterbox-voices/<voice_id>` - Delete a Chatterbox voice
- `GET /api/chatterbox-voices/<voice_id>/preview` - Preview a Chatterbox voice
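As an example of scripting the queue, the sketch below submits a job, polls its status, and downloads the result. The endpoints and status values are documented above; the `text` and `voice` payload keys and the `job_id` response key are assumptions, while the per-job overrides mirror `config.json` keys.

```python
# Sketch of scripting a generation job end-to-end. The "text"/"voice" payload
# keys and the "job_id" response key are assumptions; the override keys mirror
# config.json.
import time
import requests

BASE_URL = "http://localhost:5000"

job = requests.post(f"{BASE_URL}/api/generate", json={
    "text": "[narrator]Once upon a time...[/narrator]",
    "voice": "af_heart",            # assumed key for the default voice
    "tts_engine": "kokoro",         # per-job override, mirrors config.json
    "output_format": "mp3",
    "output_bitrate_kbps": 128,
}).json()
job_id = job["job_id"]              # assumed response key

# Poll until the job reaches a terminal status.
while True:
    status = requests.get(f"{BASE_URL}/api/status/{job_id}").json()
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)

if status.get("status") == "completed":
    audio = requests.get(f"{BASE_URL}/api/download/{job_id}")
    open(f"{job_id}.mp3", "wb").write(audio.content)
```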
Kokoro · Local GPU
- ~2 seconds per chunk (500 words)
- No API costs
- Full privacy

Kokoro · Replicate
- ~2-3 seconds per chunk (varies by input)
- Cost varies by usage
- No GPU required
- Model: jaaari/kokoro-82m

Chatterbox · Local GPU
- Requires ~8GB VRAM
- Voice cloning from 10-15 second audio samples
- No API costs
- Full privacy

Chatterbox · Replicate
- Voice cloning via cloud API
- Model: resemble-ai/chatterbox-turbo
- Cost varies by usage
- No GPU required
- espeak-ng errors: Make sure espeak-ng is installed and in your `PATH`.
- Out of GPU memory: Reduce `chunk_size` in settings or use a Replicate engine instead of local GPU.
- Audio too fast or too slow: Adjust the `speed` parameter (0.5 - 2.0) in settings.
Apache 2.0 - Same as Kokoro-82M
- Kokoro-82M by hexgrad
- Chatterbox by Resemble AI
- StyleTTS2 by yl4579
- Replicate for cloud API
For issues and questions, please open an issue on GitHub.