An enhanced, one-click-installable studio built on Jordan Darefsky's Echo-TTS project, with additional code (chunking, etc.) borrowed from KevinAHM/echo-tts-api. Built for creators who want high-quality TTS with voice cloning and video translation dubbing without the setup headaches.
Model: jordand/echo-tts-base | Blog: https://jordandarefsky.com/blog/2025/echo/
- VRAM: 12 GB minimum (NVIDIA GPU recommended)
- Platform: Windows · Linux · macOS
- Install: One click via Pinokio — handles Python, dependencies, and model downloads automatically
- Install Pinokio if you haven't already
- Click the install badge above, or search "EchoStudio" in the Pinokio app
- Hit Install → Start → done
- Voice cloning from reference audio
- Multi-speaker support (S1/S2 tagging)
- Long-form generation with automatic text chunking and crossfade stitching
- Sampler presets and full control over CFG guidance, sampling style, and KV scaling
- Upload video, extract audio, and transcribe/translate with Whisper
- Editable transcript with segment timing
- Re-voice translated speech with TTS using cloned or saved voices
- Preserve background audio — AI source separation mixes ambient/background with the new TTS voice
- Multi-speaker dubbing with S1/S2 tags
- Upload audio or video files as voice sources
- Edit saved voices directly
- Clip, trim silence, adjust speed, and normalize volume
- Vocal isolation — separate clean vocals from noisy recordings (BS-Roformer, MDX-Net via audio-separator)
- Background isolation for extracting ambience/music
- Save edited voices as named profiles with cached speaker latents
- Theme selection, memory management, custom output directory, temp file cleanup
Echo generates up to 30 seconds of audio per chunk. Longer text is automatically split and stitched with configurable silence gaps and crossfade. Shorter text produces shorter outputs naturally.
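The stitching step can be pictured as a simple overlap-add: insert a short silence gap between chunks, then blend the boundary with a linear crossfade. This is an illustrative sketch only, not EchoStudio's actual implementation; the function name and the `fade_ms`/`gap_ms` defaults are made up for the example.

```python
import numpy as np

def crossfade_stitch(chunks, sr=44100, fade_ms=50, gap_ms=100):
    """Join audio chunks with a silence gap and a linear crossfade.

    `chunks` is a list of 1-D float arrays at sample rate `sr`.
    The gap and fade lengths are hypothetical defaults.
    """
    fade = int(sr * fade_ms / 1000)
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)
    ramp = np.linspace(0.0, 1.0, fade, dtype=np.float32)

    out = chunks[0].astype(np.float32)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float32)
        out = np.concatenate([out, gap])
        # Overlap the last `fade` samples of `out` with the first
        # `fade` samples of the next chunk, fading one out and the
        # other in, so the seam has no click.
        head, tail = out[:-fade], out[-fade:]
        mixed = tail * (1.0 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([head, mixed, nxt[fade:]])
    return out
```

Because the fade region overlaps, the stitched length is the sum of the chunks plus the gaps, minus one fade per seam.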
Up to 5 minutes of reference audio is supported, but shorter clips (10 seconds or less) work well too. Use the Voices tab to clip, clean, and isolate vocals from noisy recordings.
If the model generates a different speaker than expected, enable "Force Speaker" (default scale 1.5). Aim for the lowest scale that produces the correct speaker.
Use [S1] and [S2] for speaker tags. Expression markers like (laughs), (angry), (whispering) control tone. Commas function as pauses.
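A dialogue script combining these conventions might look like the following (the lines themselves are invented for illustration):

```text
[S1] Welcome back, everyone. (laughs) I can't believe we made it to episode fifty.
[S2] (whispering) Neither can I, honestly.
[S1] So, without further ado, let's get started.
```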
Don't use this model to impersonate real people without consent or generate deceptive audio. You are responsible for complying with local laws regarding biometric data and voice cloning.
Code in this repo is MIT-licensed except where file headers specify otherwise (e.g., autoencoder.py is Apache-2.0).
Audio outputs are CC-BY-NC-SA-4.0 due to the dependency on the Fish Speech S1-DAC autoencoder. Echo-TTS weights are released under CC-BY-NC-SA-4.0.
@misc{darefsky2025echo,
  author = {Darefsky, Jordan},
  title  = {Echo-TTS},
  year   = {2025},
  url    = {https://jordandarefsky.com/blog/2025/echo/}
}