Gian Mario Favero, September 2025
- Upload Inputs – Provide an `.mp4` video and a matching `.srt` subtitle file for translation.
- Subtitles Toggle – Choose whether to render subtitles in the final output.
- Translation Type – Select between:
- Dub – Simple audio replacement with translated speech.
- Full Lip-Sync – Uses Wav2Lip models to synchronize lip movements with the translated audio.
- Random Seed – Set a seed to control generation randomness and reproducibility.
- Advanced Wav2Lip Settings (Optional):
- Choose a specific Wav2Lip model.
- Configure padding for mouth movements.
- Adjust processing resize factor for balancing quality and speed.
- Download Outputs:
  - Download the generated German transcript (`de_audio.srt`), German audio (`de_audio.wav`), or dubbed/lip-synced video (`output.mp4`).
We tested this project on Python ≥ 3.10. It is recommended to create a fresh virtual environment before installing dependencies.
```bash
git clone https://github.com/faverogian/video-translation.git
cd video-translation
```

Before installing Python dependencies, make sure you have the following system packages:
- ffmpeg – required for audio/video processing
- espeak-ng – required for some TTS backends
- Python 3.10+
- (Optional, but recommended) CUDA toolkit if you want GPU acceleration with PyTorch
macOS (Homebrew):

```bash
brew install ffmpeg espeak
```

Ubuntu/Debian:

```bash
sudo apt update
sudo apt install ffmpeg espeak-ng
```

Windows:

- Install FFmpeg and add it to your PATH
- Install espeak-ng
```bash
# Create and activate a virtual environment (Python 3.10+)
python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate     # On Windows

# Install dependencies
pip install -r requirements.txt  # --no-cache-dir can ensure a fresh installation
pip install lipsync              # --no-cache-dir can ensure a fresh installation
```

Note that the two pip install commands must be run separately. There are soft dependency conflicts between lipsync and coqui-tts that get improperly resolved if all packages are listed in requirements.txt. After running `pip install lipsync`, pip will appear to report an error -- this is expected, and the code works as usual.
This project uses the mowshon/lipsync library for lip dubbing. To make full use of the tool, we recommend downloading both available models so you have some control over generation quality.
- Download the Wav2Lip model weights:
  - `wav2lip.pth`: more accurate synchronization, blurrier output
  - `wav2lip_gan.pth`: less accurate synchronization, sharper output
- Upload `wav2lip.pth` and `wav2lip_gan.pth` to the `weights/` directory.
After installing the dependencies, you can run the Gradio app locally.
```bash
source venv/bin/activate
python main.py
```

By default, Gradio will launch a local server and provide you with a link such as:

```
Running on local URL: http://127.0.0.1:7860
```
Additionally, a public URL should be provided. If you are using a remote/SSH connection, this public URL works without port forwarding. Occasionally the public URL fails to be created -- re-running the app should resolve this.
Copy and paste the link into your browser to interact with the app. If you are running on a local machine or using port forwarding, use the local URL; otherwise, use the public URL.
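For context, the public link comes from Gradio's built-in share option. Below is a minimal, hypothetical example (not the actual main.py) showing how both URLs are produced:

```python
# Hypothetical toy Gradio app; the real main.py builds a richer interface.
import gradio as gr

demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
# share=True asks Gradio to also create a temporary public *.gradio.live URL
# in addition to the local http://127.0.0.1:7860 server.
demo.launch(share=True)
```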
Upon running for the first time, several models in the pipeline will have their weights downloaded automatically. You need to accept the terms of the Coqui-TTS license agreement in the terminal by entering 'y':
```
 > You must confirm the following:
 | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"
 | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]
```

Processing speed depends on video resolution, length, and machine hardware. For example, full lip synchronization with Wav2Lip on a 1-minute 1920×1080 video takes ~5 minutes on a single NVIDIA V100 (16 GB). Lower-end GPUs or CPUs will take longer.
Our tool converts an English video into a German-dubbed or lip-synced version while preserving timing, naturalness, and original quality.
A GPU is used automatically if available; otherwise, the pipeline defaults to a CPU implementation.
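A rough illustration of that fallback, assuming the pipeline picks a PyTorch device this way (the repository code may differ):

```python
# Sketch of the GPU/CPU fallback described above (assumes PyTorch is installed).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running pipeline on: {device}")
```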
- Transcript Translation
  - We translate the input `.srt` subtitle file from English → German using Helsinki-NLP/opus-mt-en-de (the Transformers implementation from Hugging Face), as sketched below.
  - Input: English `.srt` file
  - Output: German `.srt` file with the same timestamps
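A minimal sketch of this step, assuming the Hugging Face translation pipeline is used directly (the repository's wrapper code may differ):

```python
# Sketch: translate subtitle lines with Helsinki-NLP/opus-mt-en-de while
# keeping the original SRT timestamps untouched.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def translate_lines(english_lines):
    """Translate a list of English subtitle texts into German."""
    return [result["translation_text"] for result in translator(english_lines)]

print(translate_lines(["Hello, how are you today?"]))
```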
- Video/Audio Separation
  - Using pydub, we split the original `.mp4` into separate video and audio tracks for processing (see the short sketch that follows).
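A minimal sketch of the audio extraction, assuming ffmpeg is on the PATH (pydub delegates decoding to it); file names are illustrative:

```python
# Sketch: pull the English audio track out of the source .mp4 for voice cloning/TTS.
from pydub import AudioSegment

audio = AudioSegment.from_file("input.mp4", format="mp4")
audio.export("en_audio.wav", format="wav")
```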
- Voice Cloning & TTS
  - We use the Coqui XTTS model to first clone the speaker’s voice from the original audio, then generate German audio line by line.
  - Synchronization challenges: TTS can hallucinate (add extra words) or produce slower-than-expected speech.
  - Correction pipeline (sketched below):
    - We compare TTS output timings with the original subtitle timestamps. The output is usually too long by a little (slow pace of speech) or by a lot (hallucinations).
    - If misaligned, we re-generate the segment at progressively faster speeds.
    - We allow up to 15 retries, keeping the best-aligned sample.
    - This corrects both hallucinations and slow cadence, ensuring synchronized speech.
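The correction loop can be pictured roughly as below; the helper name, the `speed` argument, and the retry increment are assumptions for illustration, not the exact implementation:

```python
# Sketch of per-segment generation with alignment retries (simplified).
from TTS.api import TTS
from pydub import AudioSegment

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def generate_aligned(text, speaker_wav, target_ms, max_retries=15):
    """Keep the German sample whose duration is closest to the subtitle window."""
    best, best_gap, speed = None, float("inf"), 1.0
    for _ in range(max_retries):
        tts.tts_to_file(text=text, speaker_wav=speaker_wav, language="de",
                        file_path="segment.wav", speed=speed)  # speed kwarg assumed
        segment = AudioSegment.from_wav("segment.wav")
        gap = abs(len(segment) - target_ms)  # len() is duration in milliseconds
        if gap < best_gap:
            best, best_gap = segment, gap
        if len(segment) <= target_ms:  # already fits the original subtitle window
            break
        speed += 0.1                   # re-generate progressively faster
    return best
```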
- Audio Replacement
  - The new German audio is swapped into the original video, replacing the English track (a minimal ffmpeg-based sketch follows).
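One way to picture the swap is an ffmpeg remux that keeps the video stream untouched; a sketch under that assumption, with illustrative file names:

```python
# Sketch: replace the English audio stream with the German one, copying video as-is.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "input.mp4",       # original video
    "-i", "de_audio.wav",    # generated German audio
    "-map", "0:v", "-map", "1:a",
    "-c:v", "copy", "-c:a", "aac",
    "dubbed.mp4",
], check=True)
```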
- Optional Subtitles
  - If selected, we overlay the German `.srt` file onto the video (see the sketch below).
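Burning subtitles can be done with ffmpeg's `subtitles` filter; a minimal sketch with illustrative file names:

```python
# Sketch: render the German .srt onto the video frames.
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "dubbed.mp4",
    "-vf", "subtitles=de_audio.srt",
    "-c:a", "copy",
    "subtitled.mp4",
], check=True)
```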
- Optional Lip Synchronization
  - If lip-sync is requested, we use lipsync to adjust mouth movements to match the new German audio (sketched below).
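A rough sketch of driving the lipsync library, with the constructor arguments assumed from its public README rather than taken from the pipeline code:

```python
# Sketch: run Wav2Lip through the lipsync package (API assumed from its README).
from lipsync import LipSync

lip = LipSync(
    model='wav2lip',
    checkpoint_path='weights/wav2lip.pth',  # or weights/wav2lip_gan.pth
    device='cuda',
)
lip.sync('dubbed.mp4', 'de_audio.wav', 'output.mp4')
```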
- Output
  - Outputs are downloadable via the Gradio application (`de_audio.srt`, `de_audio.wav`, `output.mp4`).
While the tool produces usable German-dubbed and lip-synced videos, there are several limitations to keep in mind:
- Translation Quality
  - Transcript translation is handled by a pretrained machine translation model (Helsinki-NLP/opus-mt-en-de). Output may contain errors, unnatural phrasing, or context mismatches depending on the input.
- TTS Variability
  - The Coqui XTTS model sometimes introduces artifacts such as hallucinated words or unnatural prosody. We mitigate this with alignment retries, but quality still varies.
  - Tip: try generating with different random seeds for more natural results.
- Language Mismatch in Timing
  - Our main priority is keeping the original video unedited. German typically requires more syllables per second than English, making perfect alignment challenging.
  - We handle this by speeding up TTS audio when necessary, but this can occasionally sound unnatural.
- Lip Synchronization
  - Lip-sync is powered by the open-source Wav2Lip framework. While effective, it is far from perfect: lip movements may appear slightly exaggerated, rigid, or unnatural.
  - There is much room for improvement with more advanced or modern lip-sync models.
- Performance & Hardware Constraints
  - Processing speed depends heavily on GPU power. For example, full lip synchronization with Wav2Lip on a 1920×1080 video takes ~5 minutes on a single NVIDIA V100 (16 GB). Lower-end GPUs or CPUs will take longer.
- Gradio Bugs
  - Known bug with video preview: sometimes the video will not play in the app. It can still be viewed after downloading.
We envision several improvements to enhance the quality, flexibility, and efficiency of the pipeline:
- Advanced & Optimized Models
  - Each stage of the pipeline (translation, TTS, lip-sync) can benefit from more powerful and efficient models.
  - For example: using LLMs for word-efficient translation (Mistral, LLaMA, Qwen, Gemini), alternative TTS models for more natural prosody (Kokoro), or modern lip-sync models for faster and more accurate mouth movements (LatentSync, MuseTalk, Halo).
  - Performing a full ablation study would help understand the contribution of each component and identify bottlenecks or quality limitations.
- Voice Separation for Complex Audio
  - Currently, background music and other audio sources are processed along with the speech, which can cause artifacts when dubbing.
  - Using tools like Demucs to isolate the speaker’s voice would allow:
    - Translating only the voice track while preserving background music, sound effects, and ambience.
    - Cleaner and more natural-sounding dubbed audio for videos with complex soundscapes.
- Extended Multi-Language Support
  - Support for additional source/target languages beyond English → German.
  - Integration with multilingual TTS and lip-sync models for consistent pipeline behavior.
- Performance Optimizations
  - GPU-specific optimizations, mixed-precision inference, or batching strategies for faster processing of long videos.
  - Potential to reduce memory footprint and processing time for high-resolution or long-form videos.
Please be mindful when using lipsync. This library can generate videos that look convincing, so it could be used to spread disinformation or harm someone’s reputation. We encourage using it only for entertainment or scientific purposes, and always with respect and consent from any people involved.
```bibtex
@inproceedings{10.1145/3394171.3413532,
  author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
  title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
  year = {2020},
  isbn = {9781450379885},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3394171.3413532},
  doi = {10.1145/3394171.3413532},
  booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
  pages = {484–492},
  numpages = {9},
  keywords = {lip sync, talking face generation, video generation},
  location = {Seattle, WA, USA},
  series = {MM '20}
}
```
```bibtex
@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
}
```

