Training the missing codec encoder for Mistral's Voxtral-4B-TTS, enabling zero-shot voice cloning on the open-weight model.
Mistral released Voxtral-4B-TTS-2603 with an important gap: the codec encoder weights were not included. Without them, the model is limited to 20 preset voices and cannot clone new voices from audio.
This project:
- Trains the codec encoder from scratch, following the paper's training recipe (stochastic quantization, ASR distillation, multi-resolution discriminator)
- Fine-tunes the LLM with LoRA so it interprets our encoder's output for voice identity transfer
- Provides tooling to inject the trained weights and enable `ref_audio` voice cloning
The Voxtral codec is a VQ-FSQ hybrid that compresses audio to 2.14 kbps:
- 12.5 Hz frame rate (240-sample patch at 24kHz, 8x downsampling)
- 1 semantic code (VQ, 8192 entries) + 36 acoustic codes (FSQ, 21 levels)
- Voice embeddings = sum of 37 codebook lookups per frame -> `[N, 3072]` (sketched below)
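To make the shape math concrete, here is a minimal sketch of the per-frame embedding lookup using plain `nn.Embedding` tables (illustrative only; the real codebooks live in the checkpoint):

```python
import torch
import torch.nn as nn

SAMPLE_RATE = 24_000
SAMPLES_PER_FRAME = int(SAMPLE_RATE / 12.5)  # 1920 = 240-sample patch * 8x downsampling

# 1 semantic VQ codebook + 36 acoustic FSQ codebooks, all in the 3072-dim space.
semantic_table = nn.Embedding(8192, 3072)
acoustic_tables = nn.ModuleList([nn.Embedding(21, 3072) for _ in range(36)])

def codes_to_voice_embedding(semantic: torch.Tensor, acoustic: torch.Tensor) -> torch.Tensor:
    """semantic: [N] in [0, 8192); acoustic: [N, 36] in [0, 21) -> [N, 3072]."""
    emb = semantic_table(semantic)               # semantic lookup, [N, 3072]
    for i, table in enumerate(acoustic_tables):
        emb = emb + table(acoustic[:, i])        # sum of all 37 lookups per frame
    return emb

n_frames = 4 * SAMPLE_RATE // SAMPLES_PER_FRAME  # 4 s of audio -> 50 frames at 12.5 Hz
emb = codes_to_voice_embedding(
    torch.randint(0, 8192, (n_frames,)),
    torch.randint(0, 21, (n_frames, 36)),
)
print(emb.shape)  # torch.Size([50, 3072])
```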
See ARCHITECTURE.md for the full technical breakdown, weight mapping, and research findings.
- 1x GPU with >= 80GB VRAM (A100/H100/GH200)
- Voxtral-4B-TTS-2603 weights downloaded
- Python 3.10+
```bash
pip install -r requirements.txt
export VOXTRAL_CKPT=/path/to/Voxtral-4B-TTS-2603
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

```bash
python train_encoder.py
```

The script auto-detects and combines datasets from:
- LibriSpeech (train-clean-360, train-other-500)
- Common Voice (English, Arabic)
- Generated preset clips
Training follows the paper's recipe:
- Stochastic quantization (50% quantize / 25% dither / 25% passthrough on the acoustic path; see the sketch after this list)
- Whisper ASR distillation for semantic token diversity
- 8 multi-resolution STFT discriminators with feature matching
- Exponentially decaying reconstruction losses
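A minimal sketch of the 50/25/25 schedule on the acoustic path; the function shape is illustrative and assumes latents already squashed into the FSQ range:

```python
import torch

def stochastic_fsq(z: torch.Tensor, levels: int = 21) -> torch.Tensor:
    """50/25/25 stochastic quantization for the acoustic latents (training only).

    Assumes `z` is pre-squashed into [0, levels - 1]. Per batch element:
    50% hard-quantize with a straight-through estimator, 25% add uniform
    dither one quantization bin wide, 25% pass the latent through untouched.
    """
    hard = z + (torch.round(z) - z).detach()        # straight-through rounding
    dithered = z + (torch.rand_like(z) - 0.5)       # +/- half a bin of noise

    r = torch.rand(z.shape[0], device=z.device)
    mode = (r >= 0.50).long() + (r >= 0.75).long()  # 0=quantize, 1=dither, 2=pass
    mode = mode.view(-1, *([1] * (z.dim() - 1)))    # broadcast over frame/code dims

    out = torch.where(mode == 0, hard, torch.where(mode == 1, dithered, z))
    return out.clamp(0, levels - 1)
```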
```bash
python train_full_pipeline.py
```

This distills the LLM to interpret our encoder's voice embeddings by matching hidden states between the preset embeddings (teacher) and our encoder's output (student) across all 26 transformer layers.
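Conceptually, the Phase 2 objective is a hidden-state matching loss; a minimal sketch, with the per-layer hidden-state capture and `audio_mask` left to the real pipeline:

```python
import torch
import torch.nn.functional as F

def distill_loss(teacher_hidden, student_hidden, audio_mask):
    """Mean MSE between teacher and student hidden states at audio positions.

    teacher_hidden / student_hidden: per-layer lists of [B, T, D] tensors
    (26 layers for Voxtral-4B) from the same prompt run with the preset voice
    embedding (teacher) vs. our encoder's embedding (student).
    audio_mask: [B, T] bool, True at the voice-embedding token positions.
    """
    losses = [
        F.mse_loss(s[audio_mask], t[audio_mask].detach())
        for t, s in zip(teacher_hidden, student_hidden)
    ]
    return torch.stack(losses).mean()
```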
```bash
python inject_encoder.py

# Enable custom voices in the tokenizer
export VOXTRAL_VOICE_DIR=/path/to/voice_embeddings
python patch_tokenizer.py
```

After injection, any serving framework that loads the checkpoint will have `ref_audio` cloning enabled. Pass a reference audio clip and the model will generate speech in that voice.
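Purely as an illustration, a request to a hypothetical HTTP serving endpoint could look like this; the URL and field names are placeholders, not a documented API:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",    # hypothetical endpoint
    json={
        "input": "Hello from a cloned voice.",
        "ref_audio": "/path/to/reference.wav",  # clip whose voice to clone
    },
)
open("cloned.wav", "wb").write(resp.content)
```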
Codec encoder training (Phase 1):

| Hyperparameter | Value |
|---|---|
| Batch size | 4 |
| Max audio length | 4 seconds |
| Learning rate | 3e-4 (cosine decay) |
| Optimizer | AdamW (betas 0.8, 0.99) |
| Discriminator warmup | 2000 steps |
| Epochs | 10-20 |
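These hyperparameters map onto a standard PyTorch loop; a sketch, where `encoder`, `discriminators`, `loader`, and the two loss helpers are placeholders for the objects in `train_encoder.py`, and the 10k-step cosine horizon is an assumption:

```python
import torch

opt = torch.optim.AdamW(encoder.parameters(), lr=3e-4, betas=(0.8, 0.99))
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10_000)  # assumed step count

for step, batch in enumerate(loader):        # batch size 4, clips capped at 4 seconds
    loss = reconstruction_loss(encoder, batch)
    if step >= 2_000:                        # discriminators join after their warmup
        loss = loss + adversarial_loss(encoder, discriminators, batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```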
LoRA distillation (Phase 2):

| Hyperparameter | Value |
|---|---|
| LoRA rank | 8 |
| LoRA targets | wq, wk, wv, wo (all 26 layers) |
| Learning rate | 2e-5 |
| Loss | MSE on hidden states at audio positions |
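If the fine-tune is wired with Hugging Face `peft` (an assumption; the repo may implement LoRA by hand), the table translates roughly to:

```python
import torch
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8,                                      # LoRA rank from the table above
    target_modules=["wq", "wk", "wv", "wo"],  # attention projections in every layer
    lora_alpha=16,                            # assumed scaling; not specified above
)
model = get_peft_model(llm, lora_cfg)         # `llm`: the loaded Voxtral LLM (placeholder)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
```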
For research-quality results:
- Minimum: ~30k clips (~100h) from LibriSpeech train-clean-360
- Recommended: ~300k clips (~900h) mixing LibriSpeech + Common Voice
- Production: 1M+ clips across languages and speaker diversity
Voxtral's codec encoder is deliberately withheld from the open-weight release. The decoder, quantizer, and all LLM weights are available, but the encoder -- which converts raw audio to the discrete code space -- is missing. This means the model cannot process reference audio for voice cloning.
The encoder architecture is fully specified in the model's serving code. We discovered a 4-stage convolutional-transformer encoder with:
- 149M parameters across 114 weight tensors
- 8 causal transformer layers with ALiBi attention
- Sliding window attention halving at each downsampling stage
- VQ-FSQ hybrid quantization splitting 292 latent dims
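The causal layers use ALiBi, i.e. a per-head linear penalty on key distance instead of positional embeddings; a minimal sketch of the standard bias (head count illustrative, the causal mask is applied separately):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Standard ALiBi bias, [n_heads, seq_len, seq_len], added to attention logits."""
    # Geometric head slopes 2^(-8i/H), as in the original ALiBi formulation.
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).clamp(max=0)  # (j - i): 0 for future keys
    return slopes[:, None, None] * dist[None, :, :]    # linear penalty on past distance

print(alibi_bias(8, 4)[0])  # head 0: zeros on the diagonal, -slope * distance below it
```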
Wall 1 - Codebook Collapse: Naive training produces `sem_util=1/8192` (only 1 of 8192 semantic codes used). Solved by ASR distillation from Whisper.
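The `sem_util` figure comes from a plain codebook-utilization check, e.g.:

```python
import torch

def semantic_utilization(codes: torch.Tensor, codebook_size: int = 8192) -> str:
    """Report how many distinct semantic codes a batch actually uses."""
    return f"sem_util={torch.unique(codes).numel()}/{codebook_size}"

# A collapsed encoder emits one code everywhere:
print(semantic_utilization(torch.zeros(1000, dtype=torch.long)))  # sem_util=1/8192
```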
Wall 2 - Binary Code Saturation: Without stochastic quantization, acoustic codes collapse to extremes (0 and 20 only). The 50/25/25 schedule from the paper teaches intermediate values.
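Saturation shows up immediately in a histogram of the 21 FSQ levels; a small diagnostic sketch:

```python
import torch

def fsq_level_histogram(acoustic_codes: torch.Tensor, levels: int = 21) -> torch.Tensor:
    """Counts per FSQ level; a saturated encoder piles almost everything on 0 and 20."""
    return torch.bincount(acoustic_codes.flatten(), minlength=levels)
```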
Wall 3 - Training-Inference Mismatch: An encoder that reconstructs audio perfectly can still produce voice embeddings the LLM rejects. Solved by Phase 2 LoRA distillation that adapts the LLM to our encoder's code patterns.
Wall 4 - Embedding Sensitivity: The LLM distinguishes 20 preset voices using only 2-3% cosine similarity differences. Our embeddings must match not just the statistical distribution but the per-dimension sparsity pattern of genuine voice embeddings.
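That sensitivity is measurable by comparing preset voice embeddings pairwise; a minimal sketch, assuming mean-pooled per-voice embeddings:

```python
import torch
import torch.nn.functional as F

def pairwise_voice_similarity(voices: torch.Tensor) -> torch.Tensor:
    """voices: [V, D], one pooled embedding per voice -> [V, V] cosine similarities.

    Per the observation above, Voxtral's presets differ by only a few percent
    off-diagonal, so cloned embeddings must land inside that narrow band.
    """
    v = F.normalize(voices, dim=-1)
    return v @ v.T

sims = pairwise_voice_similarity(torch.randn(20, 3072))  # 20 presets, 3072-dim
```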
Phase 1 (codec encoder training) is producing promising results with paper-aligned training. Phase 2 (LoRA distillation) follows.
This project is licensed under CC BY-NC 4.0. The trained weights are derivative of Mistral's Voxtral-4B-TTS model and subject to its license terms.
```bibtex
@misc{voxtral-voice-clone,
  title={Training the Missing Voxtral Codec Encoder for Zero-Shot Voice Cloning},
  author={al0olo},
  year={2025},
  url={https://github.com/al0olo/voxtral-voice-clone}
}
```

- Mistral AI for the Voxtral-4B-TTS model and paper
- OpenAI Whisper for ASR distillation
- The LibriSpeech and Common Voice communities for open audio datasets