Keywords: ARDiT, AR-DiT, Autoregressive Diffusion Transformer, TTS, Text-to-Speech, Mel-Spectrogram
A resource-friendly Text-to-Speech system inspired by AR-DiT (ARDiT), combining an autoregressive Transformer (a Qwen3 LLM) with a diffusion model. It generates Mel spectrograms through a diffusion process, then converts them to audio with a vocoder.
✨ Minimal AR-DiT TTS training and inference pipeline that can train on an 8000-hour dataset using a single RTX 5090 (32GB) and produce intelligible speech synthesis results within two days.
PS: The diffusion backbone uses RFWave's ConvNeXt architecture, not DiT.
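As a rough illustration of how the second stage works, the sketch below integrates a rectified-flow-style velocity field from noise to a block of Mel frames with plain Euler steps. All names here are hypothetical placeholders, not this repository's API; in the real model the conditioning comes from the autoregressive Qwen3 hidden states and the velocity field is the RFWave-style ConvNeXt estimator.

```python
# Conceptual sketch only -- not this repo's API. Names (velocity_field, cond)
# are placeholders; the actual estimator and sampler live in rfwave/.
import torch

def sample_mel_block(velocity_field, cond, n_mels=100, n_frames=64, steps=16):
    """Integrate a learned velocity field from Gaussian noise (t=0)
    toward a Mel-spectrogram block (t=1) with plain Euler steps."""
    x = torch.randn(1, n_mels, n_frames)      # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v = velocity_field(x, t, cond)        # predicted dx/dt given conditioning
        x = x + dt * v                        # one Euler step toward the data
    return x

# Toy stand-in so the sketch runs end to end: pretend the "data" equals cond.
velocity_field = lambda x, t, cond: cond - x
cond = torch.zeros(1, 100, 64)
mel = sample_mel_block(velocity_field, cond)
print(mel.shape)  # torch.Size([1, 100, 64]); a vocoder would turn this into audio
```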
- 🚀 Resource-Friendly: Single RTX 5090 (32GB) can handle 8000-hour dataset training
- 📦 Minimal Implementation: Clean and concise code, easy to understand and modify, suitable for learning and development
- 🇨🇳 Chinese-Friendly: Complete Chinese documentation and Chinese data processing pipeline
- 🤗 Ready to Use: Provides pre-trained models and processed datasets for quick start
- 💡 Practical-Oriented: Achieves intelligible results within two days; longer training yields better quality. Practical rather than perfect
Audio samples generated by the trained model:
pip install -r requirements.txt

We provide pre-trained models on Hugging Face that you can use directly:
Download Pre-trained Model:
huggingface-cli download laupeng1989/armel-checkpoint --local-dir ./models/armel-checkpoint

Run Inference:
python3 scripts/mel_inference.py \
--model_path ./models/armel-checkpoint/ \
--text example_data/transcript/fanren_short.txt \
--ref_audio fanren08 \
--output_path output/generated \
--dtype bfloat16

Output Files:
- output/generated.wav: Generated audio
- output/generated.png: Mel spectrogram visualization
- output/generated.npy: Mel spectrogram array
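If you want to inspect the generated Mel spectrogram yourself, a minimal check might look like this (the exact array layout is an assumption; adjust if the shape differs):

```python
# Minimal sketch: inspect the saved Mel spectrogram from the run above.
import numpy as np
import matplotlib.pyplot as plt

mel = np.load("output/generated.npy")
print("mel shape:", mel.shape)  # assumed to be (n_mels, n_frames) or similar

plt.imshow(np.squeeze(mel), origin="lower", aspect="auto")
plt.xlabel("frame")
plt.ylabel("mel bin")
plt.tight_layout()
plt.savefig("output/mel_check.png")
```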
The --ref_audio parameter specifies the reference audio name (without extension). The script will read the corresponding .wav and .txt files from the example_data/voice_prompts/ directory:
example_data/voice_prompts/
├── fanren08.wav # Reference audio
├── fanren08.txt # Text corresponding to reference audio
├── fanren09.wav
└── fanren09.txt
You can add your own reference audio by placing the audio file and corresponding text file in this directory.
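For example, a small script like the following registers a new prompt (file names are illustrative; any short, clean recording with an exact transcript works):

```python
# Add a custom voice prompt (file names are examples only).
import shutil
from pathlib import Path

prompt_dir = Path("example_data/voice_prompts")
shutil.copy("my_recording.wav", prompt_dir / "my_voice.wav")
(prompt_dir / "my_voice.txt").write_text(
    "Exact transcript of my_recording.wav", encoding="utf-8"
)
# Afterwards, pass --ref_audio my_voice to scripts/mel_inference.py.
```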
If you want to train your own model from scratch, follow these steps.
We provide a processed training dataset on Hugging Face:
- Training Dataset: laupeng1989/armel-dataset
Download Dataset:
huggingface-cli download laupeng1989/armel-dataset --repo-type dataset --local-dir ./data/armel-dataset

💡 Tip: If using the Hugging Face dataset, you can skip the "Data Preparation" section below and proceed directly to training.
This project uses the Amphion Emilia preprocessor to process raw audio data.
Processed data format:
example_data/
├── 仙逆 第87集 身世苏醒(下) [638031163].json
├── 仙逆 第87集 身世苏醒(下) [638031163]_000000.m4a
├── 仙逆 第87集 身世苏醒(下) [638031163]_000001.m4a
├── 仙逆 第87集 身世苏醒(下) [638031163]_000002.m4a
└── ...
JSON file format (contains segmentation info and text):
[
  {
    "duration": 10.94,
    "text": "[SPEAKER_00] 欢迎收听...",
    "speaker": 0,
    "parts": [
      {
        "text": "[SPEAKER_00] 欢迎收听...",
        "start": 4.5125,
        "end": 10.1525,
        "speaker": 0,
        "language": "zh"
      }
    ]
  }
]
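For a quick sanity check of raw metadata before building the dataset, you can tally segments per file; the field names below come from the sample above, and the file path is just an example:

```python
# Summarize one Emilia-style metadata file (field names as in the sample above).
import json
from pathlib import Path

meta_path = Path("example_data/仙逆 第87集 身世苏醒(下) [638031163].json")
segments = json.loads(meta_path.read_text(encoding="utf-8"))

total_sec = sum(seg["duration"] for seg in segments)
zh_parts = [p for seg in segments for p in seg["parts"] if p["language"] == "zh"]
print(f"{len(segments)} segments, {total_sec:.1f}s total, {len(zh_parts)} Chinese parts")
```

Use build_dataset.py to convert the raw data to the training format: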
python scripts/build_dataset.py \
--data_dir <your_raw_data_dir> \
--output_dir <your_output_dir> \
--num_proc 8 \
--test_samples 100 \
--random_seed 42

This project was trained on an NVIDIA RTX 5090 (32GB).
Prepare Qwen3 Model:
model.llm_model_path can be:
- Local path: e.g., ./Qwen3-0.6B (requires prior download)
- Hugging Face model name: e.g., Qwen/Qwen3-0.6B (auto-downloads, but the first training run will be slower)
Recommended to download locally first:
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B
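Optionally, you can verify that the downloaded checkpoint loads with transformers before launching training (a quick sanity check; the path matches the download command above):

```python
# Optional sanity check: make sure the Qwen3 checkpoint loads.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("./Qwen3-0.6B")
print(model.config.hidden_size, tokenizer.vocab_size)
```

Training Command: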
python3 scripts/mel_train.py \
dataset.train_dataset_path=<your_train_data_path> \
dataset.valid_dataset_path=<your_valid_data_path> \
model.llm_model_path=./Qwen3-0.6B \
model.rfmel.batch_mul=2 \
training.batch_size=4 \
dataset.max_tokens=1024 \
training.num_workers=16 \
training.learning_rate=0.0001 \
training.log_dir=<your_log_dir> \
training.diffusion_extra_steps=4 \
training.check_val_every_n_epoch=1 \
model.use_skip_connection=true \
model.estimator.hidden_dim=512 \
model.estimator.intermediate_dim=1536 \
model.estimator.num_layers=8

Note: Lightning automatically detects and uses all available GPUs with the DDP strategy. You may need to adjust batch_size, batch_mul, and max_tokens based on your hardware configuration.
After training, export the model for inference:
python scripts/mel_export_checkpoint.py \
--ckpt_path <your_checkpoint_path>/last.ckpt \
--output_path ./exported_model/

Or specify the checkpoints directory directly (automatically selects the latest):
python scripts/mel_export_checkpoint.py \
--ckpt_path <your_checkpoint_dir>/ \
--output_path ./exported_model/

This will generate:
- model.ckpt: Model weights
- model.yaml: Inference configuration
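If you want to confirm what was exported, the configuration can be inspected directly (this assumes model.yaml is plain YAML, as the file extension suggests):

```python
# Peek at the exported inference configuration (assumes plain YAML).
import yaml

with open("./exported_model/model.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg)
```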
After exporting, you can use the inference commands from the "Inference with Pre-trained Model" section above.
ar-dit-mel/
├── ar/ # Autoregressive model
│ ├── armel.py # ARMel main model
│ ├── qwen.py # Qwen3 LLM
│ └── mel_generate.py # Mel generation
├── rfwave/ # Diffusion model
│ ├── mel_model.py # RFMel model
│ ├── mel_processor.py # Mel processor
│ └── estimator.py # Diffusion estimator
├── dataset/ # Dataset
├── scripts/ # Training and inference scripts
│ ├── build_dataset.py # Build dataset
│ ├── mel_train.py # Training script
│ ├── mel_export_checkpoint.py # Export model
│ └── mel_inference.py # Inference script
└── configs/ # Configuration files
MIT License
- Autoregressive Diffusion Transformer for Text-to-Speech Synthesis. Zhijun Liu, et al. arXiv:2406.05551
- VibeVoice Technical Report. Zhiliang Peng, et al. arXiv:2508.19205
- VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning. Yixuan Zhou, et al. arXiv:2509.24650
This project is based on the following open-source projects: