Poorman's AR-DiT TTS 📢

Keywords: ARDiT, AR-DiT, Autoregressive Diffusion Transformer, TTS, Text-to-Speech, Mel-Spectrogram

A resource-friendly Text-to-Speech system inspired by AR-DiT (ARDiT) that combines an autoregressive Transformer (a Qwen3 LLM) with a diffusion model. It generates Mel spectrograms through a diffusion process, then converts them to audio with a vocoder.

✨ Minimal AR-DiT TTS training and inference pipeline that can train on an 8000-hour dataset using a single RTX 5090 (32GB) and produce intelligible speech synthesis results within two days.

PS: The diffusion backbone uses RFWave's ConvNeXt architecture, not DiT.
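For orientation, below is a minimal sketch of a ConvNeXt-style 1D block of the kind such a backbone stacks. This is an illustration of the general pattern, not the actual RFWave code; the dim/intermediate_dim names are chosen to mirror the model.estimator options in the training command further down.

import torch
import torch.nn as nn

class ConvNeXtBlock1d(nn.Module):
    """One ConvNeXt-style block over (batch, channels, time) tensors."""

    def __init__(self, dim: int = 512, intermediate_dim: int = 1536):
        super().__init__()
        # Depthwise conv mixes information along the time axis only.
        self.dwconv = nn.Conv1d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # Inverted bottleneck: expand, apply nonlinearity, project back.
        self.pwconv1 = nn.Linear(dim, intermediate_dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(intermediate_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x).transpose(1, 2)  # (B, T, C) for norm/linear layers
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return residual + x.transpose(1, 2)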

🌟 Why Choose This Project?

  • 🚀 Resource-Friendly: Single RTX 5090 (32GB) can handle 8000-hour dataset training
  • 📦 Minimal Implementation: Clean and concise code, easy to understand and modify, suitable for learning and development
  • 🇨🇳 Chinese-Friendly: Complete Chinese documentation and Chinese data processing pipeline
  • 🤗 Ready to Use: Provides pre-trained models and processed datasets for quick start
  • 💡 Practical-Oriented: Achieves intelligible results in two days, with quality improving as training continues; practical rather than perfect

🎵 Generation Examples

Audio samples generated by the trained model are embedded in the rendered README.


📦 Installation

pip install -r requirements.txt

🚀 Quick Start

🎤 Inference with Pre-trained Model

We provide pre-trained models on Hugging Face that you can use directly:

Download Pre-trained Model:

huggingface-cli download laupeng1989/armel-checkpoint --local-dir ./models/armel-checkpoint

Run Inference:

python3 scripts/mel_inference.py \
  --model_path ./models/armel-checkpoint/ \
  --text example_data/transcript/fanren_short.txt \
  --ref_audio fanren08 \
  --output_path output/generated \
  --dtype bfloat16

Output Files:

  • output/generated.wav: Generated audio
  • output/generated.png: Mel spectrogram visualization
  • output/generated.npy: Mel spectrogram array
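To sanity-check these outputs, the saved Mel array can be loaded and plotted. A quick sketch, assuming numpy, soundfile, and matplotlib are installed (the generated_check.png filename is just an example to avoid overwriting the script's own plot):

import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt

mel = np.load("output/generated.npy")        # Mel spectrogram array
audio, sr = sf.read("output/generated.wav")  # generated waveform
print(f"mel {mel.shape}, audio {len(audio) / sr:.2f}s @ {sr} Hz")

plt.imshow(mel.squeeze(), aspect="auto", origin="lower")
plt.savefig("output/generated_check.png")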

🎧 Reference Audio Instructions

The --ref_audio parameter specifies the reference audio name (without extension). The script will read the corresponding .wav and .txt files from the example_data/voice_prompts/ directory:

example_data/voice_prompts/
├── fanren08.wav          # Reference audio
├── fanren08.txt          # Text corresponding to reference audio
├── fanren09.wav
└── fanren09.txt

You can add your own reference audio by placing the audio file and corresponding text file in this directory.
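Concretely, the lookup can be pictured like this. A minimal sketch; the actual resolution logic lives in scripts/mel_inference.py and may differ in detail:

from pathlib import Path

def load_voice_prompt(name: str, root: str = "example_data/voice_prompts"):
    """Resolve a --ref_audio name to its audio file and transcript."""
    wav_path = Path(root) / f"{name}.wav"
    txt_path = Path(root) / f"{name}.txt"
    transcript = txt_path.read_text(encoding="utf-8").strip()
    return wav_path, transcript

wav_path, transcript = load_voice_prompt("fanren08")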


🔥 Training from Scratch

If you want to train your own model from scratch, follow these steps.

🤗 Training Dataset

We provide a processed training dataset on Hugging Face:

Download Dataset:

huggingface-cli download laupeng1989/armel-dataset --repo-type dataset --local-dir ./data/armel-dataset

💡 Tip: If using the Hugging Face dataset, you can skip the "Data Preparation" section below and proceed directly to training.

📊 Data Preparation

1️⃣ Prepare Raw Data

This project uses the Amphion Emilia preprocessor to process raw audio data.

Processed data format:

example_data/
├── 仙逆 第87集 身世苏醒(下) [638031163].json
├── 仙逆 第87集 身世苏醒(下) [638031163]_000000.m4a
├── 仙逆 第87集 身世苏醒(下) [638031163]_000001.m4a
├── 仙逆 第87集 身世苏醒(下) [638031163]_000002.m4a
└── ...

JSON file format (contains segmentation info and text):

[
  {
    "duration": 10.94,
    "text": "[SPEAKER_00] 欢迎收听...",
    "speaker": 0,
    "parts": [
      {
        "text": "[SPEAKER_00] 欢迎收听...",
        "start": 4.5125,
        "end": 10.1525,
        "speaker": 0,
        "language": "zh"
      }
    ]
  }
]
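For reference, here is how one such JSON file can be paired with its numbered audio segments. A sketch: the field names follow the example above, but the assumption that list entry i maps to the _{i:06d}.m4a segment is mine, not documented behavior.

import json
from pathlib import Path

json_path = Path("example_data/仙逆 第87集 身世苏醒(下) [638031163].json")
entries = json.loads(json_path.read_text(encoding="utf-8"))

for i, entry in enumerate(entries):
    # Assumed: entry i corresponds to the _{i:06d}.m4a audio segment.
    audio_path = json_path.with_name(f"{json_path.stem}_{i:06d}.m4a")
    print(audio_path.name, f'{entry["duration"]:.2f}s', entry["text"])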

2️⃣ Build Training Dataset

Use scripts/build_dataset.py to convert the raw data into the training format:

python scripts/build_dataset.py \
  --data_dir <your_raw_data_dir> \
  --output_dir <your_output_dir> \
  --num_proc 8 \
  --test_samples 100 \
  --random_seed 42

🔥 Training

💻 Training Hardware

This project was trained on a single NVIDIA RTX 5090 (32GB).

⚡ Training Command

Prepare Qwen3 Model:

model.llm_model_path can be:

  • Local path: e.g., ./Qwen3-0.6B (requires prior download)
  • Hugging Face model name: e.g., Qwen/Qwen3-0.6B (downloads automatically, but the first training run will start more slowly)

We recommend downloading the model locally first:

huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B

Training Command:

python3 scripts/mel_train.py \
  dataset.train_dataset_path=<your_train_data_path> \
  dataset.valid_dataset_path=<your_valid_data_path> \
  model.llm_model_path=./Qwen3-0.6B \
  model.rfmel.batch_mul=2 \
  training.batch_size=4 \
  dataset.max_tokens=1024 \
  training.num_workers=16 \
  training.learning_rate=0.0001 \
  training.log_dir=<your_log_dir> \
  training.diffusion_extra_steps=4 \
  training.check_val_every_n_epoch=1 \
  model.use_skip_connection=true \
  model.estimator.hidden_dim=512 \
  model.estimator.intermediate_dim=1536 \
  model.estimator.num_layers=8

Note: Lightning automatically detects and uses all available GPUs with the DDP strategy. You may need to adjust batch_size, batch_mul, and max_tokens based on your hardware configuration.
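In Lightning terms, the auto-detection described above amounts to roughly the following Trainer setup. A sketch only; the real setup lives in scripts/mel_train.py:

import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices="auto",   # use every visible GPU
    strategy="ddp",   # one process per GPU, gradients synchronized
)
# With DDP, the effective batch size scales with the GPU count:
#   effective_batch = training.batch_size * num_gpus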

📤 Export Model

After training, export the model for inference:

python scripts/mel_export_checkpoint.py \
  --ckpt_path <your_checkpoint_path>/last.ckpt \
  --output_path ./exported_model/

Or specify the checkpoint directory directly (the latest checkpoint is selected automatically):

python scripts/mel_export_checkpoint.py \
  --ckpt_path <your_checkpoint_dir>/ \
  --output_path ./exported_model/

This will generate:

  • model.ckpt: Model weights
  • model.yaml: Inference configuration

After exporting, you can use the inference commands from the "Inference with Pre-trained Model" section above.
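The exported pair can also be inspected directly. A sketch, assuming PyYAML is available and that model.ckpt is a standard PyTorch checkpoint, as the export output described above suggests:

import torch
import yaml

with open("exported_model/model.yaml") as f:
    config = yaml.safe_load(f)  # inference configuration
state = torch.load("exported_model/model.ckpt", map_location="cpu")
print(config)
print(type(state))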


📁 Project Structure

ar-dit-mel/
├── ar/                      # Autoregressive model
│   ├── armel.py            # ARMel main model
│   ├── qwen.py             # Qwen3 LLM
│   └── mel_generate.py     # Mel generation
├── rfwave/                  # Diffusion model
│   ├── mel_model.py        # RFMel model
│   ├── mel_processor.py    # Mel processor
│   └── estimator.py        # Diffusion estimator
├── dataset/                 # Dataset
├── scripts/                 # Training and inference scripts
│   ├── build_dataset.py    # Build dataset
│   ├── mel_train.py        # Training script
│   ├── mel_export_checkpoint.py  # Export model
│   └── mel_inference.py    # Inference script
└── configs/                 # Configuration files

📜 License

MIT License

📚 Related Papers

  • Autoregressive Diffusion Transformer for Text-to-Speech Synthesis. Zhijun Liu et al. arXiv:2406.05551

  • VibeVoice Technical Report. Zhiliang Peng et al. arXiv:2508.19205

  • VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning. Yixuan Zhou et al. arXiv:2509.24650

🙏 Acknowledgments

This project builds on open-source work including RFWave (diffusion backbone), Qwen3 (autoregressive LLM), and the Amphion Emilia preprocessing pipeline.
