
CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations


This is the official implementation of the paper "Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations". CATT-Whisper leverages audio features alongside text to predict diacritics, making it particularly useful for applications where both text and speech are available.

Features

  • Speech-conditioned diacritization: Leverages both text and audio for more accurate predictions
  • Whisper integration: Uses pretrained Whisper encoders (base, small, etc.) for audio feature extraction
  • Encoder-Only architecture: Based on the efficient EO model from CATT
  • Batch inference support: Process multiple text-audio pairs efficiently
  • Spec augmentation: Training-time audio augmentation for improved robustness
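The repo's spec_augment.py (with sparse_image_warp.py for time warping) provides the full SpecAugment recipe. As a rough illustration of the masking half of the idea only, and not the repo's code or settings, frequency and time masking on a mel spectrogram look like this:

```python
import numpy as np

def mask_spectrogram(spec, num_freq_masks=2, freq_width=8,
                     num_time_masks=2, time_width=20, rng=None):
    """Zero out random mel bands and time spans, SpecAugment-style.

    spec: (n_mels, n_frames) array; returns a masked copy.
    Mask counts and widths here are illustrative defaults.
    """
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Frequency masks: zero a random band of mel channels.
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_mels - w)))
        out[f0:f0 + w, :] = 0.0
    # Time masks: zero a random span of frames.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        out[:, t0:t0 + w] = 0.0
    return out
```

The full SpecAugment recipe additionally applies time warping, which is what sparse_image_warp.py supports.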

Quick Start

Clone this repository and follow the instructions below:

git clone https://github.com/abjadai/catt-whisper
cd catt-whisper

File Structure

catt-whisper/
├── eo.py                   # Encoder-Only model architecture
├── eo_pl.py                # PyTorch Lightning wrapper for training
├── speech_encoder.py       # Whisper-based speech encoder
├── transformer.py          # Transformer building blocks
├── tashkeel_tokenizer.py   # Arabic tokenizer with diacritic support
├── tashkeel_dataset.py     # Dataset loader for text-audio pairs
├── spec_augment.py         # SpecAugment implementation
├── sparse_image_warp.py    # Utility for spec augmentation
├── bw2ar.py                # Buckwalter transliteration utilities
├── train_catt_whisper.py   # Training script
├── predict_catt_whisper.py # Inference script
└── test_catt_whisper.py    # Inference script for the test set
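Among the utilities, bw2ar.py converts Buckwalter transliteration to Arabic script. Buckwalter is a one-to-one ASCII encoding of Arabic, so the core of such a converter is a single lookup table; the sketch below uses a small illustrative subset of the standard scheme, not the repo's table:

```python
# Partial Buckwalter -> Arabic map: a few consonants plus the short
# vowels, shadda, and sukun (illustrative subset of the standard scheme).
BW2AR = {
    "A": "\u0627", "b": "\u0628", "t": "\u062A", "j": "\u062C",
    "H": "\u062D", "d": "\u062F", "r": "\u0631", "s": "\u0633",
    "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648",
    "y": "\u064A", "a": "\u064E", "u": "\u064F", "i": "\u0650",
    "~": "\u0651", "o": "\u0652",
}

def bw2ar(text: str) -> str:
    """Convert a Buckwalter string to Arabic script, character by character."""
    return "".join(BW2AR.get(ch, ch) for ch in text)

print(bw2ar("muHam~ad"))  # -> مُحَمَّد
```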

Download Pretrained Models

Download the pretrained CATT-Whisper model from the Releases section:

mkdir -p models/
wget -P models/ https://github.com/abjadai/catt-whisper/releases/download/v1/catt_whisper_base_model_v1_epoch_26_with_spec_augment.pt

Inference

Use the provided inference script to diacritize text conditioned on speech:

import torch
from eo_pl import TashkeelModel
from tashkeel_tokenizer import TashkeelTokenizer
from utils import remove_non_arabic
from whisper.audio import load_audio

# Initialize tokenizer and model
tokenizer = TashkeelTokenizer()
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = TashkeelModel(
    tokenizer,
    max_seq_len=1024,
    n_layers=6,
    learnable_pos_emb=False,
    speech_model_name='base'
)

# Load checkpoint
ckpt_path = 'models/catt_whisper_base_model_v1_epoch_26_with_spec_augment.pt'
model.load_state_dict(torch.load(ckpt_path, map_location=device))
model.eval().to(device)

# Prepare inputs
audio_files = ['path/to/audio.wav']
texts = ['المحتوى النصي داخل المقطع الصوتي']

audios = [load_audio(path) for path in audio_files]
texts = [remove_non_arabic(text) for text in texts]

# Predict diacritics
text_tashkeel = model.do_tashkeel_batch(texts, audios, batch_size=16, verbose=True)
print(text_tashkeel)

Or run the example script:

python predict_catt_whisper.py
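The remove_non_arabic call in the example above strips Latin characters, digits, and punctuation before tokenization. The repo's utils.py is authoritative, but a rough stand-in based on the basic Arabic Unicode block shows the idea:

```python
import re

# Keep runs of Arabic-block characters (U+0600..U+06FF); everything
# else becomes a single-space separator. A sketch, not the repo's code.
_NON_ARABIC = re.compile(r"[^\u0600-\u06FF]+")

def remove_non_arabic(text: str) -> str:
    return " ".join(_NON_ARABIC.split(text)).strip()

print(remove_non_arabic("Track 03 - مرحبا بالعالم!"))  # -> مرحبا بالعالم
```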

Training

Prepare Dataset

First, download the NADI 2025 Challenge Dataset from the MBZUAI/NADI-2025-Sub-task-3-all repository on HuggingFace: download all parquet files and place them in dataset/nadi/NADI-2025-Sub-task-3-all/.

You will also need to download the pretrained checkpoint of the CATT EO model:

mkdir -p models/
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt

Configure Training

Modify the scripts train_catt_whisper.py and eo_pl.py to adjust training parameters, then start training with:

python train_catt_whisper.py

Citation

If you use CATT-Whisper in your research, please cite both this work and the original CATT paper:

# CATT-Whisper citation
@inproceedings{ghannam2025abjad,
  title={Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations},
  author={Ghannam, Ahmad and Alharthi, Naif and Alasmary, Faris and Al Tabash, Kholood and Sadah, Shouq and Ghouti, Lahouari},
  booktitle={Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks},
  pages={757--761},
  year={2025}
}

# Original CATT paper
@inproceedings{alasmary2024catt,
  title={CATT: Character-based Arabic Tashkeel Transformer},
  author={Alasmary, Faris and Zaafarani, Orjuwan and Ghannam, Ahmad},
  booktitle={Proceedings of The Second Arabic Natural Language Processing Conference},
  pages={250--257},
  year={2024}
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
