This is the official implementation of the paper "CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations". CATT-Whisper leverages audio features alongside text to predict diacritics, making it particularly useful for applications where both text and speech are available.
- Speech-conditioned diacritization: Leverages both text and audio for more accurate predictions
- Whisper integration: Uses pretrained Whisper encoders (base, small, etc.) for audio feature extraction
- Encoder-Only architecture: Based on the efficient EO model from CATT
- Batch inference support: Process multiple text-audio pairs efficiently
- Spec augmentation: Training-time audio augmentation for improved robustness
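To illustrate the augmentation idea, here is a minimal sketch of SpecAugment-style frequency and time masking on a mel spectrogram. This is an illustrative toy, not the repo's `spec_augment.py` (which also implements time warping via `sparse_image_warp.py`):

```python
import numpy as np

def spec_augment(mel, num_freq_masks=2, freq_width=8,
                 num_time_masks=2, time_width=20, rng=None):
    """Zero out random frequency bands and time spans of a (n_mels, n_frames) spectrogram."""
    rng = rng if rng is not None else np.random.default_rng()
    mel = mel.copy()  # leave the caller's array untouched
    n_mels, n_frames = mel.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))          # mask width (may be 0)
        f0 = int(rng.integers(0, max(1, n_mels - w)))     # mask start
        mel[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        mel[:, t0:t0 + w] = 0.0
    return mel
```

Masking is applied only during training; inference sees the unmodified spectrogram.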
Clone this repo and follow the instructions below:
git clone https://github.com/abjadai/catt-whisper
cd catt-whisper

catt-whisper/
├── eo.py # Encoder-Only model architecture
├── eo_pl.py # PyTorch Lightning wrapper for training
├── speech_encoder.py # Whisper-based speech encoder
├── transformer.py # Transformer building blocks
├── tashkeel_tokenizer.py # Arabic tokenizer with diacritic support
├── tashkeel_dataset.py # Dataset loader for text-audio pairs
├── spec_augment.py # SpecAugment implementation
├── sparse_image_warp.py # Utility for spec augmentation
├── bw2ar.py # Buckwalter transliteration utilities
├── train_catt_whisper.py # Training script
├── predict_catt_whisper.py # Inference script
└── test_catt_whisper.py # Inference script for the test set
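For context on `bw2ar.py`: Buckwalter transliteration encodes Arabic script in ASCII with a one-to-one character table, so conversion reduces to a character map. A minimal sketch with a small subset of the table (illustrative only, not the repo's full mapping):

```python
# Small subset of the Buckwalter-to-Arabic table (illustrative, not exhaustive).
BW2AR = {
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "k": "\u0643",  # kaf
    "a": "\u064E",  # fatha
    "i": "\u0650",  # kasra
    "u": "\u064F",  # damma
    "o": "\u0652",  # sukun
    "~": "\u0651",  # shadda
}

def bw2ar(text):
    """Convert a Buckwalter string to Arabic script, leaving unknown chars as-is."""
    return "".join(BW2AR.get(ch, ch) for ch in text)

print(bw2ar("kataba"))  # كَتَبَ
```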
Download the pretrained CATT-Whisper model from the Releases section:
mkdir models/
wget -P models/ https://github.com/abjadai/catt-whisper/releases/download/v1/catt_whisper_base_model_v1_epoch_26_with_spec_augment.pt

Use the provided inference script to diacritize text conditioned on speech:
import torch
from eo_pl import TashkeelModel
from tashkeel_tokenizer import TashkeelTokenizer
from utils import remove_non_arabic
from whisper.audio import load_audio
# Initialize tokenizer and model
tokenizer = TashkeelTokenizer()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = TashkeelModel(
tokenizer,
max_seq_len=1024,
n_layers=6,
learnable_pos_emb=False,
speech_model_name='base'
)
# Load checkpoint
ckpt_path = 'models/catt_whisper_base_model_v1_epoch_26_with_spec_augment.pt'
model.load_state_dict(torch.load(ckpt_path, map_location=device))
model.eval().to(device)
# Prepare inputs
audio_files = ['path/to/audio.wav']
texts = ['المحتوى النصي داخل المقطع الصوتي']
audios = [load_audio(f) for f in audio_files]
texts = [remove_non_arabic(i) for i in texts]
# Predict diacritics
text_tashkeel = model.do_tashkeel_batch(texts, audios, batch_size=16, verbose=True)
print(text_tashkeel)

Or run the example script:

python predict_catt_whisper.py

You first need to download the NADI 2025 Challenge dataset from HuggingFace: MBZUAI/NADI-2025-Sub-task-3-all. Download all of its parquet files and put them inside the folder dataset/nadi/NADI-2025-Sub-task-3-all/.
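Once downloaded, a quick sanity check that the parquet shards are where training expects them (a hypothetical helper, not part of this repo):

```python
from pathlib import Path

def list_parquet_shards(data_dir):
    """Return the sorted parquet files under data_dir; empty list if the folder is absent."""
    p = Path(data_dir)
    return sorted(p.glob("*.parquet")) if p.is_dir() else []

shards = list_parquet_shards("dataset/nadi/NADI-2025-Sub-task-3-all")
print(f"found {len(shards)} parquet shard(s)")
```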
You will also need to download the pretrained checkpoint of the CATT EO model as follows:
mkdir models/
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt

Modify the scripts train_catt_whisper.py and eo_pl.py to adjust the training parameters, then start training with:

python train_catt_whisper.py

If you use CATT-Whisper in your research, please cite both this work and the original CATT paper:
# CATT-Whisper citation
@inproceedings{ghannam2025abjad,
title={Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations},
author={Ghannam, Ahmad and Alharthi, Naif and Alasmary, Faris and Al Tabash, Kholood and Sadah, Shouq and Ghouti, Lahouari},
booktitle={Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks},
pages={757--761},
year={2025}
}
# Original CATT paper
@inproceedings{alasmary2024catt,
title={CATT: Character-based Arabic Tashkeel Transformer},
author={Alasmary, Faris and Zaafarani, Orjuwan and Ghannam, Ahmad},
booktitle={Proceedings of The Second Arabic Natural Language Processing Conference},
pages={250--257},
year={2024}
}

- Built upon CATT by Abjad AI
- Uses OpenAI Whisper for speech encoding
- Transformer implementation adapted from hyunwoongko/transformer
- Arabic text processing utilities from pyarabic
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.