This is the official implementation of the paper "CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations". CATT-Whisper leverages audio features alongside text to predict diacritics, making it particularly useful for applications where both text and speech are available.
- Speech-conditioned diacritization: Leverages both text and audio for more accurate predictions
- Whisper integration: Uses pretrained Whisper encoders (base, small, etc.) for audio feature extraction
- Encoder-Only architecture: Based on the efficient EO model from CATT
- Batch inference support: Process multiple text-audio pairs efficiently
- Spec augmentation: Training-time audio augmentation for improved robustness
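To illustrate the augmentation idea, here is a minimal sketch of SpecAugment-style frequency and time masking on a mel spectrogram. This is an illustrative toy, not the repo's `spec_augment.py` (which also implements time warping via `sparse_image_warp.py`):

```python
import numpy as np

def spec_augment(mel, num_freq_masks=2, freq_width=8,
                 num_time_masks=2, time_width=20, rng=None):
    """Zero out random frequency bands and time spans of a (n_mels, n_frames) spectrogram."""
    rng = rng if rng is not None else np.random.default_rng()
    mel = mel.copy()  # leave the caller's array untouched
    n_mels, n_frames = mel.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))          # mask width (may be 0)
        f0 = int(rng.integers(0, max(1, n_mels - w)))     # mask start
        mel[f0:f0 + w, :] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        mel[:, t0:t0 + w] = 0.0
    return mel
```

Masking is applied only during training; inference sees the unmodified spectrogram.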
Clone this repo and follow the instructions below:
git clone https://github.com/abjadai/catt-whisper
cd catt-whisper

catt-whisper/
├── eo.py # Encoder-Only model architecture
├── eo_pl.py # PyTorch Lightning wrapper for training
├── speech_encoder.py # Whisper-based speech encoder
├── transformer.py # Transformer building blocks
├── tashkeel_tokenizer.py # Arabic tokenizer with diacritic support
├── tashkeel_dataset.py # Dataset loader for text-audio pairs
├── spec_augment.py # SpecAugment implementation
├── sparse_image_warp.py # Utility for spec augmentation
├── bw2ar.py # Buckwalter transliteration utilities
├── train_catt_whisper.py # Training script
├── predict_catt_whisper.py # Inference script
└── test_catt_whisper.py # Inference script for the test set
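For context on `bw2ar.py`: Buckwalter transliteration encodes Arabic script in ASCII with a one-to-one character table, so conversion reduces to a character map. A minimal sketch with a small subset of the table (illustrative only, not the repo's full mapping):

```python
# Small subset of the Buckwalter-to-Arabic table (illustrative, not exhaustive).
BW2AR = {
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "k": "\u0643",  # kaf
    "a": "\u064E",  # fatha
    "i": "\u0650",  # kasra
    "u": "\u064F",  # damma
    "o": "\u0652",  # sukun
    "~": "\u0651",  # shadda
}

def bw2ar(text):
    """Convert a Buckwalter string to Arabic script, leaving unknown chars as-is."""
    return "".join(BW2AR.get(ch, ch) for ch in text)

print(bw2ar("kataba"))  # كَتَبَ
```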
Download the pretrained CATT-Whisper model from the Releases section:
mkdir models/
wget -P models/ https://github.com/abjadai/catt-whisper/releases/download/v1/catt_whisper_base_model_v1_epoch_26_with_spec_augment.pt

Use the provided inference script to diacritize text conditioned on speech:
import torch
from eo_pl import TashkeelModel
from tashkeel_tokenizer import TashkeelTokenizer
from utils import remove_non_arabic
from whisper.audio import load_audio
# Initialize tokenizer and model
tokenizer = TashkeelTokenizer()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = TashkeelModel(
tokenizer,
max_seq_len=1024,
n_layers=6,
learnable_pos_emb=False,
speech_model_name='base'
)
# Load checkpoint
ckpt_path = 'models/catt_whisper_base_model_v1_epoch_26_with_spec_augment.pt'
model.load_state_dict(torch.load(ckpt_path, map_location=device))
model.eval().to(device)
# Prepare inputs
audio_files = ['path/to/audio.wav']
texts = ['المحتوى النصي داخل المقطع الصوتي']
audios = [load_audio(f) for f in audio_files]
texts = [remove_non_arabic(i) for i in texts]
# Predict diacritics
text_tashkeel = model.do_tashkeel_batch(texts, audios, batch_size=16, verbose=True)
print(text_tashkeel)

Or run the example script:

python predict_catt_whisper.py

You first need to download the NADI 2025 Challenge dataset from HuggingFace: MBZUAI/NADI-2025-Sub-task-3-all. Download all of its parquet files and put them inside the folder dataset/nadi/NADI-2025-Sub-task-3-all/.
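Once downloaded, a quick sanity check that the parquet shards are where training expects them (a hypothetical helper, not part of this repo):

```python
from pathlib import Path

def list_parquet_shards(data_dir):
    """Return the sorted parquet files under data_dir; empty list if the folder is absent."""
    p = Path(data_dir)
    return sorted(p.glob("*.parquet")) if p.is_dir() else []

shards = list_parquet_shards("dataset/nadi/NADI-2025-Sub-task-3-all")
print(f"found {len(shards)} parquet shard(s)")
```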
You will also need to download the pretrained checkpoint of the CATT EO model as follows:
mkdir models/
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt

Modify the scripts train_catt_whisper.py and eo_pl.py to adjust the training parameters, then start training with:

python train_catt_whisper.py

If you use CATT-Whisper in your research, please cite both this work and the original CATT paper:
# CATT-Whisper citation
@inproceedings{ghannam2025abjad,
title={Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations},
author={Ghannam, Ahmad and Alharthi, Naif and Alasmary, Faris and Al Tabash, Kholood and Sadah, Shouq and Ghouti, Lahouari},
booktitle={Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks},
pages={757--761},
year={2025}
}
# Original CATT paper
@inproceedings{alasmary2024catt,
title={CATT: Character-based Arabic Tashkeel Transformer},
author={Alasmary, Faris and Zaafarani, Orjuwan and Ghannam, Ahmad},
booktitle={Proceedings of The Second Arabic Natural Language Processing Conference},
pages={250--257},
year={2024}
}

- Built upon CATT by Abjad AI
- Uses OpenAI Whisper for speech encoding
- Transformer implementation adapted from hyunwoongko/transformer
- Arabic text processing utilities from pyarabic
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.