WFL-ASR is a configurable deep learning model designed for automatic phoneme segmentation using frame-level BIO tagging. It supports both Whisper and WavLM as audio encoders, and is structured for flexible and efficient training on phoneme-aligned datasets.
This model performs frame-level phoneme labeling using the BIO tag format (B-, I-, O).
- `.lab` files define phoneme segments in HTK format.
- Each segment is converted into BIO tags aligned to time frames based on `frame_duration` (hardcoded to 20 ms for Whisper compatibility).
- Tags are stored along with the audio path in a training JSON.
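As a rough sketch of the conversion step above, each `.lab` segment can be mapped to per-frame BIO tags at the 20 ms frame duration. The function name and the seconds-based segment tuples are illustrative assumptions, not the project's actual API:

```python
def lab_to_bio(segments, total_frames, frame_duration=0.02):
    """segments: list of (start_sec, end_sec, phoneme) parsed from a .lab file.
    Returns one BIO tag per frame of length `frame_duration` (20 ms here)."""
    tags = ["O"] * total_frames  # frames outside any segment stay O
    for start, end, phoneme in segments:
        first = int(round(start / frame_duration))
        last = min(int(round(end / frame_duration)), total_frames)
        for i in range(first, last):
            # First frame of a segment gets B-, the rest get I-
            tags[i] = ("B-" if i == first else "I-") + phoneme
    return tags
```

For example, a 60 ms "AH" followed by a 40 ms "T" yields `B-AH, I-AH, I-AH, B-T, I-T` over the first five frames.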
- Whisper or WavLM encoders process the audio waveform into frame-wise feature vectors.
- Whisper uses a fixed 20 ms frame stride.
- WavLM offers flexible windowing via HuBERT-style encoding.
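To make the frame geometry concrete: at a fixed 20 ms stride, the number of encoder frames follows directly from the sample count. This small helper is purely illustrative:

```python
def num_frames(num_samples, sample_rate, frame_duration=0.02):
    """Number of 20 ms frames an encoder with a fixed stride produces.
    E.g. one second of 16 kHz audio (16000 samples) -> 50 frames."""
    return num_samples // int(sample_rate * frame_duration)
```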
The encoded features go through a stack of optional, configurable layers:
- BiLSTM: sequential modeling (optional)
- Conformer blocks: long- and short-term feature modeling
- Dilated conv stack: local context enhancement (optional)
- Self-attention polisher: smoothing and refining predictions (optional, experimental, not recommended)
- A linear layer maps each time step to a BIO tag.
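A minimal PyTorch sketch of this configurable stack, under stated assumptions: the class name, flags, and layer sizes are illustrative, and the Conformer and polisher blocks are omitted for brevity:

```python
import torch
import torch.nn as nn

class BIOTagger(nn.Module):
    """Illustrative decoder stack: optional BiLSTM and dilated conv,
    followed by a per-frame linear classifier over BIO tags."""
    def __init__(self, feat_dim, num_tags, use_bilstm=True, use_conv=True):
        super().__init__()
        # BiLSTM with hidden size feat_dim//2 per direction keeps the width constant
        self.bilstm = nn.LSTM(feat_dim, feat_dim // 2,
                              bidirectional=True, batch_first=True) if use_bilstm else None
        # Dilated conv widens the local receptive field without downsampling
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        ) if use_conv else None
        self.classifier = nn.Linear(feat_dim, num_tags)  # one BIO tag per time step

    def forward(self, x):  # x: (batch, frames, feat_dim) from the encoder
        if self.bilstm is not None:
            x, _ = self.bilstm(x)
        if self.conv is not None:
            x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv expects (B, C, T)
        return self.classifier(x)  # (batch, frames, num_tags) logits
```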
- Predict BIO tags from audio.
- Optional smoothing (median filtering) and merging for better boundary clarity.
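One simple way to realize the smoothing step is a sliding median over the predicted tag indices, which removes isolated single-frame flips; the actual filter width and merging rules in the model may differ:

```python
def median_smooth(tag_ids, width=3):
    """Sliding-median filter over integer tag indices.
    A lone outlier frame surrounded by a consistent tag gets overwritten."""
    half = width // 2
    out = []
    for i in range(len(tag_ids)):
        window = sorted(tag_ids[max(0, i - half): i + half + 1])
        out.append(window[len(window) // 2])
    return out
```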
- Convert tags back to `.lab` segments.
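The inverse decoding step can be sketched as follows. The helper name and the seconds-based timestamps are assumptions (HTK `.lab` files commonly use 100 ns units, which the real exporter may emit instead):

```python
def bio_to_segments(tags, frame_duration=0.02):
    """Collapse per-frame BIO tags back into (start_sec, end_sec, phoneme) runs."""
    segments = []
    start, phoneme = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last run
        if tag.startswith("B-") or tag == "O" or (phoneme and tag != "I-" + phoneme):
            if phoneme is not None:
                segments.append((start, i * frame_duration, phoneme))
                phoneme = None
            if tag.startswith("B-"):
                start, phoneme = i * frame_duration, tag[2:]
        # an I- tag continuing the current phoneme just extends the run
    return segments
```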
- Whisper/WavLM encoder support
- Frame-level BIO tag training
- Configurable architecture (BiLSTM, Conformer, Conv, Attention)
- HTK-compatible `.lab` output format