WFL-ASR is a configurable deep learning model designed for automatic phoneme segmentation using frame-level BIO tagging. It supports both Whisper and WavLM as audio encoders, and is structured for flexible and efficient training on phoneme-aligned datasets.
This model performs frame-level phoneme labeling using the BIO tag format (B-, I-, O).
- `.lab` files define phoneme segments in HTK format.
- Each segment is converted into BIO tags aligned to time frames based on `frame_duration` (hardcoded to 20 ms for Whisper compatibility).
- Tags are stored along with the audio path in a training JSON.
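As a rough sketch of the conversion step above, each `.lab` segment can be mapped to per-frame BIO tags at the 20 ms frame duration. The function name and the seconds-based segment tuples are illustrative assumptions, not the project's actual API:

```python
def lab_to_bio(segments, total_frames, frame_duration=0.02):
    """segments: list of (start_sec, end_sec, phoneme) parsed from a .lab file.
    Returns one BIO tag per frame of length `frame_duration` (20 ms here)."""
    tags = ["O"] * total_frames  # frames outside any segment stay O
    for start, end, phoneme in segments:
        first = int(round(start / frame_duration))
        last = min(int(round(end / frame_duration)), total_frames)
        for i in range(first, last):
            # First frame of a segment gets B-, the rest get I-
            tags[i] = ("B-" if i == first else "I-") + phoneme
    return tags
```

For example, a 60 ms "AH" followed by a 40 ms "T" yields `B-AH, I-AH, I-AH, B-T, I-T` over the first five frames.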
- Whisper or WavLM encoders process the audio waveform into frame-wise feature vectors.
- Whisper uses a fixed 20 ms frame stride.
- WavLM offers flexible windowing via HuBERT-style encoding.
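To make the frame geometry concrete: at a fixed 20 ms stride, the number of encoder frames follows directly from the sample count. This small helper is purely illustrative:

```python
def num_frames(num_samples, sample_rate, frame_duration=0.02):
    """Number of 20 ms frames an encoder with a fixed stride produces.
    E.g. one second of 16 kHz audio (16000 samples) -> 50 frames."""
    return num_samples // int(sample_rate * frame_duration)
```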
The encoded features go through a stack of optional, configurable layers:
- BiLSTM: sequential modeling (optional)
- Conformer blocks: long- and short-term feature modeling
- Dilated conv stack: local context enhancement (optional)
- Self-attention polisher: smoothing and refining predictions (optional, experimental, not recommended)
- A linear layer maps each time step to a BIO tag.
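A minimal PyTorch sketch of this configurable stack, under stated assumptions: the class name, flags, and layer sizes are illustrative, and the Conformer and polisher blocks are omitted for brevity:

```python
import torch
import torch.nn as nn

class BIOTagger(nn.Module):
    """Illustrative decoder stack: optional BiLSTM and dilated conv,
    followed by a per-frame linear classifier over BIO tags."""
    def __init__(self, feat_dim, num_tags, use_bilstm=True, use_conv=True):
        super().__init__()
        # BiLSTM with hidden size feat_dim//2 per direction keeps the width constant
        self.bilstm = nn.LSTM(feat_dim, feat_dim // 2,
                              bidirectional=True, batch_first=True) if use_bilstm else None
        # Dilated conv widens the local receptive field without downsampling
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        ) if use_conv else None
        self.classifier = nn.Linear(feat_dim, num_tags)  # one BIO tag per time step

    def forward(self, x):  # x: (batch, frames, feat_dim) from the encoder
        if self.bilstm is not None:
            x, _ = self.bilstm(x)
        if self.conv is not None:
            x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv expects (B, C, T)
        return self.classifier(x)  # (batch, frames, num_tags) logits
```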
- Predict BIO tags from audio.
- Optional smoothing (median filtering) and merging for better boundary clarity.
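One simple way to realize the smoothing step is a sliding median over the predicted tag indices, which removes isolated single-frame flips; the actual filter width and merging rules in the model may differ:

```python
def median_smooth(tag_ids, width=3):
    """Sliding-median filter over integer tag indices.
    A lone outlier frame surrounded by a consistent tag gets overwritten."""
    half = width // 2
    out = []
    for i in range(len(tag_ids)):
        window = sorted(tag_ids[max(0, i - half): i + half + 1])
        out.append(window[len(window) // 2])
    return out
```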
- Convert tags back to `.lab` segments.
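The inverse decoding step can be sketched as follows. The helper name and the seconds-based timestamps are assumptions (HTK `.lab` files commonly use 100 ns units, which the real exporter may emit instead):

```python
def bio_to_segments(tags, frame_duration=0.02):
    """Collapse per-frame BIO tags back into (start_sec, end_sec, phoneme) runs."""
    segments = []
    start, phoneme = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last run
        if tag.startswith("B-") or tag == "O" or (phoneme and tag != "I-" + phoneme):
            if phoneme is not None:
                segments.append((start, i * frame_duration, phoneme))
                phoneme = None
            if tag.startswith("B-"):
                start, phoneme = i * frame_duration, tag[2:]
        # an I- tag continuing the current phoneme just extends the run
    return segments
```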
- Whisper/WavLM encoder support
- Frame-level BIO tag training
- Configurable architecture (BiLSTM, Conformer, Conv, Attention)
- HTK-compatible `.lab` output format