
Forced Alignment Benchmark

Benchmarking forced alignment models by measuring how well their word boundaries allow cropped audio to be correctly re-transcribed.

Approach

Standard forced alignment evaluation requires ground-truth word timestamps, which are expensive to obtain. This benchmark uses an implicit evaluation via Word Error Rate (WER):

  1. Run a forced aligner on (audio, transcript) pairs to get word-level timestamps
  2. Crop audio at the aligned word boundaries
  3. Transcribe each crop with an independent ASR model (Whisper large-v3-turbo)
  4. Compute WER between the crop transcription and the expected text

Better alignment = tighter crops = lower WER. If word boundaries are accurate, each crop contains exactly the target word(s) and Whisper transcribes them correctly. Misaligned boundaries cut words or include neighboring audio, causing transcription errors.
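The sketch below illustrates this loop end to end. It is illustrative only: the (text, start, end) tuple format for aligned words and the helper name mean_crop_wer are assumptions, and the repo's real pipeline lives in bench_alignment.py.

# Minimal sketch of the crop -> transcribe -> WER loop (illustrative names).
import soundfile as sf
from jiwer import wer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")

def mean_crop_wer(audio_path, aligned_words):
    """aligned_words: list of (text, start_sec, end_sec) produced by a forced aligner."""
    audio, sr = sf.read(audio_path)
    if audio.ndim > 1:                       # downmix to mono if needed
        audio = audio.mean(axis=1)
    scores = []
    for text, start, end in aligned_words:
        crop = audio[int(start * sr):int(end * sr)]               # 2. crop at the boundaries
        hyp = asr({"raw": crop, "sampling_rate": sr})["text"]     # 3. re-transcribe the crop
        scores.append(wer(text.lower(), hyp.lower().strip()))     # 4. WER vs. expected text
    return sum(scores) / len(scores)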

To isolate alignment quality from each aligner's own ASR quality, results are filtered to utterances where all aligners reconstructed the transcript perfectly (strict mode). Only inner words (slice [1:-1]) are used, to avoid edge effects at utterance boundaries.
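A rough sketch of that filtering step, assuming a per-(aligner, utterance) predicate computed elsewhere (the exact reconstruction check is part of bench_alignment.py):

# Illustrative only: the strict cohort is the intersection of utterances every
# aligner handled perfectly, and inner_words drops the first and last word.
def strict_cohort(utterance_ids, aligners, reconstructed_perfectly):
    # reconstructed_perfectly(aligner, utt_id) -> bool is an assumed helper
    return [u for u in utterance_ids
            if all(reconstructed_perfectly(a, u) for a in aligners)]

def inner_words(words):
    return words[1:-1]   # slice [1:-1]: skip utterance-initial and -final words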

Aligners

| Aligner  | Type                       | Model                              |
|----------|----------------------------|------------------------------------|
| Qwen3    | Transformer forced aligner | Qwen/Qwen3-ForcedAligner-0.6B      |
| Seamless | NAR T2U + unit extraction  | Meta Seamless UnitY2               |
| WhisperX | CTC (wav2vec2)             | Language-specific wav2vec2 models  |

Results

Dataset: FLEURS test split, strict filtering (intersection of utterances with perfect transcription across all aligners), inner words only ([1:-1]).

Mean WER by language

| Aligner  | EN        | ES        | FR        | RU        |
|----------|-----------|-----------|-----------|-----------|
| Seamless | **0.067** | **0.030** | **0.102** | **0.055** |
| Qwen3    | 0.072     | 0.039     | 0.104     | 0.059     |
| WhisperX | 0.082     | 0.065     | 0.155     | 0.133     |

Lower is better. Bold = best per language.

Usage

Setup

pip install -e .
# or: uv pip install -e .

Download data

bash get_data.sh

Run alignments

# Single aligner
python extract_alignment.py cache/data/fleurs/en/test/meta.csv \
    -a qwen3 -l en -g 0

# All aligners, all languages
bash run_aligners.sh

Run benchmark

# Single language
python bench_alignment.py cache/data/fleurs/en/test/ \
    -s "1,-1" -l en --batch-size 64 --filter-perfect --strict

# All languages
bash run_transcribe.sh

Repository structure

bench_alignment.py      # Benchmark: crop -> transcribe -> WER
extract_alignment.py    # Run aligners on dataset, multi-GPU
get_data.sh             # Download FLEURS dataset
run_aligners.sh         # Batch alignment extraction
run_transcribe.sh       # Batch benchmarking
aligners/               # Aligner wrappers (unified interface)
  qwen3/                # Qwen3-ForcedAligner
  whisperx/             # Patched WhisperX
  seamless/             # Seamless UnitY2 (bundled deps, MIT-only subset)
dataprep/               # Dataset download & preparation

Methodology limitations

A few caveats worth knowing before reading the table above:

  • Sample sizes are small. After strict filtering (only utterances where every aligner reconstructed the transcript perfectly), the per-language cohorts shrink to 251–556 utterances. Differences between systems within ~1 percentage point are well inside the noise floor; the standard deviations across utterances are 5–10 points wide. We do not publish significance tests — treat the table as a coarse ranking, not a precise leaderboard.
  • Single judge ASR. All crops are re-transcribed by a single model (openai/whisper-large-v3-turbo). A different judge model could reorder the table, especially in the gap between Qwen3 and Seamless. The benchmark measures how cleanly the crop is recognized by Whisper, which is a useful proxy for alignment tightness but not the same thing as ground-truth boundary error.
  • FLEURS-only, four languages. FLEURS is clean studio-style read speech in English, Spanish, French, and Russian. Numbers here do not predict aligner behavior on conversational, noisy, accented, or out-of-domain audio, which is where most production use cases live.
  • The "strict" filter biases the cohort. Intersecting the perfect-reconstruction sets across all aligners discards every utterance any one aligner stumbled on, so the surviving cohort is the easy cases for the worst aligner. Aligners that fail on more utterances effectively get judged on a smaller, easier set. We chose this to make the comparison apples-to-apples on a fixed audio set, but it deflates differences on hard audio.
  • Inner words only. The reported numbers use slice [1:-1] to remove the first and last word of each utterance, where edge effects (model warmup, trailing silence) hurt all aligners. This makes the numbers prettier but ignores a real failure mode.
  • No latency / cost / footprint axis. The table compares accuracy only. The three aligners have very different compute profiles (Qwen3-0.6B vs the full UnitY2 stack vs language-specific wav2vec2). Pick based on your deployment constraints, not just the WER.

If you want to push these results further, the most informative additions would be: (a) confidence intervals via bootstrap, (b) a second judge ASR, and (c) at least one conversational dataset.
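As a starting point for (a), a bootstrap over per-utterance mean WERs takes only a few lines. This is not part of the repo; function and parameter names are illustrative.

import numpy as np

def bootstrap_ci(per_utterance_wer, n_resamples=10_000, alpha=0.05, seed=0):
    # resample utterances with replacement and report a percentile interval
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_utterance_wer, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_resamples)])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])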

Related work

This benchmark is complementary to — not a replacement for — the existing forced alignment evaluation literature. The closest reference points:

  • Aligner-SUPERB (lifeiteng, 2024). Benchmarks MFA, NeMo Forced Aligner, ctc-forced-aligner, WhisperX, and the Lhotse aligners using ground-truth Word/Utterance Boundary Error on TIMIT. Different metric (boundary error vs. our crop WER), different aligner set (no Seamless or Qwen3), English-only.
  • "Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment" (Interspeech 2024). Compares MFA, MMS, and WhisperX against ground-truth timestamps on TIMIT and Buckeye.
  • Qwen3-ASR Technical Report (2026). Introduces Qwen3-ForcedAligner and benchmarks it against WhisperX, NeMo Forced Aligner, and Monotonic-Aligner using Accumulated Average Shift (AAS) on MFA-labeled audio.
  • "Whisper Has an Internal Word Aligner" (2025). Extracts word boundaries from Whisper attention maps and compares against MFA / WhisperX / CrisperWhisper on ground-truth aligned data.
  • "Word Level Timestamp Generation for ASR and Translation" (Interspeech 2025). Trains models to predict word timestamps directly, using NeMo Forced Aligner as a teacher; evaluates on FLEURS for translation timestamping.

What this repo adds on top of all of the above:

  1. No ground-truth timestamps required. Every benchmark above measures alignment quality against hand-labeled or semi-automatic word boundaries (TIMIT, Buckeye, MFA-labeled). That ties them to a small number of languages and domains. The crop-and-re-transcribe approach works on any (audio, transcript) corpus in any language for which you have a usable judge ASR.
  2. A different aligner combination. Qwen3-ForcedAligner, Seamless UnitY2, and WhisperX have not been head-to-head before in the public literature we could find — Qwen3's own report excludes Seamless, and the academic benchmarks predate Qwen3.
  3. Code, not just numbers. The aligner wrappers expose a uniform (audio, transcript, language) -> word-level timestamps interface, so adding a new aligner to the comparison is a single Python class.
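The exact base class and method names in aligners/ may differ; the sketch below only shows the shape of the (audio, transcript, language) -> word-level timestamps interface described above.

from dataclasses import dataclass

@dataclass
class WordSpan:
    text: str
    start: float  # seconds
    end: float    # seconds

class MyNewAligner:
    """Hypothetical wrapper: one method mapping (audio, transcript, language) to word spans."""
    def align(self, audio_path: str, transcript: str, language: str) -> list[WordSpan]:
        # load the audio, run the underlying model, return one WordSpan per word
        raise NotImplementedError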

Adjacent ideas worth knowing about: data-curation pipelines such as Transcribe, Align and Segment use per-segment WER/CER as a filter to discard noisy training data. That is the same intuition (a good alignment yields a low re-transcription error), applied as a quality gate rather than as a comparative benchmark metric.
