32 changes: 15 additions & 17 deletions README.md
```python
for seg in result.segments:
print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
```

**~5.2% weighted DER** on VoxConverse dev. Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.

> Benchmarked on a single dataset ([VoxConverse](https://github.com/joonson/voxconverse)). Cross-dataset validation is [in progress](#roadmap).

| | diarize | pyannote community-1 | pyannote precision-2 |
|---|---|---|---|
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse dev) | **~5.2%** | ~11.2% | ~8.5% |
| CPU speed (RTF) | **0.12** | 0.86 | — |
| Install | `pip install diarize` | `pip install pyannote.audio` | `pip install pyannote.audio` |

DER = Diarization Error Rate (lower is better). RTF = Real-Time Factor (lower is faster).
pyannote numbers are self-reported from their [benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1). The diarize number is from the VoxConverse dev evaluation described in [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).
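For readers new to these metrics, a toy illustration of the arithmetic (not the benchmark code: a real DER evaluation also requires an optimal speaker mapping between reference and hypothesis, and the component durations below are invented):

```python
# Toy DER/RTF arithmetic (illustrative only; component durations are invented).
# DER = (false alarm + missed speech + speaker confusion) / total reference speech
def der(false_alarm: float, missed: float, confusion: float, total_speech: float) -> float:
    return (false_alarm + missed + confusion) / total_speech

# 31 s of error over 600 s of reference speech
print(f"DER = {der(10.0, 12.0, 9.0, 600.0):.1%}")  # DER = 5.2%

# RTF = processing time / audio duration; RTF 0.12 means a 60-minute file
# takes about 7.2 minutes to process.
print(f"RTF = {7.2 / 60.0:.2f}")  # RTF = 0.12
```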

## Quick Start

Four-stage pipeline, all CPU, all open-source:

1. **Silero VAD** (MIT) — detects speech segments
2. **WeSpeaker ResNet34-LM** (Apache 2.0) — extracts 256-dim speaker embeddings via ONNX
3. **GMM BIC + silhouette refinement** — estimates the number of speakers
4. **Spectral Clustering** (scikit-learn, BSD) + smoothing — assigns speaker labels
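A toy run of stages 3 and 4 on synthetic embeddings gives a feel for the flow (illustrative only: the real pipeline feeds WeSpeaker embeddings of VAD segments, and its count estimator adds the silhouette refinement described in the docs):

```python
# Toy demo of stages [3] and [4] on synthetic "speaker" embeddings.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16))  # three well-separated synthetic speakers
embs = np.vstack([rng.normal(c, 0.05, size=(40, 16)) for c in centers])

# [3] Choose the speaker count k that minimises the GMM BIC.
bic = {k: GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(embs).bic(embs)
       for k in range(1, 6)}
k = min(bic, key=bic.get)

# [4] Spectral clustering on a cosine-similarity affinity rescaled to [0, 1].
normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
affinity = (normed @ normed.T + 1.0) / 2.0
labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(k, len(set(labels)))
```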

Details: [How It Works](https://foxnosetech.github.io/diarize/how-it-works/)

Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216 files, 1--20 speakers per file).
| System | Weighted DER | Notes |
|--------|----------|-------|
| pyannote precision-2 | ~8.5% | Commercial license |
| **diarize** | **~5.2%** | **Apache 2.0, CPU-only, no API key** |
| pyannote community-1 | ~11.2% | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | MIT, needs HF token |

### Speaker Count Estimation

| Metric | Result |
|--------|--------|
| Files | 216 |
| Exact match | 117/216 (54%) |
| Within ±1 | 175/216 (81%) |

Many-speaker files remain the weak spot: automatic count estimation degrades above 7 speakers. Pass `num_speakers` when the count is known.
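The percentages above follow directly from the raw counts, which makes them easy to re-derive:

```python
# Re-deriving the reported rates from the raw counts in the table above.
files, exact, within_one = 216, 117, 175
print(f"exact: {exact / files:.0%}")        # exact: 54%
print(f"within +/-1: {within_one / files:.0%}")  # within +/-1: 81%
```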

Full benchmark results, speed comparison, and methodology: [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).

## When to use something else

- **You need commercial support or cross-dataset validation.** pyannote's commercial model has published production-oriented benchmarks beyond this single VoxConverse evaluation. If accuracy is the top priority and you have budget, compare on your own data.
- **You need very stable speaker labels in transcripts.** diarize can still show speaker fragmentation / label switching: one real speaker may be split across multiple `SPEAKER_XX` labels, or the label may briefly jump inside a continuous turn, especially on noisy real-world audio.
- **Your audio has 8+ speakers.** Automatic speaker count estimation degrades above 7 speakers. You can pass `num_speakers` explicitly, but test carefully.
- **You need overlapping speech detection.** diarize assigns each segment to one speaker. Overlapping speech is not modeled.
- **You need GPU-accelerated throughput.** diarize is CPU-only by design. For processing thousands of hours with GPU infrastructure, NeMo or pyannote on GPU will be faster.
50 changes: 26 additions & 24 deletions docs/benchmarks.md
Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216 files, 1--20 speakers per file).

## Speaker Count Estimation

| Metric | Result |
|--------|--------|
| Files | 216 |
| Exact match | 117/216 (54%) |
| Within +/-1 | 175/216 (81%) |

The automatic estimator is usually close, but exact counting remains the
main weak spot. Accuracy drops for many-speaker files --- see
[Limitations](#limitations) below.

## Diarization Error Rate (DER)
DER is the standard metric for speaker diarization.
| System | Weighted DER | Median DER | Notes |
|--------|----------|------------|-------|
| pyannote precision-2 | ~8.5% | -- | Commercial license |
| **diarize** | **~5.2%** | **~2.4%** | **Apache 2.0, CPU-only, no API key** |
| pyannote community-1 | ~11.2% | -- | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | -- | MIT, needs HF token |

pyannote DER numbers are self-reported from the
[pyannote benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1)
on VoxConverse v0.3.

!!! note "VoxConverse-only result"
On this VoxConverse dev evaluation, `diarize` reports lower weighted
DER than the published pyannote VoxConverse figures, while requiring
no HuggingFace token or account registration. Treat this as a
single-dataset benchmark and compare on your own audio when accuracy
is the top priority.

## CPU Speed (Real Time Factor)

Measured on VoxConverse dev files on Apple M2 Pro / M2 Max.
## Limitations

!!! warning "Speaker count > 7"
The GMM BIC speaker-count estimator with silhouette refinement is
usually close on VoxConverse dev, but many-speaker files remain the
hardest case. For **8 or more speakers** it can undercount and
produce higher DER.
If you know your audio has many speakers, pass ``num_speakers``
explicitly.


**Known limitations:**

- **Many speakers (8+):** Automatic speaker count estimation degrades.
Use ``num_speakers`` when the speaker count is known.
- **Speaker label switching / fragmentation:** On noisy real-world audio,
one actual speaker can be split across multiple ``SPEAKER_XX`` labels,
or the label can briefly jump inside a continuous turn. This is mostly
a clustering and embedding-assignment limitation, and it is visible in
transcripts even when aggregate DER looks acceptable.
- **Overlapping speech:** DER is computed with ``skip_overlap=True``.
The pipeline does not model overlapping speech --- when two people
talk simultaneously, only one is labelled.
28 changes: 21 additions & 7 deletions docs/how-it-works.md
Audio File
|
v
[1] Silero VAD -> speech segments
|
v
[2] WeSpeaker ResNet34-LM -> 256-dim speaker embeddings
|
v
[3] GMM BIC + silhouette -> Estimated speaker count (k)
|
v
[4] Spectral + smoothing -> Speaker timeline
|
v
DiarizeResult
See: [`extract_embeddings()`](api.md#diarize.embeddings.extract_embeddings)
## Stage 3: Speaker Count Estimation

Unless the user provides `num_speakers`, the pipeline estimates how many
speakers are present in three steps:

**Step 1 --- Cosine similarity pre-check.** Compute pairwise cosine
similarities of the L2-normalised embeddings. If the 10th percentile
**Step 2 --- GMM with the Bayesian Information Criterion (BIC)**: Gaussian
mixtures are fitted on PCA-reduced embeddings, and the candidate count with
the lowest BIC becomes the anchor.
The PCA=8 setting provides a good balance: stable estimation for 2--7
speakers while keeping computational cost low.

**Step 3 --- Silhouette refinement.** BIC is used as an anchor, then a
small neighbourhood around it is scored with silhouette over cosine
distance. The candidate range is clamped by `min_speakers`,
`max_speakers`, and the number of available embeddings. This catches
some BIC undercounts and overcounts without searching the full range.
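Putting the three steps together, a compact sketch (the 0.85 similarity threshold, the candidate range, and the +/-1 neighbourhood are illustrative assumptions, not the library's actual defaults):

```python
# Sketch of the three-step estimator; thresholds and ranges are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def estimate_k_sketch(embs, min_k=1, max_k=10):
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    # Step 1: if even dissimilar pairs are very similar, assume one speaker.
    sims = (normed @ normed.T)[np.triu_indices(len(embs), k=1)]
    if np.percentile(sims, 10) > 0.85:
        return 1
    # Step 2: BIC over GMMs fitted on PCA-reduced embeddings gives an anchor.
    reduced = PCA(n_components=8, random_state=0).fit_transform(normed)
    candidates = range(max(min_k, 1), min(max_k, len(embs) - 1) + 1)
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(reduced).bic(reduced)
           for k in candidates}
    anchor = min(bic, key=bic.get)
    # Step 3: re-score a small neighbourhood around the anchor with
    # silhouette over cosine distance; keep the best-scoring candidate.
    best_k, best_score = anchor, -np.inf
    for k in range(max(2, anchor - 1), min(max_k, anchor + 1) + 1):
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(reduced)
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(normed, labels, metric="cosine")
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```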

!!! warning
For **8 or more speakers** the estimator systematically undercounts.
For **8 or more speakers** the estimator can undercount.
Pass ``num_speakers`` explicitly when the speaker count is known.
See [Benchmarks --- Limitations](benchmarks.md#limitations).

See: [`estimate_speakers()`](api.md#diarize.clustering.estimate_speakers)
and [`cluster_auto()`](api.md#diarize.clustering.cluster_auto)

## Stage 4: Spectral Clustering

Given the speaker count from Stage 3, spectral clustering groups
the embedding vectors using cosine similarity as the affinity metric.
The cosine similarity matrix is rescaled to [0, 1] and passed to
scikit-learn's `SpectralClustering`.

The initial spectral labels are refined with spherical centroid
reassignment over L2-normalised embeddings. This preserves the selected
speaker count while reducing unstable one-window label flips.
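A minimal sketch of that refinement (illustrative, not the library's implementation):

```python
# Spherical centroid reassignment (sketch): move each embedding to its
# nearest class centroid on the unit sphere. The cluster count is preserved
# as long as every centroid keeps at least one member.
import numpy as np

def reassign_spherical(normed: np.ndarray, labels: np.ndarray) -> np.ndarray:
    classes = np.array(sorted(set(labels.tolist())))
    centroids = np.stack([normed[labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    # Cosine similarity of unit vectors = dot product; pick the closest centroid.
    return classes[np.argmax(normed @ centroids.T, axis=1)]
```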

For long VAD segments, overlapping embedding windows are decoded into
non-overlapping timeline intervals using window-center midpoints. A
3-window majority filter smooths local label noise. Adjacent intervals
assigned to the same speaker are merged, and short segments that were
skipped during embedding extraction are assigned the label of the nearest
speaker.
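The smoothing and merging steps at the end can be sketched as follows (illustrative helpers, not the actual diarize internals):

```python
# 3-window majority vote over per-window labels, then merging of adjacent
# same-speaker intervals (sketch of the post-processing described above).
from collections import Counter

def majority_smooth(labels, width=3):
    half = width // 2
    return [Counter(labels[max(0, i - half):i + half + 1]).most_common(1)[0][0]
            for i in range(len(labels))]

def merge_adjacent(intervals):
    # intervals: (start, end, speaker) tuples with touching boundaries
    merged = [list(intervals[0])]
    for start, end, speaker in intervals[1:]:
        if speaker == merged[-1][2] and start <= merged[-1][1]:
            merged[-1][1] = end  # extend the running same-speaker interval
        else:
            merged.append([start, end, speaker])
    return [tuple(m) for m in merged]

print(majority_smooth(["A", "A", "B", "A", "A", "B", "B"]))
# the lone mid-turn 'B' is voted away: ['A', 'A', 'A', 'A', 'A', 'B', 'B']
```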

See: [`cluster_spectral()`](api.md#diarize.clustering.cluster_spectral)

5 changes: 3 additions & 2 deletions docs/index.md
| | diarize | pyannote community-1 | pyannote precision-2 |
|---|---|---|---|
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse dev) | **~5.2%** | ~11.2% | ~8.5% |
| CPU speed (RTF) | **0.12** | 0.86 | --- |

DER and speed numbers for pyannote are from their
[benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1).
The diarize number is from the VoxConverse dev evaluation described in
[Benchmarks](benchmarks.md).

## Next Steps
