diff --git a/README.md b/README.md
index fd4e940..3dd6b26 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ for seg in result.segments:
     print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
 ```
 
-**~10.8% DER** on VoxConverse (lower than pyannote's free models). Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.
+**~5.2% weighted DER** on VoxConverse dev. Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.
 
 > Benchmarked on a single dataset ([VoxConverse](https://github.com/joonson/voxconverse)). Cross-dataset validation is [in progress](#roadmap).
 
@@ -35,12 +35,12 @@ for seg in result.segments:
 | GPU required | No | No (7x slower on CPU) | No |
 | HuggingFace account | No | Yes | Yes |
 | Auto speaker count | Yes | Yes | Yes |
-| DER (VoxConverse) | **~10.8%** | ~11.2% | ~8.5% |
+| DER (VoxConverse dev) | **~5.2%** | ~11.2% | ~8.5% |
 | CPU speed (RTF) | **0.12** | 0.86 | — |
 | Install | `pip install diarize` | `pip install pyannote.audio` | `pip install pyannote.audio` |
 
 DER = Diarization Error Rate (lower is better). RTF = Real-Time Factor (lower is faster).
-pyannote numbers are self-reported from their [benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1). Full methodology: [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).
+pyannote numbers are self-reported from their [benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1). The diarize number is from the VoxConverse dev evaluation described in [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).
 
 ## Quick Start
@@ -88,8 +88,8 @@ Four-stage pipeline, all CPU, all open-source:
 
 1. **Silero VAD** (MIT) — detects speech segments
 2. **WeSpeaker ResNet34-LM** (Apache 2.0) — extracts 256-dim speaker embeddings via ONNX
-3. **GMM BIC** — estimates the number of speakers
-4. **Spectral Clustering** (scikit-learn, BSD) — assigns speaker labels
+3. **GMM BIC + silhouette refinement** — estimates the number of speakers
+4. **Spectral Clustering** (scikit-learn, BSD) + smoothing — assigns speaker labels
 
 Details: [How It Works](https://foxnosetech.github.io/diarize/how-it-works/)
 
@@ -102,28 +102,26 @@ Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216
 
 | System | Weighted DER | Notes |
 |--------|----------|-------|
 | pyannote precision-2 | ~8.5% | Commercial license |
-| **diarize** | **~10.8%** | **Apache 2.0, CPU-only, no API key** |
+| **diarize** | **~5.2%** | **Apache 2.0, CPU-only, no API key** |
 | pyannote community-1 | ~11.2% | CC-BY-4.0, needs HF token |
 | pyannote 3.1 (legacy) | ~11.2% | MIT, needs HF token |
 
 ### Speaker Count Estimation
 
-| GT Speakers | Files | Exact Match | Within ±1 |
-|-------------|-------|-------------|-----------|
-| 1 | 22 | 91% | 95% |
-| 2 | 44 | 70% | 91% |
-| 3 | 35 | 69% | 97% |
-| 4 | 24 | 54% | 88% |
-| 5 | 31 | 32% | 87% |
-| 6–7 | 29 | 45% | 79% |
-| 8+ | 31 | 0% | 26% |
-| **Overall** | **216** | **51%** | **81%** |
+| Metric | Result |
+|--------|--------|
+| Files | 216 |
+| Exact match | 117/216 (54%) |
+| Within ±1 | 175/216 (81%) |
+
+Many-speaker files remain the weak spot: automatic count estimation degrades above 7 speakers. Pass `num_speakers` when the count is known.
 
 Full benchmark results, speed comparison, and methodology: [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).
 
 ## When to use something else
 
-- **You need <9% DER.** pyannote's commercial model (precision-2) achieves ~8.5%. If accuracy is the top priority and you have budget, use that.
+- **You need commercial support or cross-dataset validation.** pyannote's commercial model has published production-oriented benchmarks beyond this single VoxConverse evaluation. If accuracy is the top priority and you have budget, compare on your own data.
+- **You need very stable speaker labels in transcripts.** diarize can still show speaker fragmentation / label switching: one real speaker may be split across multiple `SPEAKER_XX` labels, or the label may briefly jump inside a continuous turn, especially on noisy real-world audio.
 - **Your audio has 8+ speakers.** Automatic speaker count estimation degrades above 7 speakers. You can pass `num_speakers` explicitly, but test carefully.
 - **You need overlapping speech detection.** diarize assigns each segment to one speaker. Overlapping speech is not modeled.
 - **You need GPU-accelerated throughput.** diarize is CPU-only by design. For processing thousands of hours with GPU infrastructure, NeMo or pyannote on GPU will be faster.
diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index 041bb4e..af6037a 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -5,19 +5,14 @@ dev set (216 files, 1--20 speakers per file).
 
 ## Speaker Count Estimation
 
-| GT Speakers | Files | Exact Match | Within +/-1 |
-|-------------|-------|-------------|-------------|
-| 1 | 22 | 91% | 95% |
-| 2 | 44 | 70% | 91% |
-| 3 | 35 | 69% | 97% |
-| 4 | 24 | 54% | 88% |
-| 5 | 31 | 32% | 87% |
-| 6--7 | 29 | 45% | 79% |
-| 8+ | 31 | 0% | 26% |
-| **Overall** | **216** | **51%** | **81%** |
-
-The algorithm works best for 1--4 speakers (88--97% within +/-1).
-Accuracy drops for 8 or more speakers --- see
+| Metric | Result |
+|--------|--------|
+| Files | 216 |
+| Exact match | 117/216 (54%) |
+| Within +/-1 | 175/216 (81%) |
+
+The automatic estimator is usually close, but exact counting remains the
+main weak spot. Accuracy drops for many-speaker files --- see
 [Limitations](#limitations) below.
 
 ## Diarization Error Rate (DER)
 
@@ -28,7 +23,7 @@ DER is the standard metric for speaker diarization, computed with
 
 | System | Weighted DER | Median DER | Notes |
 |--------|----------|------------|-------|
 | pyannote precision-2 | ~8.5% | -- | Commercial license |
-| **diarize** | **~10.8%** | **~3.7%** | **Apache 2.0, CPU-only, no API key** |
+| **diarize** | **~5.2%** | **~2.4%** | **Apache 2.0, CPU-only, no API key** |
 | pyannote community-1 | ~11.2% | -- | CC-BY-4.0, needs HF token |
 | pyannote 3.1 (legacy) | ~11.2% | -- | MIT, needs HF token |
 
@@ -36,10 +31,12 @@ pyannote DER numbers are self-reported from the
 [pyannote benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1)
 on VoxConverse v0.3.
 
-!!! note "Better than pyannote 3.1 on VoxConverse"
-    `diarize` achieves lower DER than both pyannote 3.1 (legacy) and
-    community-1 on VoxConverse, while requiring no HuggingFace token
-    or account registration.
+!!! note "VoxConverse-only result"
+    On this VoxConverse dev evaluation, `diarize` reports lower weighted
+    DER than the published pyannote VoxConverse figures, while requiring
+    no HuggingFace token or account registration. Treat this as a
+    single-dataset benchmark and compare on your own audio when accuracy
+    is the top priority.
 
 ## CPU Speed (Real Time Factor)
 
@@ -82,9 +79,10 @@ Measured on VoxConverse dev files on Apple M2 Pro / M2 Max
 
 ## Limitations
 
 !!! warning "Speaker count > 7"
-    The GMM BIC speaker-count estimator with silhouette refinement works
-    well for **1--5 speakers** and degrades gradually for 6--7. For
-    **8 or more speakers** it tends to undercount and produces higher DER.
+    The GMM BIC speaker-count estimator with silhouette refinement is
+    usually close on VoxConverse dev, but many-speaker files remain the
+    hardest case. For **8 or more speakers** it can undercount and
+    produce higher DER.
 
     If you know your audio has many speakers, pass ``num_speakers``
     explicitly:
 
@@ -94,9 +92,13 @@ Measured on VoxConverse dev files on Apple M2 Pro / M2 Max
 
 **Known limitations:**
 
-- **Many speakers (8+):** Automatic speaker count estimation degrades ---
-  GMM BIC with silhouette refinement reaches 26% within-one accuracy
-  for 8+ speakers. Use ``num_speakers`` when the speaker count is known.
+- **Many speakers (8+):** Automatic speaker count estimation degrades.
+  Use ``num_speakers`` when the speaker count is known.
+- **Speaker label switching / fragmentation:** On noisy real-world audio,
+  one actual speaker can be split across multiple ``SPEAKER_XX`` labels,
+  or the label can briefly jump inside a continuous turn. This is mostly
+  a clustering and embedding-assignment limitation, and it is visible in
+  transcripts even when aggregate DER looks acceptable.
 - **Overlapping speech:** DER is computed with ``skip_overlap=True``.
   The pipeline does not model overlapping speech --- when two people talk
   simultaneously, only one is labelled.
diff --git a/docs/how-it-works.md b/docs/how-it-works.md
index 3a4529c..ed859a5 100644
--- a/docs/how-it-works.md
+++ b/docs/how-it-works.md
@@ -15,10 +15,10 @@ Audio File
 [2] WeSpeaker ResNet34-LM -> 256-dim speaker embeddings
      |
      v
-[3] GMM BIC -------------> Estimated speaker count (k)
+[3] GMM BIC + silhouette -> Estimated speaker count (k)
      |
      v
-[4] Spectral Clustering --> Speaker labels
+[4] Spectral + smoothing -> Speaker timeline
      |
      v
 DiarizeResult
@@ -54,7 +54,7 @@ See: [`extract_embeddings()`](api.md#diarize.embeddings.extract_embeddings)
 
 ## Stage 3: Speaker Count Estimation
 
 Unless the user provides `num_speakers`, the pipeline estimates how many
-speakers are present in two steps:
+speakers are present in three steps:
 
 **Step 1 --- Cosine similarity pre-check.** Compute pairwise cosine
 similarities of the L2-normalised embeddings. If the 10th percentile
@@ -74,12 +74,19 @@ Criterion (BIC)**:
 
 The PCA=8 setting provides a good balance: stable estimation for 2--7
 speakers while keeping computational cost low.
 
+**Step 3 --- Silhouette refinement.** BIC is used as an anchor, then a
+small neighbourhood around it is scored with silhouette over cosine
+distance. The candidate range is clamped by `min_speakers`,
+`max_speakers`, and the number of available embeddings. This catches
+some BIC undercounts and overcounts without searching the full range.
+
 !!! warning
-    For **8 or more speakers** the estimator systematically undercounts.
+    For **8 or more speakers** the estimator can undercount.
     Pass ``num_speakers`` explicitly when the speaker count is known.
     See [Benchmarks --- Limitations](benchmarks.md#limitations).
 
 See: [`estimate_speakers()`](api.md#diarize.clustering.estimate_speakers)
+and [`cluster_auto()`](api.md#diarize.clustering.cluster_auto)
 
 ## Stage 4: Spectral Clustering
 
@@ -88,9 +95,16 @@ the embedding vectors using cosine similarity as the affinity metric.
 
 The cosine similarity matrix is rescaled to [0, 1] and passed to
 scikit-learn's `SpectralClustering`.
 
-Adjacent subsegments assigned to the same speaker are merged, and short
-segments that were skipped during embedding extraction are assigned the
-label of the nearest speaker.
+The initial spectral labels are refined with spherical centroid
+reassignment over L2-normalised embeddings. This preserves the selected
+speaker count while reducing unstable one-window label flips.
+
+For long VAD segments, overlapping embedding windows are decoded into
+non-overlapping timeline intervals using window-center midpoints. A
+3-window majority filter smooths local label noise. Adjacent intervals
+assigned to the same speaker are merged, and short segments that were
+skipped during embedding extraction are assigned the label of the nearest
+speaker.
 
 See: [`cluster_spectral()`](api.md#diarize.clustering.cluster_spectral)
diff --git a/docs/index.md b/docs/index.md
index 08664b2..4d30013 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -29,12 +29,13 @@ for seg in result.segments:
 | GPU required | No | No (7x slower on CPU) | No |
 | HuggingFace account | No | Yes | Yes |
 | Auto speaker count | Yes | Yes | Yes |
-| DER (VoxConverse) | **~10.8%** | ~11.2% | ~8.5% |
+| DER (VoxConverse dev) | **~5.2%** | ~11.2% | ~8.5% |
 | CPU speed (RTF) | **0.12** | 0.86 | --- |
 
 DER and speed numbers for pyannote are from their
 [benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1).
-Full methodology: [Benchmarks](benchmarks.md).
+The diarize number is from the VoxConverse dev evaluation described in
+[Benchmarks](benchmarks.md).
 
 ## Next Steps
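For review context, the "Step 3 --- Silhouette refinement" added in the docs/how-it-works.md hunk can be sketched roughly as follows. This is an illustrative sketch, not the diarize implementation: `refine_count` is a hypothetical name, and `KMeans` stands in for whatever candidate clustering the real pipeline scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def refine_count(embeddings, bic_k, min_speakers=1, max_speakers=20, radius=2):
    """Score a small neighbourhood around the BIC anchor with silhouette
    over cosine distance and return the best-scoring candidate count."""
    n = len(embeddings)
    # clamp the candidate range by min/max speakers and available embeddings
    lo = max(min_speakers, bic_k - radius, 2)
    hi = min(max_speakers, bic_k + radius, n - 1)
    if hi < lo:
        return bic_k
    best_k, best_score = bic_k, -np.inf
    for k in range(lo, hi + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

Because only the neighbourhood around the BIC anchor is scored, the search stays cheap while still catching small undercounts and overcounts, which matches the prose in the hunk.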
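Similarly, the Stage 4 decoding added above (window-center midpoints plus a 3-window majority filter) can be sketched like this. Both helper names are hypothetical and the real diarize internals may differ; this only illustrates the technique the docs describe.

```python
import numpy as np

def majority_smooth(labels, width=3):
    """Majority-vote filter over a sliding window of per-window speaker labels."""
    labels = np.asarray(labels)
    out = labels.copy()
    half = width // 2
    for i in range(len(labels)):
        window = labels[max(0, i - half):i + half + 1]
        vals, counts = np.unique(window, return_counts=True)
        out[i] = vals[np.argmax(counts)]  # ties resolve to the smallest label
    return out

def windows_to_intervals(starts, ends, labels):
    """Decode overlapping embedding windows into non-overlapping timeline
    intervals using window-center midpoints, merging adjacent same-speaker
    intervals."""
    centers = [(s + e) / 2 for s, e in zip(starts, ends)]
    # boundary between consecutive windows = midpoint of their centers
    bounds = [starts[0]] + [(a + b) / 2 for a, b in zip(centers, centers[1:])] + [ends[-1]]
    intervals = []
    for i, lab in enumerate(labels):
        if intervals and intervals[-1][2] == lab:
            # extend the previous interval instead of opening a new one
            intervals[-1] = (intervals[-1][0], bounds[i + 1], lab)
        else:
            intervals.append((bounds[i], bounds[i + 1], lab))
    return intervals
```

Smoothing the labels first, then decoding to intervals, removes single-window label flips before they can fragment a continuous turn in the output timeline.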