32 changes: 15 additions & 17 deletions README.md
```python
for seg in result.segments:
print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
```

**~5.2% weighted DER** on VoxConverse dev. Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.

> Benchmarked on a single dataset ([VoxConverse](https://github.com/joonson/voxconverse)). Cross-dataset validation is [in progress](#roadmap).

| | diarize | pyannote community-1 | pyannote precision-2 |
|---|---|---|---|
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse dev) | **~5.2%** | ~11.2% | ~8.5% |
| CPU speed (RTF) | **0.12** | 0.86 | — |
| Install | `pip install diarize` | `pip install pyannote.audio` | `pip install pyannote.audio` |

DER = Diarization Error Rate (lower is better). RTF = Real-Time Factor (lower is faster).
pyannote numbers are self-reported from their [benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1). The diarize number is from the VoxConverse dev evaluation described in [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).
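For readers new to these metrics, a toy illustration of the arithmetic (not the benchmark code: a real DER evaluation also requires an optimal speaker mapping between reference and hypothesis, and the component durations below are invented):

```python
# Toy DER/RTF arithmetic (illustrative only; component durations are invented).
# DER = (false alarm + missed speech + speaker confusion) / total reference speech
def der(false_alarm: float, missed: float, confusion: float, total_speech: float) -> float:
    return (false_alarm + missed + confusion) / total_speech

# 31 s of error over 600 s of reference speech
print(f"DER = {der(10.0, 12.0, 9.0, 600.0):.1%}")  # DER = 5.2%

# RTF = processing time / audio duration; RTF 0.12 means a 60-minute file
# takes about 7.2 minutes to process.
print(f"RTF = {7.2 / 60.0:.2f}")  # RTF = 0.12
```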

## Quick Start

Four-stage pipeline, all CPU, all open-source:

1. **Silero VAD** (MIT) — detects speech segments
2. **WeSpeaker ResNet34-LM** (Apache 2.0) — extracts 256-dim speaker embeddings via ONNX
3. **GMM BIC + silhouette refinement** — estimates the number of speakers
4. **Spectral Clustering** (scikit-learn, BSD) + smoothing — assigns speaker labels
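A toy run of stages 3 and 4 on synthetic embeddings gives a feel for the flow (illustrative only: the real pipeline feeds WeSpeaker embeddings of VAD segments, and its count estimator adds the silhouette refinement described in the docs):

```python
# Toy demo of stages [3] and [4] on synthetic "speaker" embeddings.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16))  # three well-separated synthetic speakers
embs = np.vstack([rng.normal(c, 0.05, size=(40, 16)) for c in centers])

# [3] Choose the speaker count k that minimises the GMM BIC.
bic = {k: GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(embs).bic(embs)
       for k in range(1, 6)}
k = min(bic, key=bic.get)

# [4] Spectral clustering on a cosine-similarity affinity rescaled to [0, 1].
normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
affinity = (normed @ normed.T + 1.0) / 2.0
labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(k, len(set(labels)))
```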

Details: [How It Works](https://foxnosetech.github.io/diarize/how-it-works/)

Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216 files, 1--20 speakers per file).
| System | Weighted DER | Notes |
|--------|----------|-------|
| pyannote precision-2 | ~8.5% | Commercial license |
| **diarize** | **~5.2%** | **Apache 2.0, CPU-only, no API key** |
| pyannote community-1 | ~11.2% | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | MIT, needs HF token |

### Speaker Count Estimation

| Metric | Result |
|--------|--------|
| Files | 216 |
| Exact match | 117/216 (54%) |
| Within ±1 | 175/216 (81%) |

Many-speaker files remain the weak spot: automatic count estimation degrades above 7 speakers. Pass `num_speakers` when the count is known.
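The percentages above follow directly from the raw counts, which makes them easy to re-derive:

```python
# Re-deriving the reported rates from the raw counts in the table above.
files, exact, within_one = 216, 117, 175
print(f"exact: {exact / files:.0%}")        # exact: 54%
print(f"within +/-1: {within_one / files:.0%}")  # within +/-1: 81%
```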

Full benchmark results, speed comparison, and methodology: [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).

## When to use something else

- **You need commercial support or cross-dataset validation.** pyannote's commercial model has published production-oriented benchmarks beyond this single VoxConverse evaluation. If accuracy is the top priority and you have budget, compare on your own data.
- **You need very stable speaker labels in transcripts.** diarize can still show speaker fragmentation / label switching: one real speaker may be split across multiple `SPEAKER_XX` labels, or the label may briefly jump inside a continuous turn, especially on noisy real-world audio.
- **Your audio has 8+ speakers.** Automatic speaker count estimation degrades above 7 speakers. You can pass `num_speakers` explicitly, but test carefully.
- **You need overlapping speech detection.** diarize assigns each segment to one speaker. Overlapping speech is not modeled.
- **You need GPU-accelerated throughput.** diarize is CPU-only by design. For processing thousands of hours with GPU infrastructure, NeMo or pyannote on GPU will be faster.
50 changes: 26 additions & 24 deletions docs/benchmarks.md
Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216 files, 1--20 speakers per file).

## Speaker Count Estimation

| Metric | Result |
|--------|--------|
| Files | 216 |
| Exact match | 117/216 (54%) |
| Within +/-1 | 175/216 (81%) |

The automatic estimator is usually close, but exact counting remains the
main weak spot. Accuracy drops for many-speaker files --- see
[Limitations](#limitations) below.

## Diarization Error Rate (DER)
DER is the standard metric for speaker diarization.
| System | Weighted DER | Median DER | Notes |
|--------|----------|------------|-------|
| pyannote precision-2 | ~8.5% | -- | Commercial license |
| **diarize** | **~5.2%** | **~2.4%** | **Apache 2.0, CPU-only, no API key** |
| pyannote community-1 | ~11.2% | -- | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | -- | MIT, needs HF token |

pyannote DER numbers are self-reported from the
[pyannote benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1)
on VoxConverse v0.3.

!!! note "VoxConverse-only result"
On this VoxConverse dev evaluation, `diarize` reports lower weighted
DER than the published pyannote VoxConverse figures, while requiring
no HuggingFace token or account registration. Treat this as a
single-dataset benchmark and compare on your own audio when accuracy
is the top priority.

## CPU Speed (Real Time Factor)

Measured on VoxConverse dev files on Apple M2 Pro / M2 Max.
## Limitations

!!! warning "Speaker count > 7"
The GMM BIC speaker-count estimator with silhouette refinement is
usually close on VoxConverse dev, but many-speaker files remain the
hardest case. For **8 or more speakers** it can undercount and
produce higher DER.
If you know your audio has many speakers, pass ``num_speakers``
explicitly.


**Known limitations:**

- **Many speakers (8+):** Automatic speaker count estimation degrades.
Use ``num_speakers`` when the speaker count is known.
- **Speaker label switching / fragmentation:** On noisy real-world audio,
one actual speaker can be split across multiple ``SPEAKER_XX`` labels,
or the label can briefly jump inside a continuous turn. This is mostly
a clustering and embedding-assignment limitation, and it is visible in
transcripts even when aggregate DER looks acceptable.
- **Overlapping speech:** DER is computed with ``skip_overlap=True``.
The pipeline does not model overlapping speech --- when two people
talk simultaneously, only one is labelled.
28 changes: 21 additions & 7 deletions docs/how-it-works.md
Audio File
|
v
[1] Silero VAD -> speech segments
|
v
[2] WeSpeaker ResNet34-LM -> 256-dim speaker embeddings
|
v
[3] GMM BIC + silhouette -> Estimated speaker count (k)
|
v
[4] Spectral + smoothing -> Speaker timeline
|
v
DiarizeResult
See: [`extract_embeddings()`](api.md#diarize.embeddings.extract_embeddings)
## Stage 3: Speaker Count Estimation

Unless the user provides `num_speakers`, the pipeline estimates how many
speakers are present in three steps:

**Step 1 --- Cosine similarity pre-check.** Compute pairwise cosine
similarities of the L2-normalised embeddings. If the 10th percentile
**Step 2 --- GMM with the Bayesian Information Criterion (BIC)**: Gaussian
mixtures are fitted on PCA-reduced embeddings, and the candidate count with
the lowest BIC becomes the anchor.
The PCA=8 setting provides a good balance: stable estimation for 2--7
speakers while keeping computational cost low.

**Step 3 --- Silhouette refinement.** BIC is used as an anchor, then a
small neighbourhood around it is scored with silhouette over cosine
distance. The candidate range is clamped by `min_speakers`,
`max_speakers`, and the number of available embeddings. This catches
some BIC undercounts and overcounts without searching the full range.
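Putting the three steps together, a compact sketch (the 0.85 similarity threshold, the candidate range, and the +/-1 neighbourhood are illustrative assumptions, not the library's actual defaults):

```python
# Sketch of the three-step estimator; thresholds and ranges are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def estimate_k_sketch(embs, min_k=1, max_k=10):
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    # Step 1: if even dissimilar pairs are very similar, assume one speaker.
    sims = (normed @ normed.T)[np.triu_indices(len(embs), k=1)]
    if np.percentile(sims, 10) > 0.85:
        return 1
    # Step 2: BIC over GMMs fitted on PCA-reduced embeddings gives an anchor.
    reduced = PCA(n_components=8, random_state=0).fit_transform(normed)
    candidates = range(max(min_k, 1), min(max_k, len(embs) - 1) + 1)
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(reduced).bic(reduced)
           for k in candidates}
    anchor = min(bic, key=bic.get)
    # Step 3: re-score a small neighbourhood around the anchor with
    # silhouette over cosine distance; keep the best-scoring candidate.
    best_k, best_score = anchor, -np.inf
    for k in range(max(2, anchor - 1), min(max_k, anchor + 1) + 1):
        labels = GaussianMixture(n_components=k, random_state=0).fit_predict(reduced)
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(normed, labels, metric="cosine")
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```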

!!! warning
For **8 or more speakers** the estimator systematically undercounts.
For **8 or more speakers** the estimator can undercount.
Pass ``num_speakers`` explicitly when the speaker count is known.
See [Benchmarks --- Limitations](benchmarks.md#limitations).

See: [`estimate_speakers()`](api.md#diarize.clustering.estimate_speakers)
and [`cluster_auto()`](api.md#diarize.clustering.cluster_auto)

## Stage 4: Spectral Clustering

Given the speaker count from Stage 3, spectral clustering groups
the embedding vectors using cosine similarity as the affinity metric.
The cosine similarity matrix is rescaled to [0, 1] and passed to
scikit-learn's `SpectralClustering`.

The initial spectral labels are refined with spherical centroid
reassignment over L2-normalised embeddings. This preserves the selected
speaker count while reducing unstable one-window label flips.
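A minimal sketch of that refinement (illustrative, not the library's implementation):

```python
# Spherical centroid reassignment (sketch): move each embedding to its
# nearest class centroid on the unit sphere. The cluster count is preserved
# as long as every centroid keeps at least one member.
import numpy as np

def reassign_spherical(normed: np.ndarray, labels: np.ndarray) -> np.ndarray:
    classes = np.array(sorted(set(labels.tolist())))
    centroids = np.stack([normed[labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    # Cosine similarity of unit vectors = dot product; pick the closest centroid.
    return classes[np.argmax(normed @ centroids.T, axis=1)]
```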

For long VAD segments, overlapping embedding windows are decoded into
non-overlapping timeline intervals using window-center midpoints. A
3-window majority filter smooths local label noise. Adjacent intervals
assigned to the same speaker are merged, and short segments that were
skipped during embedding extraction are assigned the label of the nearest
speaker.
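The smoothing and merging steps at the end can be sketched as follows (illustrative helpers, not the actual diarize internals):

```python
# 3-window majority vote over per-window labels, then merging of adjacent
# same-speaker intervals (sketch of the post-processing described above).
from collections import Counter

def majority_smooth(labels, width=3):
    half = width // 2
    return [Counter(labels[max(0, i - half):i + half + 1]).most_common(1)[0][0]
            for i in range(len(labels))]

def merge_adjacent(intervals):
    # intervals: (start, end, speaker) tuples with touching boundaries
    merged = [list(intervals[0])]
    for start, end, speaker in intervals[1:]:
        if speaker == merged[-1][2] and start <= merged[-1][1]:
            merged[-1][1] = end  # extend the running same-speaker interval
        else:
            merged.append([start, end, speaker])
    return [tuple(m) for m in merged]

print(majority_smooth(["A", "A", "B", "A", "A", "B", "B"]))
# the lone mid-turn 'B' is voted away: ['A', 'A', 'A', 'A', 'A', 'B', 'B']
```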

See: [`cluster_spectral()`](api.md#diarize.clustering.cluster_spectral)

5 changes: 3 additions & 2 deletions docs/index.md
| | diarize | pyannote community-1 | pyannote precision-2 |
|---|---|---|---|
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse dev) | **~5.2%** | ~11.2% | ~8.5% |
| CPU speed (RTF) | **0.12** | 0.86 | --- |

DER and speed numbers for pyannote are from their
[benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1).
The diarize number is from the VoxConverse dev evaluation described in
[Benchmarks](benchmarks.md).

## Next Steps
