Releases: FoxNoseTech/diarize

v0.1.1

06 Mar 17:20

This patch release fixes dependency compatibility for audio loading.

Fixed

  • Pinned torch and torchaudio to a compatible range:
    • torch>=1.13,<2.9
    • torchaudio>=0.13,<2.9
  • Prevents failures where newer torchaudio requires torchcodec.
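The effect of the pin can be illustrated with a small stdlib-only sketch; `in_pinned_range` is a hypothetical helper, not part of diarize, and it compares only major.minor for brevity (real installers resolve full PEP 440 versions):

```python
# Hypothetical helper (not part of diarize): check whether a version
# string falls inside the pinned torch range from this release.
# Compares only major.minor for brevity; real tools use full PEP 440.
def in_pinned_range(version, low=(1, 13), high=(2, 9)):
    major_minor = tuple(int(p) for p in version.split(".")[:2])
    return low <= major_minor < high

print(in_pinned_range("2.2.1"))  # → True: inside >=1.13,<2.9
print(in_pinned_range("2.9.0"))  # → False: excluded, since newer torchaudio requires torchcodec
```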

Docs

  • Clarified that diarize now installs a compatible torch/torchaudio range automatically.

No API changes.

v0.1.0 — Initial Release

01 Mar 11:30

diarize v0.1.0

Speaker diarization for Python — answers "who spoke when?" in any audio file. CPU-only, no GPU, no API keys, no account signup.

Highlights

  • ~10.8% DER on VoxConverse dev set — lower than pyannote's free models (community-1 and 3.1 legacy, both ~11.2%)
  • ~8x faster than real-time on CPU (RTF 0.12 vs pyannote community-1's 0.86)
  • Automatic speaker count detection via GMM BIC with silhouette refinement (1–7 speakers)
  • Zero setup friction: pip install diarize and you're done, no HuggingFace token or account needed

Pipeline

Silero VAD → WeSpeaker ResNet34-LM (ONNX) → GMM BIC → Spectral Clustering

All four stages run on CPU. All components are open-source with permissive licenses.
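As a rough illustration of the speaker-count stage, here is a NumPy-only toy version of BIC-based model selection. The real pipeline fits GMMs and refines with silhouette scores; this sketch substitutes hard-assignment k-means with one diagonal Gaussian per cluster, and all function names are illustrative:

```python
import numpy as np

def farthest_point_init(X, k):
    # Deterministic seeding: start from the first point, then repeatedly
    # take the point farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(1)
        centers.append(X[int(np.argmax(d))])
    return np.array(centers)

def kmeans_labels(X, k, iters=25):
    centers = farthest_point_init(X, k)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def bic(X, labels, k):
    # BIC = -2 * log-likelihood + (number of parameters) * log(n),
    # with one diagonal Gaussian per cluster plus mixture weights.
    n, d = X.shape
    loglik = 0.0
    for j in range(k):
        C = X[labels == j]
        if len(C) == 0:
            continue
        mu, var = C.mean(0), C.var(0) + 1e-6
        loglik += (-0.5 * (np.log(2 * np.pi * var) + (C - mu) ** 2 / var)).sum()
        loglik += len(C) * np.log(len(C) / n)  # log mixture weight
    n_params = k * 2 * d + (k - 1)  # means + variances + free weights
    return -2.0 * loglik + n_params * np.log(n)

def estimate_speakers(X, max_k=7):
    # Pick the cluster count with the lowest BIC, mirroring the
    # 1-7 speaker search described above.
    return min(range(1, max_k + 1), key=lambda k: bic(X, kmeans_labels(X, k), k))

# Two well-separated clouds of fake "embeddings" stand in for two speakers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(8, 0.3, (100, 2))])
print(estimate_speakers(X))  # → 2
```

The BIC penalty term is what keeps the search from always preferring more clusters: splitting a genuine single speaker raises the likelihood only slightly, while the parameter count (and hence the penalty) grows with every added cluster.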

Usage

from diarize import diarize

result = diarize("meeting.wav")
for seg in result.segments:
    print(f"  [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
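Segments in the shape shown above (start, end, speaker) can also be serialized to RTTM, the format most diarization scoring tools consume. A sketch using plain tuples; `to_rttm` and the `file_id` parameter are hypothetical, not part of the diarize API:

```python
# Sketch: serialize (start, end, speaker) segments to RTTM lines,
# the format expected by common diarization scorers.
# RTTM fields: type, file id, channel, onset, duration, then speaker label.
def to_rttm(segments, file_id="meeting"):
    lines = []
    for start, end, speaker in segments:
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)

print(to_rttm([(0.0, 2.5, "SPEAKER_00"), (2.5, 4.0, "SPEAKER_01")]))
```

Note that RTTM stores duration rather than end time, hence the `end - start` above.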

Known Limitations

  • Benchmarked on a single dataset (VoxConverse). Cross-dataset validation is planned.
  • Speaker count estimation degrades for 8+ speakers — pass num_speakers explicitly when known.
  • Overlapping speech is not modeled — each segment is assigned to one speaker.