Speaker-diarized transcription CLI powered by Cohere Transcribe and pyannote
```
uv sync
```

Both the Cohere Transcribe and pyannote speaker-diarization models are gated. You must:
- Accept the terms on each model page linked above.
- Log in with the Hugging Face CLI:
```
uvx hf auth login
```

This will prompt you for a User Access Token with read access.
```
uv run python main.py --audio recording.wav
```

```
uv run python main.py --youtube "https://www.youtube.com/watch?v=VIDEO_ID"
```

| Flag | Default | Description |
|---|---|---|
| `--audio` | | Path to a local audio file |
| `--youtube` | | YouTube URL to download and transcribe |
| `--language` | `en` | Language code (e.g. `en`, `fr`, `de`, `es`, `ja`, `zh`) |
| `--num-speakers` | auto | Fixed number of speakers (auto-detected if omitted) |
| `--backend` | `auto` | ASR backend: `mlx`, `cuda`, `cpu`, or `auto` |
| `--output` | `transcription.txt` | Output file path |
| `--merge-gap` | `0.35` | Max gap (seconds) to merge same-speaker segments |
| `--min-island` | `0.20` | Min duration (seconds) for isolated speaker segments |
| `--left-pad` | `0.35` | Padding before each segment (seconds) |
| `--right-pad` | `0.05` | Padding after each segment (seconds) |
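The merge and padding flags describe a common diarization post-processing step: consecutive segments from the same speaker are merged when the gap between them is small, very short isolated segments are dropped, and each surviving segment is padded at both ends. A minimal sketch of that logic, assuming `(start, end, speaker)` tuples (this function is illustrative, not the CLI's actual internals):

```python
def postprocess(segments, merge_gap=0.35, min_island=0.20,
                left_pad=0.35, right_pad=0.05):
    """Merge, filter, and pad (start, end, speaker) segments.

    Illustrative sketch only -- the real CLI's internals may differ.
    """
    # Merge consecutive same-speaker segments separated by <= merge_gap.
    merged = []
    for start, end, speaker in sorted(segments):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    # Drop isolated segments shorter than min_island.
    kept = [(s, e, spk) for s, e, spk in merged if e - s >= min_island]
    # Pad each segment, clamping the start at zero.
    return [(max(0.0, s - left_pad), e + right_pad, spk) for s, e, spk in kept]
```

With the defaults, a 0.2-second pause inside one speaker's turn is merged away, while a stray 0.1-second segment attributed to another speaker is discarded as noise.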
By default (`--backend auto`), the CLI picks the best available backend:
- MLX on Apple silicon
- CUDA on systems with an NVIDIA GPU
- CPU otherwise
You can override this with `--backend mlx`, `--backend cuda`, or `--backend cpu`. Passing `--backend cpu` also forces diarization to run on CPU.
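The auto-selection order above can be approximated as follows (a hypothetical helper; the actual detection code in `main.py` may differ):

```python
import platform


def pick_backend(requested="auto"):
    """Pick an ASR backend following the precedence described above.

    Hypothetical sketch -- the CLI's real detection logic may differ.
    """
    if requested != "auto":
        return requested
    # MLX is only available on Apple-silicon Macs.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"
    # Fall back to CUDA when an NVIDIA GPU is visible to PyTorch.
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"
```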
```
uv run python main.py --audio meeting.wav --num-speakers 3 --output meeting.txt
```

Output (`meeting.txt`):
```
SPEAKER_00 [00:00.00 - 00:12.34]:
Welcome everyone to the meeting. Let's start with the first agenda item.

SPEAKER_01 [00:12.34 - 00:25.67]:
Thanks. I wanted to discuss the timeline for the next release.
```
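The output format is easy to post-process. As a sketch, a small parser for the `SPEAKER [mm:ss.ss - mm:ss.ss]:` header lines might look like this (a hypothetical helper, not part of the CLI):

```python
import re

# Matches header lines like: SPEAKER_00 [00:00.00 - 00:12.34]:
HEADER = re.compile(r"^(SPEAKER_\d+) \[(\d+):(\d+\.\d+) - (\d+):(\d+\.\d+)\]:$")


def parse_transcript(text):
    """Parse transcript blocks into dicts with speaker, times, and text.

    Hypothetical helper for the output format shown above.
    """
    entries = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            speaker, m1, s1, m2, s2 = m.groups()
            entries.append({
                "speaker": speaker,
                "start": int(m1) * 60 + float(s1),
                "end": int(m2) * 60 + float(s2),
                "text": "",
            })
        elif entries and line.strip():
            # Non-empty lines after a header are that segment's text.
            entries[-1]["text"] += line.strip()
    return entries
```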