speaker-diarization-benchmarks

CPU diarization benchmark service focused on reproducible, hardware-explicit comparisons.

It runs diarization systems on local datasets, computes standardized metrics, and stores job/report state in FoxNose.

Why this project exists

Diarization results are often hard to compare fairly because teams use different datasets, hardware, threading, and reporting formats. This repository provides one execution protocol and one report format so benchmark numbers are directly comparable and auditable.

Practical value

Runs competing diarization systems under the same CPU conditions
Tracks both quality and speed, not only DER
Captures machine/runtime context (machine_fingerprint) for reproducibility
Produces structured JSON artifacts ready for publishing and regression tracking
Works as an API service, so teams can automate benchmark pipelines

Fairness controls

Fixed thread count per run (cpu_threads)
Optional warmup and multi-run execution (warmup_files, n_runs)
Explicit dataset selection metadata (dataset_manifest)
Standardized metric definitions and report schema across systems
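
One common way to hold the thread count fixed for a subprocess-based runner is to cap the numeric libraries' thread pools through environment variables. The sketch below is an assumption about how cpu_threads could be applied, not a description of this service's internals; the helper name thread_limited_env is hypothetical.

```python
import os


def thread_limited_env(cpu_threads, base_env=None):
    """Return a copy of the environment with common thread-pool caps set.

    Hypothetical helper: these variables are read by OpenMP, MKL, OpenBLAS,
    and NumExpr; the service itself may pin threads differently.
    """
    if cpu_threads < 1:
        raise ValueError("cpu_threads must be >= 1")
    env = dict(os.environ if base_env is None else base_env)
    for name in (
        "OMP_NUM_THREADS",
        "MKL_NUM_THREADS",
        "OPENBLAS_NUM_THREADS",
        "NUMEXPR_NUM_THREADS",
    ):
        env[name] = str(cpu_threads)
    return env
```

The returned mapping can be passed as env= to subprocess.run so that every system under test starts with the same caps.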

What it measures

DER (collar=0.25, skip_overlap=True)
Speaker count estimation quality (exact, within_1, mae, bucketed stats)
CPU speed (RTF)
Multi-run stability for DER/RTF (mean, median, p95, std)
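
The speed, stability, and speaker-count metrics above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the nearest-rank p95 and population standard deviation used here are assumptions, since the service's exact conventions are not stated in this README.

```python
import math
import statistics


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def summarize_runs(values):
    """Mean, median, p95 (nearest-rank), and population std over runs."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "std": statistics.pstdev(ordered),
    }


def speaker_count_quality(pairs):
    """Score (reference_count, predicted_count) pairs."""
    errors = [abs(predicted - reference) for reference, predicted in pairs]
    if not errors:
        raise ValueError("need at least one pair")
    n = len(errors)
    return {
        "exact": sum(e == 0 for e in errors) / n,      # share with no error
        "within_1": sum(e <= 1 for e in errors) / n,   # share off by at most 1
        "mae": sum(errors) / n,                        # mean absolute error
    }
```

An RTF below 1.0 means the system processes audio faster than real time; summarize_runs can be applied to per-run DER or RTF values alike.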

Key capabilities

API-key-protected HTTP API
Async jobs with progress and ETA
Multiple systems per job (diarize, pyannote, custom command runners)
Repeat-aware runs (n_runs) with fixed CPU threading (cpu_threads)
Structured reports containing metrics_summary, run_config, dataset_manifest, machine_fingerprint, and per-item rows
Fly.io profile for on-demand dedicated CPU machines

How it works

API and worker: FastAPI
Persistent storage and query model: FoxNose
Runtime flow: create job -> run benchmark worker -> persist report summary + report items
Deployment target: ready to run on Fly.io; the same containerized service can run on any platform that supports long-running containers and persistent storage (for example a VM, Render, Railway, or a similar PaaS)

API

All /v1/* routes require the X-API-Key header (or the custom header name set in BENCH_API_KEY_HEADER).

POST /v1/jobs -> create job, returns job_id
GET /v1/jobs/{job_id} -> status/progress/eta
GET /v1/jobs/{job_id}/report -> final report JSON
GET /v1/reports -> paginated report list
GET /v1/datasets -> resolved dataset catalog
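
A minimal client sketch using only the Python standard library shows how these routes could be called. The helper names build_request and call are hypothetical, and apart from job_id no response field names are assumed, because the response schemas are not spelled out here.

```python
import json
import urllib.request


def build_request(base_url, api_key, method, path, payload=None,
                  header_name="X-API-Key"):
    """Build an authenticated request for a /v1/* route (nothing is sent)."""
    data = None
    headers = {header_name: api_key}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def call(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def create_job_and_get_status(base_url, api_key, job_payload):
    """Create a job, then fetch its status once; returns (job_id, status)."""
    job = call(build_request(base_url, api_key, "POST", "/v1/jobs", job_payload))
    job_id = job["job_id"]
    status = call(build_request(base_url, api_key, "GET", f"/v1/jobs/{job_id}"))
    return job_id, status
```

Against a locally running service, create_job_and_get_status("http://127.0.0.1:8080", "<service_api_key>", payload) would submit a job; the report can then be fetched from /v1/jobs/{job_id}/report once the job has finished.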

Quick start (local Python)

cp .env.example .env
# fill BENCH_API_KEYS and FOXNOSE_* values

mkdir -p data/work data/reports data/venvs data/datasets
export BENCH_DATASETS_FILE=./config/datasets.open_track.json
uvicorn app.main:app --host 0.0.0.0 --port 8080

Quick start (Docker)

cp .env.example .env
# fill BENCH_API_KEYS and FOXNOSE_* values

mkdir -p data/work data/reports data/venvs data/datasets
# docker-compose already points to /app/config/datasets.docker.json (Open Track profile)
docker compose build
docker compose up -d
docker compose logs -f benchmark-api

By default, Docker mounts:

./data/datasets -> /data/datasets (ro)
./data/work -> /app/data/work
./data/reports -> /app/data/reports
./data/venvs -> /app/data/venvs

Example job request

{
  "dataset_id": "voxconverse",
  "n_runs": 3,
  "cpu_threads": 2,
  "warmup_files": 1,
  "systems": [
    {
      "system_id": "diarize",
      "version": "0.1.0",
      "params": {
        "min_speakers": 1,
        "max_speakers": 20
      }
    }
  ]
}

Notes:

n_runs: optional, default 1, max 20
cpu_threads: optional, defaults to BENCH_DEFAULT_CPU_THREADS when set, otherwise all logical CPUs
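
The defaulting rules in these notes can be expressed as a small resolver. This is a sketch under stated assumptions: resolve_run_config is a hypothetical name, the lower bound of 1 for n_runs is assumed (only the default and the maximum are documented), and the service's own validation may differ.

```python
import os


def resolve_run_config(request, environ=None):
    """Apply the documented defaults for n_runs and cpu_threads."""
    environ = os.environ if environ is None else environ

    n_runs = request.get("n_runs", 1)  # documented default: 1
    if not 1 <= n_runs <= 20:          # documented maximum: 20
        raise ValueError("n_runs must be between 1 and 20")

    cpu_threads = request.get("cpu_threads")
    if cpu_threads is None:
        default = environ.get("BENCH_DEFAULT_CPU_THREADS")
        # Fall back to all logical CPUs when no default is configured.
        cpu_threads = int(default) if default else (os.cpu_count() or 1)
    if cpu_threads < 1:
        raise ValueError("cpu_threads must be at least 1")

    return {"n_runs": n_runs, "cpu_threads": cpu_threads}
```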

Reports

Each completed report includes:

metrics_summary, der_summary, rtf_summary
speaker_count_table
run_config (resolved run/thread settings)
dataset_manifest (selection hash + dataset stats)
machine_fingerprint (OS/CPU/RAM/Python/package versions + thread pinning)

Per-item rows (benchmark_report_items) include run_index.
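
When consuming reports downstream, a quick structural check and a per-run grouping are often useful. The sketch below assumes the fields listed above appear as top-level keys of the report JSON and that each per-item row is a dict carrying run_index; both helper names are hypothetical.

```python
EXPECTED_REPORT_KEYS = (
    "metrics_summary",
    "der_summary",
    "rtf_summary",
    "speaker_count_table",
    "run_config",
    "dataset_manifest",
    "machine_fingerprint",
)


def missing_report_keys(report):
    """Return the expected keys that are absent from a report dict."""
    return [key for key in EXPECTED_REPORT_KEYS if key not in report]


def group_items_by_run(items):
    """Group per-item rows by their run_index value."""
    grouped = {}
    for row in items:
        grouped.setdefault(row["run_index"], []).append(row)
    return grouped
```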

Dataset configuration

Two ways to configure datasets:

via BENCH_DATASETS_FILE (JSON list)
via default layout under BENCH_DATASETS_ROOT

Example file: config/datasets.example.json
Open Track profile: config/datasets.open_track.json

Default expected dataset structure:

<root>/<dataset_id>/
  audio/
  rttm/
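
A readiness check for this layout can be approximated by matching audio and RTTM files on their base names. This is a hypothetical sketch, not the logic of open_track_check.py; the accepted audio extensions and the assumption that files pair up by stem are illustrative.

```python
from pathlib import Path


def match_dataset_files(root, dataset_id, audio_suffixes=(".wav", ".flac")):
    """Pair audio and RTTM files by stem under <root>/<dataset_id>/."""
    base = Path(root) / dataset_id
    audio = {
        path.stem
        for path in (base / "audio").glob("*")
        if path.suffix.lower() in audio_suffixes
    }
    rttm = {path.stem for path in (base / "rttm").glob("*.rttm")}
    return {
        "matched": sorted(audio & rttm),
        "audio_only": sorted(audio - rttm),   # audio without a reference
        "rttm_only": sorted(rttm - audio),    # reference without audio
    }
```

A dataset is usable for benchmarking only when matched is non-empty; entries in audio_only or rttm_only usually point to a naming mismatch or an incomplete download.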

Open Track dataset IDs configured in this repo:

voxconverse
ami_ihm
ami_sdm
aishell4
alimeeting_ch1
msdwild

Important:

"Configured" means the dataset IDs are recognized by the API/catalog.
It does not mean the data is already downloaded and ready.

Auto-prepared in this repository (helper scripts/workflow):

voxconverse (workflow mode requires voxconverse_archive_url or pre-populated paths)
aishell4
alimeeting_ch1

Manual preparation required:

ami_ihm
ami_sdm
msdwild

Note:

The service expects a normalized local layout (audio + rttm) for each dataset.
If the original corpus layout differs, pre-convert or symlink it into this structure.

Open Track helper scripts:

# 1) Create expected directory layout under ./data/datasets
./scripts/open_track_init_layout.sh

# 2) Check readiness (audio/rttm/matched files)
./scripts/open_track_check.py --strict

# 3) API smoke run (oracle_reference, limit_files=1)
BENCH_API_KEY="..." ./scripts/open_track_smoke_api.py \
  --base-url "http://127.0.0.1:8080"

Quick local download for publicly available Open Track corpora:

# AISHELL-4 (SLR111 test split)
mkdir -p data/downloads data/downloads/aishell4_test
curl -L --fail -o data/downloads/aishell4_test.tar.gz \
  "https://openslr.trmal.net/resources/111/test.tar.gz"
tar -xzf data/downloads/aishell4_test.tar.gz -C data/downloads/aishell4_test
ln -sfn "$(pwd)/data/downloads/aishell4_test/test/wav" data/datasets/aishell4/audio
ln -sfn "$(pwd)/data/downloads/aishell4_test/test/TextGrid" data/datasets/aishell4/rttm

# AliMeeting (SLR119 eval split, far) + TextGrid -> RTTM conversion
mkdir -p data/downloads data/downloads/alimeeting_eval
curl -L --fail -o data/downloads/alimeeting_eval.tar.gz \
  "https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Eval_Ali.tar.gz"
tar -xzf data/downloads/alimeeting_eval.tar.gz -C data/downloads/alimeeting_eval
python ./scripts/open_track_prepare_alimeeting_eval.py --clean

# Verify
python3 ./scripts/open_track_check.py

Notes:

ami_ihm, ami_sdm, and msdwild are not auto-downloaded in this repo.
You can bind already-downloaded corpora into the local layout with ./scripts/open_track_bind_sources.py.

If you currently have only VoxConverse:

export BENCH_DATASETS_FILE=./config/datasets.vox_only.json
BENCH_API_KEY="..." ./scripts/open_track_smoke_api.py \
  --base-url "http://127.0.0.1:8080" \
  --dataset-id voxconverse

Fly.io deployment

fly.toml is configured for on-demand CPU benchmarking:

auto-start on request
auto-stop when idle
min_machines_running = 0
dedicated performance CPU
cpus = 2
memory_mb = 4096
BENCH_DEFAULT_CPU_THREADS = 2 to match the Fly VM profile by default
persistent volume mounted at /data for datasets

Typical deploy flow:

flyctl apps create diarization-benchmarks
flyctl volumes create bench_data --region iad --size 120

flyctl secrets set \
  BENCH_API_KEYS="..." \
  FOXNOSE_BASE_URL="https://api.foxnose.net" \
  FOXNOSE_ENV_KEY="..." \
  FOXNOSE_AUTH_MODE="simple" \
  FOXNOSE_PUBLIC_KEY="..." \
  FOXNOSE_SECRET_KEY="..." \
  FOXNOSE_BENCH_JOBS_FOLDER="benchmark_jobs" \
  FOXNOSE_BENCH_REPORTS_FOLDER="benchmark_reports" \
  FOXNOSE_BENCH_REPORT_ITEMS_FOLDER="benchmark_report_items"

flyctl deploy --remote-only --config fly.toml

Prepare datasets on Fly volume via GitHub Actions:

Workflow: .github/workflows/prepare-fly-datasets.yml
Trigger: Actions -> Prepare Fly Datasets -> Run workflow
Selectable datasets in this workflow:
- voxconverse
- aishell4
- alimeeting_ch1
This workflow does not prepare:
- ami_ihm
- ami_sdm
- msdwild
It will:
- start a Fly machine if needed
- download selected datasets to /data/downloads
- prepare normalized layout in /data/datasets
- run open_track_check.py on the remote machine
- optionally stop the machine after completion

Required GitHub secret:

FLY_API_TOKEN

Note:

For voxconverse, you can pass voxconverse_archive_url input (zip/tar/tar.gz with wav+rttm), or pre-populate /data/datasets/voxconverse/{audio,rttm} manually.

Publishing benchmark artifacts

Script: scripts/publish_benchmarks.py

Generates:

benchmarks/runs/YYYY-MM-DD/<job_id>.json
benchmarks/latest.json
benchmarks/latest.md
benchmarks/history.csv

Example:

python scripts/publish_benchmarks.py \
  --base-url "http://127.0.0.1:8080" \
  --api-key "<service_api_key>" \
  --job-id "<job_id>"

By default, published artifacts are sanitized for public use:

removes source.base_url
removes report.artifact_paths.run_dir
removes report.machine_fingerprint.hostname
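
The sanitization step amounts to deleting three nested fields from the artifact. The sketch below illustrates that idea on a plain dict; sanitize_artifact is a hypothetical name, and the nesting is inferred from the dotted paths above rather than taken from scripts/publish_benchmarks.py.

```python
import copy

INTERNAL_FIELD_PATHS = (
    ("source", "base_url"),
    ("report", "artifact_paths", "run_dir"),
    ("report", "machine_fingerprint", "hostname"),
)


def sanitize_artifact(artifact, paths=INTERNAL_FIELD_PATHS):
    """Return a deep copy of the artifact with the listed fields removed."""
    cleaned = copy.deepcopy(artifact)
    for path in paths:
        node = cleaned
        # Walk down to the parent of the field; stop if the path is absent.
        for key in path[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(path[-1], None)
    return cleaned
```

Working on a deep copy keeps the original report intact, so the same in-memory object can still be written out with internal fields when needed.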

If you explicitly need internal fields, run with:

python scripts/publish_benchmarks.py --include-internal-fields

Security and public repo checklist

Never commit .env or private keys
Keep only .env.example with placeholders
Store real credentials in CI/Fly secrets
Keep data/datasets and data/downloads outside git history

Pre-publish scan:

./scripts/public_repo_check.sh

License

This repository is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
