CPU diarization benchmark service focused on reproducible, hardware-explicit comparisons.
It runs diarization systems on local datasets, computes standardized metrics, and stores job/report state in FoxNose.
Diarization results are often hard to compare fairly because teams use different datasets, hardware, threading, and reporting formats. This repository provides one execution protocol and one report format so benchmark numbers are directly comparable and auditable.
- Runs competing diarization systems under the same CPU conditions
- Tracks both quality and speed, not only DER
- Captures machine/runtime context (`machine_fingerprint`) for reproducibility
- Produces structured JSON artifacts ready for publishing and regression tracking
- Works as an API service, so teams can automate benchmark pipelines
- Fixed thread count per run (`cpu_threads`)
- Optional warmup and multi-run execution (`warmup_files`, `n_runs`)
- Explicit dataset selection metadata (`dataset_manifest`)
- Standardized metric definitions and report schema across systems

- DER (`collar=0.25`, `skip_overlap=True`)
- Speaker count estimation quality (`exact`, `within_1`, `mae`, bucketed stats)
- CPU speed (`RTF`)
- Multi-run stability for DER/RTF (`mean`, `median`, `p95`, `std`)
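As a rough illustration of how these metrics combine (a sketch, not the service's actual implementation), per-file RTF and the multi-run summary can be computed like this; the nearest-rank `p95` here is an assumption, the service may interpolate differently:

```python
import statistics

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: CPU processing time divided by audio duration."""
    return processing_seconds / audio_seconds

def summarize(values: list[float]) -> dict:
    """mean/median/p95/std summary, as reported per metric across runs."""
    ordered = sorted(values)
    # nearest-rank p95 (assumption; the real aggregation may differ)
    idx = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "p95": ordered[idx],
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

print(rtf(30.0, 120.0))  # 0.25: processed 4x faster than real time
der_runs = [0.112, 0.108, 0.115]  # hypothetical per-run DER values, n_runs=3
print(summarize(der_runs))
```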
- API key protected HTTP API
- Async jobs with progress and ETA
- Multiple systems per job (`diarize`, `pyannote`, custom command runners)
- Repeat-aware runs (`n_runs`) with fixed CPU threading (`cpu_threads`)
- Structured reports containing `metrics_summary`, `run_config`, `dataset_manifest`, `machine_fingerprint`, and per-item rows
- Fly.io profile for on-demand dedicated CPU machines
- API and worker: FastAPI
- Persistent storage and query model: FoxNose
- Runtime flow: create job -> run benchmark worker -> persist report summary + report items
- Deployment target: ready-to-run on Fly.io, but the same containerized service can be deployed on other practical targets with long-running containers and persistent storage (for example: VM, Render, Railway, or similar PaaS)
All `/v1/*` routes require the `X-API-Key` header (or a custom `BENCH_API_KEY_HEADER`).
- `POST /v1/jobs` -> create job, returns `job_id`
- `GET /v1/jobs/{job_id}` -> status/progress/ETA
- `GET /v1/jobs/{job_id}/report` -> final report JSON
- `GET /v1/reports` -> paginated report list
- `GET /v1/datasets` -> resolved dataset catalog
```bash
cp .env.example .env
# fill BENCH_API_KEYS and FOXNOSE_* values
mkdir -p data/work data/reports data/venvs data/datasets
export BENCH_DATASETS_FILE=./config/datasets.open_track.json
uvicorn app.main:app --host 0.0.0.0 --port 8080
```

```bash
cp .env.example .env
# fill BENCH_API_KEYS and FOXNOSE_* values
mkdir -p data/work data/reports data/venvs data/datasets
# docker-compose already points to /app/config/datasets.docker.json (Open Track profile)
docker compose build
docker compose up -d
docker compose logs -f benchmark-api
```

By default, Docker mounts:

- `./data/datasets` -> `/data/datasets` (ro)
- `./data/work` -> `/app/data/work`
- `./data/reports` -> `/app/data/reports`
- `./data/venvs` -> `/app/data/venvs`
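A docker-compose volumes stanza consistent with these defaults might look like the following (the service name `benchmark-api` matches the log command above; the repo's actual compose file may differ):

```yaml
services:
  benchmark-api:
    volumes:
      - ./data/datasets:/data/datasets:ro   # datasets are mounted read-only
      - ./data/work:/app/data/work
      - ./data/reports:/app/data/reports
      - ./data/venvs:/app/data/venvs
```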
```json
{
  "dataset_id": "voxconverse",
  "n_runs": 3,
  "cpu_threads": 2,
  "warmup_files": 1,
  "systems": [
    {
      "system_id": "diarize",
      "version": "0.1.0",
      "params": {
        "min_speakers": 1,
        "max_speakers": 20
      }
    }
  ]
}
```

Notes:
- `n_runs`: optional, default `1`, max `20`
- `cpu_threads`: optional, defaults to `BENCH_DEFAULT_CPU_THREADS` when set, otherwise all logical CPUs
Each completed report includes:
- `metrics_summary`, `der_summary`, `rtf_summary`
- `speaker_count_table`
- `run_config` (resolved run/thread settings)
- `dataset_manifest` (selection hash + dataset stats)
- `machine_fingerprint` (OS/CPU/RAM/Python/package versions + thread pinning)

Per-item rows (`benchmark_report_items`) include `run_index`.
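The kind of context `machine_fingerprint` records can be captured with the standard library alone; field names here are illustrative, the real schema may differ:

```python
import os
import platform
import sys

def machine_fingerprint(cpu_threads: int) -> dict:
    """Collect hardware/runtime context for a reproducible benchmark record."""
    return {
        "os": platform.platform(),
        "cpu_model": platform.processor(),
        "logical_cpus": os.cpu_count(),
        "python": sys.version.split()[0],
        "cpu_threads": cpu_threads,  # the thread pinning used for this run
    }

fp = machine_fingerprint(cpu_threads=2)
print(fp)
```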
Two ways to configure datasets:
- via `BENCH_DATASETS_FILE` (JSON list)
- via default layout under `BENCH_DATASETS_ROOT`

Example file: `config/datasets.example.json`
Open Track profile: `config/datasets.open_track.json`
Default expected dataset structure:
```
<root>/<dataset_id>/
  audio/
  rttm/
```
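A simplified readiness check over this layout, in the spirit of `open_track_check.py` (the file names and extensions assumed here are illustrative):

```python
import tempfile
from pathlib import Path

def matched_files(dataset_dir: Path) -> list[str]:
    """Return stems present in both audio/ and rttm/ -- the items a
    benchmark run could actually score."""
    audio = {p.stem for p in (dataset_dir / "audio").glob("*.wav")}
    rttm = {p.stem for p in (dataset_dir / "rttm").glob("*.rttm")}
    return sorted(audio & rttm)

# demo on a throwaway layout: one matched stem, one audio file missing its RTTM
root = Path(tempfile.mkdtemp()) / "voxconverse"
for sub in ("audio", "rttm"):
    (root / sub).mkdir(parents=True)
(root / "audio" / "abjxc.wav").touch()
(root / "rttm" / "abjxc.rttm").touch()
(root / "audio" / "afjiv.wav").touch()
print(matched_files(root))  # ['abjxc']
```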
Open Track dataset IDs configured in this repo:
`voxconverse`, `ami_ihm`, `ami_sdm`, `aishell4`, `alimeeting_ch1`, `msdwild`
Important:

- "Configured" means the dataset IDs are recognized by the API/catalog.
- It does not mean data is already downloaded and ready.
Auto-prepared in this repository (helper scripts/workflow):

- `voxconverse` (workflow mode requires `voxconverse_archive_url` or pre-populated paths)
- `aishell4`
- `alimeeting_ch1`
Manual preparation required:

- `ami_ihm`
- `ami_sdm`
- `msdwild`
Note:

- The service expects a normalized local layout (`audio` + `rttm`) for each dataset.
- If the original corpus layout differs, pre-convert or symlink into this structure.
Open Track helper scripts:
```bash
# 1) Create expected directory layout under ./data/datasets
./scripts/open_track_init_layout.sh

# 2) Check readiness (audio/rttm/matched files)
./scripts/open_track_check.py --strict

# 3) API smoke run (oracle_reference, limit_files=1)
BENCH_API_KEY="..." ./scripts/open_track_smoke_api.py \
  --base-url "http://127.0.0.1:8080"
```

Quick local download for publicly available Open Track corpora:
```bash
# AISHELL-4 (SLR111 test split)
mkdir -p data/downloads data/downloads/aishell4_test
curl -L --fail -o data/downloads/aishell4_test.tar.gz \
  "https://openslr.trmal.net/resources/111/test.tar.gz"
tar -xzf data/downloads/aishell4_test.tar.gz -C data/downloads/aishell4_test
ln -sfn "$(pwd)/data/downloads/aishell4_test/test/wav" data/datasets/aishell4/audio
ln -sfn "$(pwd)/data/downloads/aishell4_test/test/TextGrid" data/datasets/aishell4/rttm

# AliMeeting (SLR119 eval split, far) + TextGrid -> RTTM conversion
mkdir -p data/downloads data/downloads/alimeeting_eval
curl -L --fail -o data/downloads/alimeeting_eval.tar.gz \
  "https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Eval_Ali.tar.gz"
tar -xzf data/downloads/alimeeting_eval.tar.gz -C data/downloads/alimeeting_eval
python ./scripts/open_track_prepare_alimeeting_eval.py --clean

# Verify
python3 ./scripts/open_track_check.py
```

Notes:
- `ami_ihm`, `ami_sdm`, and `msdwild` are not auto-downloaded in this repo.
- You can bind already downloaded corpora into the local layout with `./scripts/open_track_bind_sources.py`.
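The TextGrid -> RTTM step performed by the prepare script boils down to emitting one RTTM `SPEAKER` record per speech interval. A minimal sketch of the output format (the real script additionally handles TextGrid parsing and channel selection; note field 5 is a duration, not an end time):

```python
def rttm_line(file_id: str, onset: float, offset: float, speaker: str) -> str:
    """Format one RTTM SPEAKER record for a speech interval."""
    dur = offset - onset
    return f"SPEAKER {file_id} 1 {onset:.3f} {dur:.3f} <NA> <NA> {speaker} <NA> <NA>"

print(rttm_line("meeting_001", 12.5, 15.0, "spk1"))
# SPEAKER meeting_001 1 12.500 2.500 <NA> <NA> spk1 <NA> <NA>
```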
If you currently have only VoxConverse:
```bash
export BENCH_DATASETS_FILE=./config/datasets.vox_only.json
BENCH_API_KEY="..." ./scripts/open_track_smoke_api.py \
  --base-url "http://127.0.0.1:8080" \
  --dataset-id voxconverse
```

`fly.toml` is configured for on-demand CPU benchmarking:

- auto-start on request
- auto-stop when idle (`min_machines_running = 0`)
- dedicated performance CPU (`cpus = 2`, `memory_mb = 4096`)
- `BENCH_DEFAULT_CPU_THREADS = 2` to match the Fly VM profile by default
- persistent volume mounted at `/data` for datasets
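Those settings correspond to a `fly.toml` roughly like the following sketch (section and key names follow Fly.io's config format; the exact contents of this repo's `fly.toml` may differ):

```toml
[http_service]
  auto_start_machines = true   # auto-start on request
  auto_stop_machines = true    # auto-stop when idle
  min_machines_running = 0

[[vm]]
  cpu_kind = "performance"     # dedicated performance CPU
  cpus = 2
  memory_mb = 4096

[env]
  BENCH_DEFAULT_CPU_THREADS = "2"   # match the VM profile by default

[mounts]
  source = "bench_data"
  destination = "/data"        # persistent volume for datasets
```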
Typical deploy flow:
```bash
flyctl apps create diarization-benchmarks
flyctl volumes create bench_data --region iad --size 120
flyctl secrets set \
  BENCH_API_KEYS="..." \
  FOXNOSE_BASE_URL="https://api.foxnose.net" \
  FOXNOSE_ENV_KEY="..." \
  FOXNOSE_AUTH_MODE="simple" \
  FOXNOSE_PUBLIC_KEY="..." \
  FOXNOSE_SECRET_KEY="..." \
  FOXNOSE_BENCH_JOBS_FOLDER="benchmark_jobs" \
  FOXNOSE_BENCH_REPORTS_FOLDER="benchmark_reports" \
  FOXNOSE_BENCH_REPORT_ITEMS_FOLDER="benchmark_report_items"
flyctl deploy --remote-only --config fly.toml
```

Prepare datasets on the Fly volume via GitHub Actions:
- Workflow: `.github/workflows/prepare-fly-datasets.yml`
- Trigger: Actions -> Prepare Fly Datasets -> Run workflow
- Selectable datasets in this workflow: `voxconverse`, `aishell4`, `alimeeting_ch1`
- This workflow does not prepare: `ami_ihm`, `ami_sdm`, `msdwild`
- It will:
  - start a Fly machine if needed
  - download selected datasets to `/data/downloads`
  - prepare normalized layout in `/data/datasets`
  - run `open_track_check.py` on the remote machine
  - optionally stop the machine after completion

Required GitHub secret: `FLY_API_TOKEN`

Note:

- For `voxconverse`, you can pass the `voxconverse_archive_url` input (zip/tar/tar.gz with wav+rttm), or pre-populate `/data/datasets/voxconverse/{audio,rttm}` manually.
Script: `scripts/publish_benchmarks.py`

Generates:

- `benchmarks/runs/YYYY-MM-DD/<job_id>.json`
- `benchmarks/latest.json`
- `benchmarks/latest.md`
- `benchmarks/history.csv`
Example:
```bash
python scripts/publish_benchmarks.py \
  --base-url "http://127.0.0.1:8080" \
  --api-key "<service_api_key>" \
  --job-id "<job_id>"
```

By default, published artifacts are sanitized for public use:

- removes `source.base_url`
- removes `report.artifact_paths.run_dir`
- removes `report.machine_fingerprint.hostname`
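The sanitization amounts to dropping a few nested keys from the report JSON before it is written out; a sketch of that step, with the nesting of the dotted paths assumed from the field names above:

```python
def sanitize(report: dict) -> dict:
    """Remove the internal fields listed above before publishing."""
    removed = [
        ("source", "base_url"),
        ("report", "artifact_paths", "run_dir"),
        ("report", "machine_fingerprint", "hostname"),
    ]
    for *parents, leaf in removed:
        node = report
        for key in parents:
            node = node.get(key, {})   # tolerate missing sections
        node.pop(leaf, None)
    return report

doc = {
    "source": {"base_url": "http://internal", "job_id": "j1"},
    "report": {"machine_fingerprint": {"hostname": "host-1", "os": "linux"}},
}
print(sanitize(doc))
```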
If you explicitly need internal fields, run with:
```bash
python scripts/publish_benchmarks.py --include-internal-fields
```

- Never commit `.env` or private keys
- Keep only `.env.example` with placeholders
- Store real credentials in CI/Fly secrets
- Keep `data/datasets` and `data/downloads` outside git history
Pre-publish scan:
```bash
./scripts/public_repo_check.sh
```

This repository is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).