- Overview
- How It Works
- Statistical Model
- Installation
- Usage
- Synthetic Ground Truth Generation
- Typical Workflow
- Python API
- Citation
## Overview

A Python package for simulating speaker diarization with LENA and VTC from ground truth vocalization data.
Diarization algorithms segment and classify speech into predefined speaker categories (including child (CHI), other child (OCH), female adult (FEM), and male adult (MAL)). In child development and language acquisition research, these segments are aggregated into vocalization counts (see below), which measure children's speech output and their speech input in naturalistic daylong recordings.
However, algorithms make errors (e.g., mistaking one speaker for another), which propagate into the measured vocalization counts and introduce biases in downstream analyses. Simulating diarization algorithms can help assess the sensitivity of a statistical analysis to classification errors. For instance, simulations can help determine whether correlations between speakers' speech quantities are entirely consistent with spurious correlations due to classification errors.
Diarization Simulation is a tool designed to simulate the distortion of vocalization counts from different speakers by diarization algorithms. It takes synthetic ground truth data as its input (the true speaker vocalization counts) and simulates measured vocalization counts based on the detection and confusion rates of LENA and VTC. The confusion rates of these algorithms were measured on calibration data consisting of 30 hours of manual annotations.
## How It Works

The simulation works by:
- Loading synthetic ground truth data (the "true" vocalization counts per speaker and per observation/recording)
- Loading pre-computed hyperparameters characterizing the behavior of the chosen algorithm (VTC or LENA)
- For each sample and observation, generating "measured" vocalization counts using a statistical model representing the algorithm's behavior.
## Statistical Model

The simulation uses a hierarchical model where:

- Detection/confusion rates $\lambda_{ij}$ follow: $\lambda_{ij} \sim \mathrm{Gamma}(\alpha_{ij}, \mu_{ij}/\alpha_{ij})$
- Detected vocalizations are generated using one of two distribution options:
  - The Poisson distribution: $\hat{n}_i \sim \mathrm{Poisson}\left(\sum_j \lambda_{ij}\, n_j\right)$
  - The Gamma distribution: $\hat{n}_i \sim \mathrm{Gamma}(k_i, \beta_i)$, a continuous approximation whose shape $k_i$ and rate $\beta_i$ are chosen to match the mean $\sum_j \lambda_{ij}\, n_j$ and the underdispersed variance of the counts

With:

| Parameter | Description |
|---|---|
| $\lambda_{ij}$ | Detection rate from speaker $j$ as speaker $i$ |
| $n_j$ | True vocalization count for speaker $j$ |
| $\phi$ | Underdispersion parameter |
| $\alpha_{ij}$, $\mu_{ij}$ | Shape and scale parameters for the detection rate prior |
| $k_i$, $\beta_i$ | Shape and rate parameters for the gamma detection model |
The original model assumed a Generalized Poisson distribution, since the vocalization counts are underdispersed with respect to the Poisson distribution. However, sampling from this distribution is more difficult, so the simulation offers two approximation schemes instead (illustrated in the sketch below):
- Poisson scheme: neglects the underdispersion of the count data
- Gamma scheme: better captures the true variance, but is only approximate for small counts
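The following sketch illustrates the two schemes on a single observation with NumPy. It mirrors the model described above but is not the package's implementation: the hyperparameter values (`mu`, `alpha`, `phi`) are made up for illustration, whereas the actual values for VTC and LENA are shipped with the package.

```python
import numpy as np

rng = np.random.default_rng(42)
speakers = ["CHI", "OCH", "FEM", "MAL"]

# True vocalization counts for one observation (n_j)
truth = np.array([120, 30, 200, 50])

# Illustrative hyperparameters of the detection-rate prior (mu_ij, alpha_ij)
mu = np.full((4, 4), 0.02) + np.diag([0.7, 0.5, 0.8, 0.6])
alpha = np.full((4, 4), 10.0)

# Draw detection/confusion rates: lambda_ij ~ Gamma(shape=alpha_ij, scale=mu_ij / alpha_ij)
lam = rng.gamma(shape=alpha, scale=mu / alpha)

# Expected detections attributed to speaker i: sum_j lambda_ij * n_j
expected = lam @ truth

# Poisson scheme: neglects the underdispersion of the counts
poisson_counts = rng.poisson(expected)

# Gamma scheme: continuous approximation with variance phi * mean (phi < 1),
# rounded to integer counts; phi is an illustrative underdispersion factor
phi = 0.8
gamma_counts = np.round(rng.gamma(shape=expected / phi, scale=phi)).astype(int)

for spk, p, g in zip(speakers, poisson_counts, gamma_counts):
    print(f"{spk}: poisson={p}, gamma={g}")
```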
## Installation

```bash
# Clone the repository
git clone https://github.com/LAAC-LSCP/diarization-simulation.git
cd diarization-simulation

# Install the package
pip install -e .
```

You will need Python 3.8+ to run this package. Key dependencies include:
- pandas
- numpy
- scipy
- numba
- tqdm
For the generation of synthetic ground-truth data, you will also need the following packages (a short installation sketch follows this list):
- `cmdstanpy` (see its installation instructions)
- ChildProject
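Both packages are available on PyPI; under that assumption, a typical setup might look like the sketch below (cmdstanpy also needs the CmdStan backend, which it can download itself).

```bash
pip install cmdstanpy ChildProject

# Download and build the CmdStan backend used by cmdstanpy
python -c "import cmdstanpy; cmdstanpy.install_cmdstan()"
```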
## Usage

The package can be used both programmatically (in Python scripts, notebooks, etc.) and via a command-line interface.
The main command-line interface can be accessed through `diarization-simulate`:
```bash
diarization-simulate --truth path/to/truth.csv \
    --output path/to/output.csv \
    --algo vtc \
    --samples 1000 \
    --distribution poisson
```

| Argument | Description | Default |
|---|---|---|
| `--truth` | Path to the synthetic truth dataset (in CSV format) | Required |
| `--output` | Location of the output file | Required |
| `--output-format` | Output file format (`csv`, `parquet`, or `npz`) | `csv` |
| `--algo` | Algorithm to simulate (`vtc` or `lena`) | Required |
| `--samples` | Number of samples per observation | 1000 |
| `--average-hyperpriors` | Use the mean value of the hyperpriors (mu and alpha) | False |
| `--unique-hyperpriors` | Use fixed hyperpriors (mu and alpha) throughout all samples | False |
| `--distribution` | Distribution for vocalization counts (`poisson` or `gamma`) | `poisson` |
| `--seed` | Random seed for reproducibility | None |
The input CSV must contain the following columns:
- `observation`: Unique identifier for each recording/observation
- `CHI`: Child vocalization count
- `OCH`: Other child vocalization count
- `FEM`: Female adult vocalization count
- `MAL`: Male adult vocalization count
Example:
```
observation,CHI,OCH,FEM,MAL
1,120,30,200,50
2,90,15,180,70
3,150,25,220,45
```

The output will contain the following columns:

- `sample`: Sample number (0 to `n_samples`-1)
- `observation`: Original observation identifier
- `CHI`: Simulated child vocalization detections
- `OCH`: Simulated other child vocalization detections
- `FEM`: Simulated female adult vocalization detections
- `MAL`: Simulated male adult vocalization detections
Example output:
```
sample,observation,CHI,OCH,FEM,MAL
0,1,118,28,195,52
0,2,87,16,175,73
0,3,145,23,215,48
1,1,122,31,198,49
1,2,92,14,182,68
...
```

## Synthetic Ground Truth Generation

The package includes a tool for generating synthetic ground truth data for a given corpus, using manual annotations to infer a realistic speech distribution. The target corpus must be compatible with the ChildProject Python package (which should also be installed).
```bash
truth-simulate --corpus path/to/corpus \
    --annotator annotation_set_name \
    --output path/to/ground_truth.csv \
    --samples 1000
```

| Argument | Description | Default |
|---|---|---|
| `--corpus` | Path to the input ChildProject corpus | Required |
| `--annotator` | Annotation set containing the manual annotations | Required |
| `--output` | Location of the output file | Required |
| `--recordings` | Path to a CSV dataframe containing the list of recordings | None |
| `--samples` | Number of samples to generate | 1000 |
| `--mode` | Sample from the mode of the posterior distribution of hyperparameters | False |
| `--show-distribution` | Show the marginal distribution of speech for each speaker according to the manual annotations | False |
The `truth-simulate` tool uses a Bayesian hierarchical model to infer vocalization rate distributions from sparse manual annotations and then generates complete ground truth datasets. The process works as follows (a rough sketch of the generative step follows the list):
- Load corpus data: Reads a ChildProject corpus containing recordings and manual annotations
- Extract annotation statistics: Counts vocalizations per speaker type (CHI, OCH, FEM, MAL) in manually annotated segments
- Fit hierarchical model: Uses Stan to fit a Gamma-Poisson model that estimates vocalization rates per speaker across the corpus
- Generate samples: Produces synthetic ground truth vocalization counts for all recordings in the corpus
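As a rough illustration of the Gamma-Poisson generative step (not the actual Stan model shipped with the package), the sketch below draws a per-recording vocalization rate for each speaker from a Gamma distribution and then a Poisson count from that rate. All hyperparameter values and recording durations are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
speakers = ["CHI", "OCH", "FEM", "MAL"]

# Illustrative corpus-level hyperparameters: average vocalizations per hour
# per speaker, and a concentration parameter controlling how much individual
# recordings deviate from the corpus average.
mean_rate = {"CHI": 120.0, "OCH": 25.0, "FEM": 200.0, "MAL": 55.0}
concentration = 5.0
durations_hours = [2.0, 8.0, 10.0]  # one entry per recording

rows = []
for r, hours in enumerate(durations_hours):
    row = {"recording_filename": f"recording_{r:03d}.wav"}
    for spk in speakers:
        # Gamma-distributed hourly rate, then a Poisson count over the duration
        rate = rng.gamma(shape=concentration, scale=mean_rate[spk] / concentration)
        row[spk] = int(rng.poisson(rate * hours))
    rows.append(row)

print(rows)
```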
The output CSV contains synthetic ground truth data with the following columns:
- `recording_filename`: Original recording filename
- `observation`: Unique identifier combining recording filename and sample number (e.g., "recording_001.wav,0")
- `CHI`: Simulated child vocalization count
- `OCH`: Simulated other child vocalization count
- `FEM`: Simulated female adult vocalization count
- `MAL`: Simulated male adult vocalization count
Example output:
```
recording_filename,observation,CHI,OCH,FEM,MAL
recording_001.wav,"recording_001.wav,0",145,23,198,67
recording_002.wav,"recording_002.wav,0",112,18,176,45
recording_001.wav,"recording_001.wav,1",138,25,203,72
recording_002.wav,"recording_002.wav,1",119,16,181,49
...
```

The output contains K×N rows, where K is the number of recordings and N the number of samples requested.
## Typical Workflow

A complete simulation workflow typically involves two steps:
- Generate ground truth from your corpus annotations:
```bash
truth-simulate --corpus /path/to/corpus \
    --annotator human_annotations \
    --output ground_truth.csv \
    --samples 100
```

- Simulate diarization on the generated ground truth:
```bash
diarization-simulate --truth ground_truth.csv \
    --output simulated_detections.csv \
    --algo vtc \
    --samples 100
```

The output `simulated_detections.csv` will contain 100×100×K rows, where K is the number of recordings in the dataset.
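As a quick sanity check (assuming the default CSV output format and the file names used above), you can verify that the number of rows matches this expectation:

```python
import pandas as pd

truth = pd.read_csv("ground_truth.csv")
detections = pd.read_csv("simulated_detections.csv")

# K recordings, 100 truth samples each, 100 diarization samples per observation
n_recordings = truth["recording_filename"].nunique()
print(len(detections), "rows; expected:", 100 * 100 * n_recordings)
```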
## Python API

```python
import pandas as pd
from diarization_simulation import simulate_diarization

# Create or load your truth data
truth_df = pd.DataFrame(
    {
        'observation': [1, 2, 3],
        'CHI': [120, 90, 150],
        'OCH': [30, 15, 25],
        'FEM': [200, 180, 220],
        'MAL': [50, 70, 45]
    }
)

# Simulate detections
results = simulate_diarization(
    truth_data=truth_df,
    algorithm="vtc",
    distribution="poisson",
    n_samples=1000,
    random_seed=42
)

print(f"Generated {len(results)} detection samples")
print(results.head())
```

Truth data can also be loaded from a file, and the simulation tuned through additional parameters:

```python
# Load your data
truth_data = pd.read_csv("my_truth_data.csv")

# Quick simulation for analysis
results = simulate_diarization(
    truth_data=truth_data,
    algorithm="vtc",
    n_samples=100,
    hyperprior_mode="unique",  # Same hyperpriors for all samples
    verbose=False  # Disable progress bar
)

# Analyze results
mean_detections = results.groupby('observation')[['CHI', 'OCH', 'FEM', 'MAL']].mean()
print("Mean detections per observation:")
print(mean_detections)
```

`simulate_diarization()` function parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| `truth_data` | str or DataFrame | Path to CSV file or pandas DataFrame with truth data | Required |
| `algorithm` | str | Algorithm to simulate (`"vtc"` or `"lena"`) | `"vtc"` |
| `distribution` | str | Distribution type (`"poisson"` or `"gamma"`) | `"poisson"` |
| `n_samples` | int | Number of samples to generate per observation | 1000 |
| `hyperprior_mode` | str | Hyperprior handling (`"sample"`, `"average"`, `"unique"`) | `"sample"` |
| `random_seed` | int or None | Random seed for reproducibility | None |
| `verbose` | bool | Show progress bar | True |
Hyperprior modes (compared in the sketch below):

- `"sample"`: Each sample gets its own hyperpriors (captures algorithm uncertainty)
- `"average"`: Use mean hyperprior values (reduced variance)
- `"unique"`: Same hyperpriors for all samples (minimal variance)
Example workflow:

```python
import pandas as pd
from diarization_simulation import simulate_diarization

# Load your ground truth data
truth_data = pd.read_csv("ground_truth.csv")

# Run simulations with different parameters
algorithms = ["vtc", "lena"]
distributions = ["poisson", "gamma"]

results = {}
for algo in algorithms:
    for dist in distributions:
        key = f"{algo}_{dist}"
        results[key] = simulate_diarization(
            truth_data=truth_data,
            algorithm=algo,
            distribution=dist,
            n_samples=1000,
            random_seed=42  # For reproducibility
        )

# Compare results
for key, result in results.items():
    correlation = result[['CHI', 'FEM']].corr().iloc[0, 1]
    print(f"{key}: CHI-FEM correlation = {correlation:.3f}")
```

## Citation

If you use this package, please mention both of the following references:
```bibtex
@online{diarization-simulation,
  author = {Gautheron, Lucas},
  year = {2025},
  title = {Diarization Simulation: A Python package for simulating speaker diarization with {LENA and VTC} from ground truth vocalization data},
  url = {https://github.com/LAAC-LSCP/diarization-simulation}
}

@misc{Gautheron2025,
  title = {Classification errors distort findings in automated speech processing: examples and solutions from child-development research},
  url = {http://dx.doi.org/10.31234/osf.io/u925y_v1},
  author = {Gautheron, Lucas and Kidd, Evan and Malko, Anton and Lavechin, Marvin and Cristia, Alejandrina},
  year = {2025}
}
```