- Overview
- How It Works
- Statistical Model
- Installation
- Usage
- Synthetic Ground Truth Generation
- Typical Workflow
- Python API
- Citation
## Overview

A Python package for simulating speaker diarization with LENA and VTC from ground truth vocalization data.
Diarization algorithms segment and classify speech into predefined speaker categories (including child (CHI), other child (OCH), female adult (FEM), and male adult (MAL)). In child development and language acquisition research, these segments are aggregated into vocalization counts (see below), which measure children's speech output and their speech input in naturalistic daylong recordings.
However, algorithms make errors (e.g., mistaking one speaker for another), which propagate into the measured vocalization counts and introduce biases in downstream analyses. Simulating diarization algorithms can help assess the sensitivity of a statistical analysis to classification errors. For instance, simulations can help determine whether correlations between speakers' speech quantities are entirely consistent with spurious correlations due to classification errors.
Diarization Simulation is a tool designed to simulate the distortion of vocalization counts from different speakers by diarization algorithms. It takes synthetic ground truth data as its input (the true speaker vocalization counts) and simulates measured vocalization counts based on the detection and confusion rates of LENA and VTC. The confusion rates of these algorithms were measured on calibration data consisting of 30 hours of manual annotations.
## How It Works

The simulation works by:
- Loading synthetic ground truth data (the "true" vocalization counts per speaker and per observation/recording)
- Loading pre-computed hyperparameters characterizing the behavior of the chosen algorithm (VTC or LENA)
- For each sample and observation, generating "measured" vocalization counts using a statistical model representing the algorithm's behavior.
## Statistical Model

The simulation uses a hierarchical model where:

- Detection/confusion rates $\lambda_{ij}$ follow: $\lambda_{ij} \sim \mathrm{Gamma}(\alpha_{ij}, \mu_{ij}/\alpha_{ij})$
- Detected vocalizations are generated using one of two distribution options:
  - The Poisson distribution: $\hat{n}_i \sim \mathrm{Poisson}\left(\sum_j \lambda_{ij}\, n_j\right)$
  - The Gamma distribution: $\hat{n}_i \sim \mathrm{Gamma}(k_i, \beta_i)$, a continuous approximation whose shape $k_i$ and rate $\beta_i$ are chosen to match the mean $\sum_j \lambda_{ij}\, n_j$ and the underdispersed variance of the counts

With:

| Parameter | Description |
|---|---|
| $\lambda_{ij}$ | Detection rate from speaker $j$ as speaker $i$ |
| $n_j$ | True vocalization count for speaker $j$ |
| $\phi$ | Underdispersion parameter |
| $\alpha_{ij}$, $\mu_{ij}$ | Shape and scale parameters for the detection rate prior |
| $k_i$, $\beta_i$ | Shape and rate parameters for the gamma detection model |
The original model assumed a Generalized Poisson distribution, since the vocalization counts are underdispersed with respect to the Poisson distribution. However, sampling from this distribution is more difficult, so the simulation offers two approximation schemes instead (illustrated in the sketch below):
- Poisson scheme: neglects the underdispersion of the count data
- Gamma scheme: better captures the true variance, but is only approximate for small counts
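The following sketch illustrates the two schemes on a single observation with NumPy. It mirrors the model described above but is not the package's implementation: the hyperparameter values (`mu`, `alpha`, `phi`) are made up for illustration, whereas the actual values for VTC and LENA are shipped with the package.

```python
import numpy as np

rng = np.random.default_rng(42)
speakers = ["CHI", "OCH", "FEM", "MAL"]

# True vocalization counts for one observation (n_j)
truth = np.array([120, 30, 200, 50])

# Illustrative hyperparameters of the detection-rate prior (mu_ij, alpha_ij)
mu = np.full((4, 4), 0.02) + np.diag([0.7, 0.5, 0.8, 0.6])
alpha = np.full((4, 4), 10.0)

# Draw detection/confusion rates: lambda_ij ~ Gamma(shape=alpha_ij, scale=mu_ij / alpha_ij)
lam = rng.gamma(shape=alpha, scale=mu / alpha)

# Expected detections attributed to speaker i: sum_j lambda_ij * n_j
expected = lam @ truth

# Poisson scheme: neglects the underdispersion of the counts
poisson_counts = rng.poisson(expected)

# Gamma scheme: continuous approximation with variance phi * mean (phi < 1),
# rounded to integer counts; phi is an illustrative underdispersion factor
phi = 0.8
gamma_counts = np.round(rng.gamma(shape=expected / phi, scale=phi)).astype(int)

for spk, p, g in zip(speakers, poisson_counts, gamma_counts):
    print(f"{spk}: poisson={p}, gamma={g}")
```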
## Installation

```bash
# Clone the repository
git clone https://github.com/LAAC-LSCP/diarization-simulation.git
cd diarization-simulation

# Install the package
pip install -e .
```

You will need Python 3.8+ to run this package. Key dependencies include:
- pandas
- numpy
- scipy
- numba
- tqdm
For the generation of synthetic ground-truth data, you will also need the following packages (a short installation sketch follows this list):
- `cmdstanpy` (see its installation instructions)
- ChildProject
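Both packages are available on PyPI; under that assumption, a typical setup might look like the sketch below (cmdstanpy also needs the CmdStan backend, which it can download itself).

```bash
pip install cmdstanpy ChildProject

# Download and build the CmdStan backend used by cmdstanpy
python -c "import cmdstanpy; cmdstanpy.install_cmdstan()"
```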
## Usage

The package can be used both programmatically (in Python scripts, notebooks, etc.) and via a command-line interface.
The main command-line interface can be accessed through `diarization-simulate`:
```bash
diarization-simulate --truth path/to/truth.csv \
    --output path/to/output.csv \
    --algo vtc \
    --samples 1000 \
    --distribution poisson
```

| Argument | Description | Default |
|---|---|---|
| `--truth` | Path to the synthetic truth dataset (in CSV format) | Required |
| `--output` | Location of the output file | Required |
| `--output-format` | Output file format (`csv`, `parquet`, or `npz`) | `csv` |
| `--algo` | Algorithm to simulate (`vtc` or `lena`) | Required |
| `--samples` | Number of samples per observation | 1000 |
| `--average-hyperpriors` | Use the mean value of the hyperpriors (mu and alpha) | False |
| `--unique-hyperpriors` | Use fixed hyperpriors (mu and alpha) throughout all samples | False |
| `--distribution` | Distribution for vocalization counts (`poisson` or `gamma`) | `poisson` |
| `--seed` | Random seed for reproducibility | None |
The input CSV must contain the following columns:
- `observation`: Unique identifier for each recording/observation
- `CHI`: Child vocalization count
- `OCH`: Other child vocalization count
- `FEM`: Female adult vocalization count
- `MAL`: Male adult vocalization count
Example:
```
observation,CHI,OCH,FEM,MAL
1,120,30,200,50
2,90,15,180,70
3,150,25,220,45
```

The output will contain the following columns:

- `sample`: Sample number (0 to `n_samples`-1)
- `observation`: Original observation identifier
- `CHI`: Simulated child vocalization detections
- `OCH`: Simulated other child vocalization detections
- `FEM`: Simulated female adult vocalization detections
- `MAL`: Simulated male adult vocalization detections
Example output:
```
sample,observation,CHI,OCH,FEM,MAL
0,1,118,28,195,52
0,2,87,16,175,73
0,3,145,23,215,48
1,1,122,31,198,49
1,2,92,14,182,68
...
```

## Synthetic Ground Truth Generation

The package includes a tool for generating synthetic ground truth data for a given corpus, using manual annotations to infer a realistic speech distribution. The target corpus must be compatible with the ChildProject Python package (which should also be installed).
```bash
truth-simulate --corpus path/to/corpus \
    --annotator annotation_set_name \
    --output path/to/ground_truth.csv \
    --samples 1000
```

| Argument | Description | Default |
|---|---|---|
| `--corpus` | Path to the input ChildProject corpus | Required |
| `--annotator` | Annotation set containing the manual annotations | Required |
| `--output` | Location of the output file | Required |
| `--recordings` | Path to a CSV dataframe containing the list of recordings | None |
| `--samples` | Number of samples to generate | 1000 |
| `--mode` | Sample from the mode of the posterior distribution of hyperparameters | False |
| `--show-distribution` | Show the marginal distribution of speech for each speaker according to the manual annotations | False |
The `truth-simulate` tool uses a Bayesian hierarchical model to infer vocalization rate distributions from sparse manual annotations and then generates complete ground truth datasets. The process works as follows (a rough sketch of the generative step follows the list):
- Load corpus data: Reads a ChildProject corpus containing recordings and manual annotations
- Extract annotation statistics: Counts vocalizations per speaker type (CHI, OCH, FEM, MAL) in manually annotated segments
- Fit hierarchical model: Uses Stan to fit a Gamma-Poisson model that estimates vocalization rates per speaker across the corpus
- Generate samples: Produces synthetic ground truth vocalization counts for all recordings in the corpus
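As a rough illustration of the Gamma-Poisson generative step (not the actual Stan model shipped with the package), the sketch below draws a per-recording vocalization rate for each speaker from a Gamma distribution and then a Poisson count from that rate. All hyperparameter values and recording durations are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
speakers = ["CHI", "OCH", "FEM", "MAL"]

# Illustrative corpus-level hyperparameters: average vocalizations per hour
# per speaker, and a concentration parameter controlling how much individual
# recordings deviate from the corpus average.
mean_rate = {"CHI": 120.0, "OCH": 25.0, "FEM": 200.0, "MAL": 55.0}
concentration = 5.0
durations_hours = [2.0, 8.0, 10.0]  # one entry per recording

rows = []
for r, hours in enumerate(durations_hours):
    row = {"recording_filename": f"recording_{r:03d}.wav"}
    for spk in speakers:
        # Gamma-distributed hourly rate, then a Poisson count over the duration
        rate = rng.gamma(shape=concentration, scale=mean_rate[spk] / concentration)
        row[spk] = int(rng.poisson(rate * hours))
    rows.append(row)

print(rows)
```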
The output CSV contains synthetic ground truth data with the following columns:
- `recording_filename`: Original recording filename
- `observation`: Unique identifier combining recording filename and sample number (e.g., "recording_001.wav,0")
- `CHI`: Simulated child vocalization count
- `OCH`: Simulated other child vocalization count
- `FEM`: Simulated female adult vocalization count
- `MAL`: Simulated male adult vocalization count
Example output:
```
recording_filename,observation,CHI,OCH,FEM,MAL
recording_001.wav,"recording_001.wav,0",145,23,198,67
recording_002.wav,"recording_002.wav,0",112,18,176,45
recording_001.wav,"recording_001.wav,1",138,25,203,72
recording_002.wav,"recording_002.wav,1",119,16,181,49
...
```

The output contains K×N rows, where K is the number of recordings and N the number of samples requested.
## Typical Workflow

A complete simulation workflow typically involves two steps:
- Generate ground truth from your corpus annotations:
```bash
truth-simulate --corpus /path/to/corpus \
    --annotator human_annotations \
    --output ground_truth.csv \
    --samples 100
```

- Simulate diarization on the generated ground truth:
```bash
diarization-simulate --truth ground_truth.csv \
    --output simulated_detections.csv \
    --algo vtc \
    --samples 100
```

The output `simulated_detections.csv` will contain 100×100×K rows, where K is the number of recordings in the dataset.
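As a quick sanity check (assuming the default CSV output format and the file names used above), you can verify that the number of rows matches this expectation:

```python
import pandas as pd

truth = pd.read_csv("ground_truth.csv")
detections = pd.read_csv("simulated_detections.csv")

# K recordings, 100 truth samples each, 100 diarization samples per observation
n_recordings = truth["recording_filename"].nunique()
print(len(detections), "rows; expected:", 100 * 100 * n_recordings)
```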
## Python API

```python
import pandas as pd
from diarization_simulation import simulate_diarization

# Create or load your truth data
truth_df = pd.DataFrame(
    {
        'observation': [1, 2, 3],
        'CHI': [120, 90, 150],
        'OCH': [30, 15, 25],
        'FEM': [200, 180, 220],
        'MAL': [50, 70, 45]
    }
)

# Simulate detections
results = simulate_diarization(
    truth_data=truth_df,
    algorithm="vtc",
    distribution="poisson",
    n_samples=1000,
    random_seed=42
)

print(f"Generated {len(results)} detection samples")
print(results.head())
```

Truth data can also be loaded from a file, and the simulation tuned through additional parameters:

```python
# Load your data
truth_data = pd.read_csv("my_truth_data.csv")

# Quick simulation for analysis
results = simulate_diarization(
    truth_data=truth_data,
    algorithm="vtc",
    n_samples=100,
    hyperprior_mode="unique",  # Same hyperpriors for all samples
    verbose=False  # Disable progress bar
)

# Analyze results
mean_detections = results.groupby('observation')[['CHI', 'OCH', 'FEM', 'MAL']].mean()
print("Mean detections per observation:")
print(mean_detections)
```

`simulate_diarization()` function parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| `truth_data` | str or DataFrame | Path to CSV file or pandas DataFrame with truth data | Required |
| `algorithm` | str | Algorithm to simulate (`"vtc"` or `"lena"`) | `"vtc"` |
| `distribution` | str | Distribution type (`"poisson"` or `"gamma"`) | `"poisson"` |
| `n_samples` | int | Number of samples to generate per observation | 1000 |
| `hyperprior_mode` | str | Hyperprior handling (`"sample"`, `"average"`, `"unique"`) | `"sample"` |
| `random_seed` | int or None | Random seed for reproducibility | None |
| `verbose` | bool | Show progress bar | True |
Hyperprior modes (compared in the sketch below):

- `"sample"`: Each sample gets its own hyperpriors (captures algorithm uncertainty)
- `"average"`: Use mean hyperprior values (reduced variance)
- `"unique"`: Same hyperpriors for all samples (minimal variance)
Example workflow:

```python
import pandas as pd
from diarization_simulation import simulate_diarization

# Load your ground truth data
truth_data = pd.read_csv("ground_truth.csv")

# Run simulations with different parameters
algorithms = ["vtc", "lena"]
distributions = ["poisson", "gamma"]

results = {}
for algo in algorithms:
    for dist in distributions:
        key = f"{algo}_{dist}"
        results[key] = simulate_diarization(
            truth_data=truth_data,
            algorithm=algo,
            distribution=dist,
            n_samples=1000,
            random_seed=42  # For reproducibility
        )

# Compare results
for key, result in results.items():
    correlation = result[['CHI', 'FEM']].corr().iloc[0, 1]
    print(f"{key}: CHI-FEM correlation = {correlation:.3f}")
```

## Citation

If you use this package, please mention both of the following references:
```bibtex
@online{diarization-simulation,
  author = {Gautheron, Lucas},
  year = {2025},
  title = {Diarization Simulation: A Python package for simulating speaker diarization with {LENA and VTC} from ground truth vocalization data},
  url = {https://github.com/LAAC-LSCP/diarization-simulation}
}

@misc{Gautheron2025,
  title = {Classification errors distort findings in automated speech processing: examples and solutions from child-development research},
  url = {http://dx.doi.org/10.31234/osf.io/u925y_v1},
  author = {Gautheron, Lucas and Kidd, Evan and Malko, Anton and Lavechin, Marvin and Cristia, Alejandrina},
  year = {2025}
}
```