feat: expose speaker embeddings and subsegments in DiarizeResult by smm-h · Pull Request #4 · FoxNoseTech/diarize

smm-h · 2026-03-25T09:40:10Z

Summary

Expose speaker embeddings and subsegments from the diarization pipeline via an opt-in return_artifacts parameter, stored in a separate serializable DiarizeArtifacts model.

Changes

utils.py: Added DiarizeArtifacts model with embeddings: list[list[float]] and subsegments: list[SubSegment] — plain Python types, fully serializable, no numpy on the public API. Added artifacts: DiarizeArtifacts | None = None field on DiarizeResult.
__init__.py: Added return_artifacts: bool = False keyword argument to diarize(). When True, converts embeddings to nested lists via .tolist() and populates result.artifacts. When False (default), artifacts is None.

Motivation

Use case: cross-recording speaker clustering and identification. When processing multiple audio files, having access to the raw speaker embeddings allows users to cluster or match speakers across recordings — something that is not possible with just the segment labels.

Design decisions (addressing review feedback)

Opt-in: artifacts are only computed/stored when return_artifacts=True — zero cost at default.
Separate model: DiarizeArtifacts keeps internal pipeline data off the main DiarizeResult surface.
Serializable: embeddings are list[list[float]] (not numpy arrays), so model_dump(), JSON serialization, and equality all work cleanly. No arbitrary_types_allowed needed.
Fully backward-compatible: default behavior is identical to upstream main.
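The serialization and equality point can be seen outside the library with a small sketch. It uses only numpy, the standard json module, and plain ==, so it illustrates the general behaviour of arrays versus nested lists rather than how Pydantic handles either internally. The raw array is a stand-in for pipeline embeddings.

```python
import json

import numpy as np

raw = np.array([[0.1, 0.2], [0.3, 0.4]])  # stand-in for pipeline embeddings
as_lists = raw.tolist()  # nested Python lists of Python floats

# Nested lists compare to a single bool.
lists_equal = as_lists == [[0.1, 0.2], [0.3, 0.4]]

# Arrays compare element-wise, so the result has no single truth value.
try:
    bool(raw == raw.copy())
    array_eq_is_bool = True
except ValueError:
    array_eq_is_bool = False

# The standard json module rejects ndarrays but accepts nested lists.
try:
    json.dumps({"embeddings": raw})
    array_is_json_ready = True
except TypeError:
    array_is_json_ready = False

payload = json.dumps({"embeddings": as_lists})
```

Converting once with .tolist() at the API boundary keeps both problems out of the public result object.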

loookashow · 2026-05-04T09:01:56Z

Hey, @smm-h

The feature itself is useful, especially for advanced/debugging workflows. I’m supportive of exposing subsegments / embeddings, but not as raw ndarray fields on the default public result object. That couples internal pipeline artifacts to the main API and introduces serialization/equality regressions.

I’d suggest making this opt-in (for example return_artifacts=True) and/or exposing it through a separate serializable artifacts object or accessor, rather than storing raw numpy arrays directly on DiarizeResult.

Thank you.

smm-h · 2026-05-04T09:16:41Z

Thanks for the feedback @loookashow — good points on all counts.

I've revised the PR to address your concerns:

Opt-in: added a return_artifacts keyword argument to diarize(), defaulting to False. When False (the default), nothing changes: artifacts is None.
Separate serializable model: introduced DiarizeArtifacts with embeddings: list[list[float]] and subsegments: list[SubSegment]. All plain Python types — no numpy arrays on the public API, no arbitrary_types_allowed, and model_dump() / JSON serialization work cleanly.
No performance impact: .tolist() conversion only runs when opted in.

Let me know if this is more in line with what you had in mind, or if you'd prefer a different API shape.

loookashow · 2026-05-04T11:52:01Z

Thanks @smm-h, this API shape is much better.

I checked the updated branch locally:

model_dump_json() works with populated artifacts
equality works
tests pass when run against the PR source

A few small things before merge:

ruff currently fails on src/diarize/utils.py because the import block is not formatted.
Please add tests for return_artifacts=True, for the default case where artifacts is None, and for JSON serialization of populated artifacts.
Please update the docs/API reference for return_artifacts, DiarizeArtifacts, and DiarizeResult.artifacts.
Minor wording: “When False, nothing changes” is not strictly true for serialized output, since model_dump() now includes "artifacts": None.

After those are addressed, this looks good to me.

loookashow self-requested a review May 4, 2026 09:02

feat: expose speaker embeddings and subsegments via opt-in artifacts

6cd7bfc

smm-h force-pushed the main branch from 5ea110b to 6cd7bfc · May 4, 2026 09:16

smm-h wants to merge 1 commit into FoxNoseTech:main from smm-h:main
