feat: expose speaker embeddings and subsegments in DiarizeResult #4

Open
smm-h wants to merge 1 commit into FoxNoseTech:main from smm-h:main

Conversation

@smm-h smm-h commented Mar 25, 2026

Summary

Expose speaker embeddings and subsegments from the diarization pipeline via an opt-in return_artifacts parameter, stored in a separate serializable DiarizeArtifacts model.

Changes

  • utils.py: Added DiarizeArtifacts model with embeddings: list[list[float]] and subsegments: list[SubSegment] — plain Python types, fully serializable, no numpy on the public API. Added artifacts: DiarizeArtifacts | None = None field on DiarizeResult.
  • __init__.py: Added return_artifacts: bool = False keyword argument to diarize(). When True, converts embeddings to nested lists via .tolist() and populates result.artifacts. When False (default), artifacts is None.
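
A minimal sketch of the shape described above, assuming pydantic v2 and NumPy. The `SubSegment` fields (`start`, `end`) and the `build_result` helper are hypothetical stand-ins for illustration; the actual definitions in utils.py and the body of diarize() are not shown in this thread and may differ.

```python
from typing import Optional

import numpy as np
from pydantic import BaseModel


class SubSegment(BaseModel):
    # Hypothetical fields; the real SubSegment model may carry others.
    start: float
    end: float


class DiarizeArtifacts(BaseModel):
    # Plain Python types only, so no arbitrary_types_allowed is required.
    embeddings: list[list[float]]
    subsegments: list[SubSegment]


class DiarizeResult(BaseModel):
    # Only the new field is sketched; existing result fields are omitted.
    # Optional[...] is equivalent to the `DiarizeArtifacts | None` annotation.
    artifacts: Optional[DiarizeArtifacts] = None


def build_result(
    raw_embeddings: np.ndarray,
    subsegments: list[SubSegment],
    return_artifacts: bool = False,
) -> DiarizeResult:
    """Hypothetical helper mirroring the opt-in branch described for diarize()."""
    artifacts = None
    if return_artifacts:
        # .tolist() turns the ndarray into nested lists of Python floats,
        # and it only runs when the caller opted in.
        artifacts = DiarizeArtifacts(
            embeddings=raw_embeddings.tolist(),
            subsegments=subsegments,
        )
    return DiarizeResult(artifacts=artifacts)


raw = np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32)
subs = [SubSegment(start=0.0, end=1.5), SubSegment(start=1.5, end=3.0)]

default_result = build_result(raw, subs)
opted_in = build_result(raw, subs, return_artifacts=True)
```

With the default call, `default_result.artifacts` stays `None`; with `return_artifacts=True`, `opted_in.artifacts.embeddings` is a nested list of Python floats.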

Motivation

Use case: cross-recording speaker clustering and identification. When processing multiple audio files, having access to the raw speaker embeddings allows users to cluster or match speakers across recordings — something that is not possible with just the segment labels.
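
To illustrate the use case, here is a rough sketch of matching embeddings from two recordings by cosine similarity. It assumes each row of `embeddings` is one embedding vector (the thread does not say whether rows correspond to speakers or to subsegments), uses a simple greedy best-match rather than an optimal assignment, and the `0.7` threshold is an arbitrary illustrative value. The function names are hypothetical and not part of the library.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speakers(
    embeddings_a: list[list[float]],
    embeddings_b: list[list[float]],
    threshold: float = 0.7,
) -> list[Optional[int]]:
    """For each row of A, return the index of the most similar row of B,
    or None when no row of B reaches the threshold."""
    matches: list[Optional[int]] = []
    for emb_a in embeddings_a:
        best_index, best_score = None, threshold
        for j, emb_b in enumerate(embeddings_b):
            score = cosine_similarity(emb_a, emb_b)
            if score >= best_score:
                best_index, best_score = j, score
        matches.append(best_index)
    return matches


# Toy 2-D vectors standing in for result.artifacts.embeddings of two files.
recording_a = [[1.0, 0.0], [0.0, 1.0]]
recording_b = [[0.1, 0.9], [0.9, 0.1], [-1.0, 0.0]]

matches = match_speakers(recording_a, recording_b)
```

Because the artifacts are plain nested lists, this works without NumPy; a real workflow would more likely stack the rows into an array and use a proper clustering method.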

Design decisions (addressing review feedback)

  • Opt-in: artifacts are only computed/stored when return_artifacts=True — zero cost at default.
  • Separate model: DiarizeArtifacts keeps internal pipeline data off the main DiarizeResult surface.
  • Serializable: embeddings are list[list[float]] (not numpy arrays), so model_dump(), JSON serialization, and equality all work cleanly. No arbitrary_types_allowed needed.
  • Fully backward-compatible: default behavior is identical to upstream main.

@loookashow
Contributor

Hey, @smm-h

The feature itself is useful, especially for advanced/debugging workflows. I’m supportive of exposing subsegments / embeddings, but not as raw ndarray fields on the default public result object. That couples internal pipeline artifacts to the main API and introduces serialization/equality regressions.

I’d suggest making this opt-in (for example return_artifacts=True) and/or exposing it through a separate serializable artifacts object or accessor, rather than storing raw numpy arrays directly on DiarizeResult.

Thank you.

@loookashow loookashow self-requested a review May 4, 2026 09:02
@smm-h
Author

smm-h commented May 4, 2026

Thanks for the feedback @loookashow — good points on all counts.

I've revised the PR to address your concerns:

  • Opt-in: added a return_artifacts=True keyword argument to diarize(). When False (the default), nothing changes — artifacts is None.
  • Separate serializable model: introduced DiarizeArtifacts with embeddings: list[list[float]] and subsegments: list[SubSegment]. All plain Python types — no numpy arrays on the public API, no arbitrary_types_allowed, and model_dump() / JSON serialization work cleanly.
  • No performance impact: .tolist() conversion only runs when opted in.

Let me know if this is more in line with what you had in mind, or if you'd prefer a different API shape.

@loookashow
Contributor

Thanks @smm-h, this API shape is much better.

I checked the updated branch locally:

  • model_dump_json() works with populated artifacts
  • equality works
  • tests pass when run against the PR source

A few small things before merge:

  • ruff currently fails on src/diarize/utils.py because the import block is not formatted.
  • Please add tests for return_artifacts=True, default artifacts is None, and JSON serialization of populated artifacts.
  • Please update docs/API reference for return_artifacts, DiarizeArtifacts, and DiarizeResult.artifacts.
  • Minor wording: “When False, nothing changes” is not strictly true for serialized output, since model_dump() now includes "artifacts": None.
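
A possible shape for the requested model-level tests, sketched against stand-in models and assuming pydantic v2. The `SubSegment` fields are hypothetical, other `DiarizeResult` fields are omitted, and tests of diarize(..., return_artifacts=True) itself are not shown because they need audio input. The first test also demonstrates the wording point: the default dump now carries an `"artifacts"` key.

```python
from typing import Optional

from pydantic import BaseModel


class SubSegment(BaseModel):
    # Hypothetical fields; the real model may differ.
    start: float
    end: float


class DiarizeArtifacts(BaseModel):
    embeddings: list[list[float]]
    subsegments: list[SubSegment]


class DiarizeResult(BaseModel):
    # Stand-in with only the new field.
    artifacts: Optional[DiarizeArtifacts] = None


def test_default_artifacts_is_none():
    result = DiarizeResult()
    assert result.artifacts is None
    # The key is present in serialized output even when unset.
    assert result.model_dump() == {"artifacts": None}
    # Callers who want the old output shape can drop None fields explicitly.
    assert "artifacts" not in result.model_dump(exclude_none=True)


def test_populated_artifacts_round_trip_json():
    populated = DiarizeResult(
        artifacts=DiarizeArtifacts(
            embeddings=[[0.1, 0.2], [0.3, 0.4]],
            subsegments=[SubSegment(start=0.0, end=1.0)],
        )
    )
    restored = DiarizeResult.model_validate_json(populated.model_dump_json())
    # Plain list fields make equality well defined, unlike ndarray fields.
    assert restored == populated
```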

After those are addressed, this looks good to me.

2 participants