
voice clone for Dia-1.6B #107

@MuhammadNafishZaldinanda

Description

Currently, TTS.cpp supports inference for the Dia model via CLI, but it does not expose a way to perform voice cloning with an audio reference, as supported by the original Dia implementation in Python.

In the original Dia Python API, we can load an audio reference and its transcript to guide the voice characteristics of the generated speech:

from dia.model import Dia

# Load the pretrained Dia-1.6B checkpoint
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Transcript of the reference audio, using Dia's [S1]/[S2] speaker tags
clone_from_text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
clone_from_audio = "simple.mp3"

# New text to synthesize in the cloned voice
text_to_generate = "[S1] Hello, how are you? [S2] I'm good, thank you."

# The reference transcript is prepended to the new text so that the
# audio prompt lines up with the transcript portion of the input
output = model.generate(
    clone_from_text + text_to_generate,
    audio_prompt=clone_from_audio,
    use_torch_compile=True,
    verbose=True
)

model.save_audio("voice_clone.mp3", output)
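The key detail in the example above is how the input is assembled: the reference transcript comes first, immediately followed by the text to generate, and the audio prompt covers only the transcript prefix. A minimal sketch of that prompt assembly (the helper name is hypothetical, not part of the Dia API):

```python
def build_clone_prompt(clone_transcript: str, new_text: str) -> str:
    """Hypothetical helper illustrating Dia's voice-cloning input format.

    Dia conditions generation on the reference transcript concatenated
    directly ahead of the new text; the audio prompt aligns only with
    the transcript prefix.
    """
    if not clone_transcript.startswith("[S1]"):
        raise ValueError("reference transcript should start with a speaker tag")
    # Plain concatenation, matching the clone_from_text + text_to_generate
    # pattern in the Dia example above
    return clone_transcript + new_text
```

A CLI implementation in TTS.cpp could perform the same concatenation internally from the two proposed flags, so users would not need to paste the transcript and the new text together by hand.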

Feature request:
Add an inference option in TTS.cpp for the Dia model that allows:

  1. Loading an audio reference file (e.g., .mp3 / .wav).
  2. Providing the transcript of that reference audio.
  3. Generating new audio that mimics the reference voice, either from the CLI or through a programmatic API.

Proposed interface example (CLI):

./tts --model dia.gguf \
      --text "Hello, how are you?" \
      --clone-audio reference.mp3 \
      --clone-text "[S1] This is the transcript of the reference audio."


Labels: enhancement (New feature or request)
