
voice clone for Dia-1.6B #107

@MuhammadNafishZaldinanda

Description

Currently, TTS.cpp supports inference for the Dia model via CLI, but it does not expose a way to perform voice cloning with an audio reference, as supported by the original Dia implementation in Python.

In the original Dia Python API, we can load an audio reference and its transcript to guide the voice characteristics of the generated speech:

from dia.model import Dia

# Load the pretrained Dia-1.6B checkpoint
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

# Transcript of the reference audio, using Dia's [S1]/[S2] speaker tags
clone_from_text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
clone_from_audio = "simple.mp3"

# New text to synthesize in the cloned voice
text_to_generate = "[S1] Hello, how are you? [S2] I'm good, thank you."

# The reference transcript is prepended to the new text so that the
# audio prompt lines up with the transcript portion of the input
output = model.generate(
    clone_from_text + text_to_generate,
    audio_prompt=clone_from_audio,
    use_torch_compile=True,
    verbose=True
)

model.save_audio("voice_clone.mp3", output)
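The key detail in the example above is how the input is assembled: the reference transcript comes first, immediately followed by the text to generate, and the audio prompt covers only the transcript prefix. A minimal sketch of that prompt assembly (the helper name is hypothetical, not part of the Dia API):

```python
def build_clone_prompt(clone_transcript: str, new_text: str) -> str:
    """Hypothetical helper illustrating Dia's voice-cloning input format.

    Dia conditions generation on the reference transcript concatenated
    directly ahead of the new text; the audio prompt aligns only with
    the transcript prefix.
    """
    if not clone_transcript.startswith("[S1]"):
        raise ValueError("reference transcript should start with a speaker tag")
    # Plain concatenation, matching the clone_from_text + text_to_generate
    # pattern in the Dia example above
    return clone_transcript + new_text
```

A CLI implementation in TTS.cpp could perform the same concatenation internally from the two proposed flags, so users would not need to paste the transcript and the new text together by hand.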

Feature request:
Add an inference option in TTS.cpp for the Dia model that allows:

  1. Loading an audio reference file (e.g., .mp3 / .wav).
  2. Providing the transcript of that reference audio.
  3. Generating new audio that mimics the reference voice, either from the CLI or through a programmatic API.

Proposed interface example (CLI):

./tts --model dia.gguf \
      --text "Hello, how are you?" \
      --clone-audio reference.mp3 \
      --clone-text "[S1] This is the transcript of the reference audio."


Labels: enhancement (New feature or request)
