1 change: 1 addition & 0 deletions README.md
@@ -20,6 +20,7 @@ In this endeavor, MacOS and metal support will be treated as the primary platfor
| [Parler TTS Large](https://huggingface.co/parler-tts/parler-tts-large-v1)|✓|✓|✓|[here](https://huggingface.co/mmwillet2/Parler_TTS_GGUF)|
| [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) |✓|✗|✓|[here](https://huggingface.co/mmwillet2/Kokoro_GGUF) |
| [Dia](https://github.com/nari-labs/dia) |✓|✓|✓|[here](https://huggingface.co/mmwillet2/Dia_GGUF) |
| [Orpheus](https://github.com/canopyai/Orpheus-TTS) |✓|✗|✗|[here](https://huggingface.co/mmwillet2/Orpheus_GGUF) |

Additional model support will initially be added based on open source model performance in both the [old TTS model arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena) and [new TTS model arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2), as well as the availability of those models' architectures and checkpoints.

33 changes: 26 additions & 7 deletions examples/cli/README.md
@@ -11,7 +11,7 @@ This simple example cli tool can be used to generate speech from a text prompt a

In order to get a detailed breakdown of the functionality currently available, you can call the cli with the `--help` parameter. This will return a breakdown of all parameters:
```bash
./cli --help
./tts-cli --help

--temperature (-t):
The temperature to use when generating outputs. Defaults to 1.0.
@@ -52,25 +52,44 @@ In order to get a detailed breakdown the functionality currently available you c
General usage should follow from these parameters. E.g., the following command will save generated speech to the file `/tmp/test.wav`:

```bash
./cli --model-path /model/path/to/gguf_file.gguf --prompt "I am saying some words" --save-path /tmp/test.wav
./tts-cli --model-path /model/path/to/gguf_file.gguf --prompt "I am saying some words" --save-path /tmp/test.wav
```

#### Dia Generation Arguments
#### Dia and Orpheus Generation Arguments

Currently the default cli arguments are not aligned with Dia's default sampling settings. Specifically the temperature and topk settings should be changed to `1.3` and `35` respectively when generating with Dia like so:
Currently the default cli arguments are not aligned with Dia's or Orpheus's default sampling settings. Specifically, the temperature and topk settings should be changed to `1.3` and `35` respectively when generating with Dia like so:

```base
./cli --model-path /model/path/to/Dia.gguf --prompt "[S1] Hi, I am Dia, this is how I talk." --save-path /tmp/test.wav --topk 35 --temperature 1.3
```bash
./tts-cli --model-path /model/path/to/Dia.gguf --prompt "[S1] Hi, I am Dia, this is how I talk." --save-path /tmp/test.wav --topk 35 --temperature 1.3
```

and the voice, temperature, and repetition penalty settings should be changed to a valid voice (e.g. `leah`), `0.7`, and `1.1` respectively when generating with Orpheus like so:

```bash
./tts-cli --model-path /model/path/to/Orpheus.gguf --prompt "Hi, I am Orpheus, this is how I talk." --save-path /tmp/test.wav --voice leah --temperature 0.7 --repetition-penalty 1.1
```


#### Conditional Generation

Conditional generation is a Parler TTS-specific behavior.

By default the Parler TTS model is saved to the GGUF format with a pre-encoded conditional prompt (i.e. a prompt used to determine how to generate speech), but if the text encoder model, the T5-Encoder model, is available in gguf format (see the [python conversion scripts](../../py-gguf/README.md) for more information on how to prepare the T5-Encoder model) then a new conditional prompt can be used for generation like so:

```bash
./cli --model-path /model/path/to/gguf_file.gguf --prompt "I am saying some words" --save-path /tmp/test.wav --text-encoder-path /model/path/to/t5_encoder_file.gguf --conditional-prompt "deep voice"
./tts-cli --model-path /model/path/to/gguf_file.gguf --prompt "I am saying some words" --save-path /tmp/test.wav --text-encoder-path /model/path/to/t5_encoder_file.gguf --conditional-prompt "deep voice"
```

#### Distinct Voice Support

Kokoro and Orpheus both support voices which can be set via the `--voice` (`-v`) argument. Orpheus supports the following voices:

```
"zoe", "zac","jess", "leo", "mia", "julia", "leah"
```

and Kokoro supports the voices listed in the section below.
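For scripting around the CLI, the Orpheus voice list above can be checked before launching a long generation run; a minimal Python sketch (the `validate_voice` helper is illustrative, not part of this repository):

```python
# Hypothetical pre-flight check mirroring the Orpheus voice list above.
ORPHEUS_VOICES = {"zoe", "zac", "jess", "leo", "mia", "julia", "leah"}


def validate_voice(voice: str) -> str:
    """Return the voice unchanged if Orpheus supports it, otherwise raise."""
    if voice not in ORPHEUS_VOICES:
        raise ValueError(
            f"unknown Orpheus voice {voice!r}; choose one of {sorted(ORPHEUS_VOICES)}"
        )
    return voice
```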

#### MultiLanguage Configuration

Kokoro supports multiple languages with distinct voices, and, by default, the standard voices are encoded in the Kokoro gguf file. Below is a list of the available voices:
2 changes: 1 addition & 1 deletion ggml
2 changes: 2 additions & 0 deletions include/common.h
@@ -18,12 +18,14 @@ enum tts_arch {
PARLER_TTS_ARCH = 0,
KOKORO_ARCH = 1,
DIA_ARCH = 2,
ORPHEUS_ARCH = 3,
};

const std::map<std::string, tts_arch> SUPPORTED_ARCHITECTURES = {
{ "parler-tts", PARLER_TTS_ARCH },
{ "kokoro", KOKORO_ARCH },
{ "dia", DIA_ARCH },
{ "orpheus", ORPHEUS_ARCH }
};

struct generation_configuration {
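The `orpheus` entry added to `SUPPORTED_ARCHITECTURES` is what lets the loader map the architecture string stored in a GGUF file's metadata onto the new runner. A Python mirror of that lookup, illustrative only (the real dispatch is the C++ map above):

```python
# Mirrors include/common.h: architecture string -> tts_arch enum value.
PARLER_TTS_ARCH, KOKORO_ARCH, DIA_ARCH, ORPHEUS_ARCH = 0, 1, 2, 3

SUPPORTED_ARCHITECTURES = {
    "parler-tts": PARLER_TTS_ARCH,
    "kokoro": KOKORO_ARCH,
    "dia": DIA_ARCH,
    "orpheus": ORPHEUS_ARCH,
}


def arch_from_string(name: str) -> int:
    """Resolve a GGUF architecture string, rejecting unknown architectures."""
    if name not in SUPPORTED_ARCHITECTURES:
        raise ValueError(f"unsupported architecture: {name!r}")
    return SUPPORTED_ARCHITECTURES[name]
```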
2 changes: 2 additions & 0 deletions include/tts.h
@@ -4,13 +4,15 @@
#include "parler_model.h"
#include "kokoro_model.h"
#include "dia_model.h"
#include "orpheus_model.h"
#include <thread>
#include <fstream>
#include <array>

struct tts_runner * parler_tts_from_file(gguf_context * meta_ctx, ggml_context * weight_ctx, int n_threads, generation_configuration * config, tts_arch arch, bool cpu_only);
struct tts_runner * kokoro_from_file(gguf_context * meta_ctx, ggml_context * weight_ctx, int n_threads, generation_configuration * config, tts_arch arch, bool cpu_only);
struct tts_runner * dia_from_file(gguf_context * meta_ctx, ggml_context * weight_ctx, int n_threads, generation_configuration * config, tts_arch arch, bool cpu_only);
struct tts_runner * orpheus_from_file(gguf_context * meta_ctx, ggml_context * weight_ctx, int n_threads, generation_configuration * config, tts_arch arch, bool cpu_only);
struct tts_runner * runner_from_file(const std::string & fname, int n_threads, generation_configuration * config, bool cpu_only = true);
int generate(tts_runner * runner, std::string sentence, struct tts_response * response, generation_configuration * config);
void update_conditional_prompt(tts_runner * runner, const std::string file_path, const std::string prompt, bool cpu_only = true);
21 changes: 21 additions & 0 deletions py-gguf/convert_orpheus_to_gguf
@@ -0,0 +1,21 @@
#!/usr/bin/env python3

import argparse
from tts_encoders.orpheus_gguf_encoder import OrpheusEncoder, DEFAULT_ORPHEUS_REPO_ID, DEFAULT_SNAC_REPO_ID
from os.path import isdir, dirname


def parse_arguments():
parser = argparse.ArgumentParser()
parser.add_argument("--save-path", type=str, required=True, help="The path to save the converted gguf tts model to.")
parser.add_argument("--repo-id", type=str, required=False, default=DEFAULT_ORPHEUS_REPO_ID, help="The Huggingface repository to pull the model from.")
parser.add_argument("--snac-repo-id", type=str, required=False, default=DEFAULT_SNAC_REPO_ID, help="The Huggingface repository to pull the snac audio decoder model from.")
parser.add_argument("--never-make-dirs", default=False, action="store_true", help="When set the script will never add new directories.")
return parser.parse_known_args()


if __name__ == '__main__':
args, _ = parse_arguments()
if not isdir(dirname(args.save_path)) and args.never_make_dirs:
raise ValueError(f"model path, {args.save_path} is not a valid path.")
    OrpheusEncoder(args.save_path, repo_id=args.repo_id, snac_repo_id=args.snac_repo_id).write()
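When `--never-make-dirs` is unset, the script above presumably leaves the encoder free to create a missing parent directory for `--save-path`; only the raising branch is written out. A hedged sketch of the full guard (the `ensure_parent_dir` helper is hypothetical, not part of the script):

```python
import os
from os.path import dirname, isdir


def ensure_parent_dir(save_path: str, never_make_dirs: bool = False) -> str:
    """Validate save_path's parent directory, creating it unless forbidden."""
    parent = dirname(save_path)
    if parent and not isdir(parent):
        if never_make_dirs:
            # Mirrors the script's error when directory creation is disallowed.
            raise ValueError(f"model path, {save_path} is not a valid path.")
        os.makedirs(parent, exist_ok=True)
    return save_path
```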
8 changes: 5 additions & 3 deletions py-gguf/requirements.txt
@@ -4,8 +4,8 @@ gguf==0.10.0
spacy==3.8.5
kokoro==0.9.4
huggingface-hub>=0.26.5
transformers>=4.43.3
parler_tts @ git+https://github.com/huggingface/parler-tts.git@8e465f1b5fcd223478e07175cb40494d19ffbe17
transformers>=4.46.0
parler_tts @ git+https://github.com/huggingface/parler-tts.git@d108732cd57788ec86bc857d99a6cabd66663d68
gguf==0.10.0
safetensors==0.5.3
groovy==0.1.2
@@ -14,5 +14,7 @@ gradio-client==1.10.0
llvmlite==0.44.0
numba==0.61.2
scipy>=1.15.2
snac==1.2.1
soundfile>=0.13.1
nari-tts @ git+https://github.com/nari-labs/dia.git@7cf50c889c6013f74326cbdcb7696a985a4cf9c1
nari-tts @ git+https://github.com/nari-labs/dia.git@2811af1c5f476b1f49f4744fabf56cf352be21e5
torchvision==0.21.0
1 change: 1 addition & 0 deletions py-gguf/tts_encoders/__init__.py
@@ -5,3 +5,4 @@
from .kokoro_gguf_encoder import *
from .dia_gguf_encoder import *
from .dac_gguf_encoder import *
from .orpheus_gguf_encoder import *
2 changes: 1 addition & 1 deletion py-gguf/tts_encoders/dia_gguf_encoder.py
@@ -82,7 +82,7 @@ def prepare_decoder_tensors(self):
elif parts[0] == "norm":
self.set_tensor(f"{base}.norm", param)
elif parts[0] == "logits_dense":
heads = param.shape[1];
heads = param.shape[1]
for i in range(heads):
head = param.data[:, i]
self.set_tensor(f"{base}.heads.{i}", head.transpose(0,1))
2 changes: 1 addition & 1 deletion py-gguf/tts_encoders/kokoro_gguf_encoder.py
@@ -96,7 +96,7 @@ class KokoroEncoder(TTSEncoder):
gguf_encoder.write()
```
"""
def __init__(self, model_path: Path | str = "./kokoro.gguf", repo_id: Path | str =DEFAULT_KOKORO_REPO,
def __init__(self, model_path: Path | str = "./kokoro.gguf", repo_id: Path | str = DEFAULT_KOKORO_REPO,
voices: Optional[List[str]] = None, use_espeak: bool = False,
phonemizer_repo: Path | str = DEFAULT_TTS_PHONEMIZER_REPO):
"""