Replies: 2 comments 8 replies
This is really fantastic work, thank you for both trying this experiment and explaining it so thoroughly! I know support for non-English languages has been a request for quite some time, but the lack of high-quality multi-speaker TTS models for many languages has also been a huge challenge. My assumption was that approaches like voice conversion or cross-lingual generation would lose too much detail/accuracy to be useful as training data, but I'm glad to see that this was not the case! I would be interested in adapting your approach to see how effective it can be with relatively small amounts of data, including both the number of speakers in the source language and the number of cloning target voice references. If these can be brought down enough to keep the cost and computation reasonably low, I think this would be another great example training notebook to include alongside the English-focused one (if you are interested in sharing the details of your code and approach, of course).
Hey @MzHub, thanks for your contribution! I'm currently building a personal assistant that is triggered by the word "Wil". The word has similar phonetics in both Spanish and English (not exactly the same, though), so it is good to know that people are looking into multi-language capabilities. We're facing the challenge of embedding a custom wake-word model for "Wil" into an ESP32. I came across the microWakeWord project, which looks promising, but it may fall short in terms of robustness for our use case. We're looking for someone who can help us train a reliable custom model and embed it efficiently into our microcontroller. If you or @dscripka are open to working together on this as freelancers, I'd love to connect and discuss further. Thanks again, and looking forward to hearing from you!
Disclaimer: I am not a machine learning expert nor a data scientist in any way.
The short version is the idea I had in my mind: generate samples of the wake word with a TTS model that speaks the target language, run them through voice conversion to get many different speakers, and train on the result.
I tried it, and the short version of the results is that it works. I got a good model in a non-English language. How good? I do not have numbers, as I have no idea what I'm doing, but it feels great and has not failed to detect any speaker yet.
I wanted to share this for a couple of reasons. First, if there are others as desperate for a non-English model as I am, you could try to replicate it. Second, if someone who knows what they're doing gets interested, you could try to formalize this a bit.
On to the long version of what I ended up doing:
TTS
- OpenAI offers `gpt-4o-mini-tts` in their API and a playground at openai.fm.
- `gpt-4o-mini-tts` supports multiple speakers, and I used them all, but I am unsure how much that affected the end result.
- `gpt-4o-mini-tts` has a few common failure modes though, or at least had at the time I used it.
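If it helps anyone replicate the TTS step, the generation can be sketched as a grid of request payloads. The voice names, placeholder wake word, and phrase variants below are my own assumptions, not settings from the post; each payload maps onto the documented parameters of OpenAI's `POST /v1/audio/speech` endpoint (`model`, `voice`, `input`, `response_format`).

```python
import itertools

# Voice names are an assumption -- check the current OpenAI audio API
# docs for the exact list available to gpt-4o-mini-tts.
VOICES = ["alloy", "ash", "coral", "echo", "fable", "onyx", "nova", "shimmer"]

WAKE_WORD = "hey computer"  # placeholder, not the author's actual wake word
# Light punctuation variation nudges the TTS toward different prosody.
VARIANTS = [WAKE_WORD, WAKE_WORD + ".", WAKE_WORD + "!", WAKE_WORD + "?"]

def build_jobs():
    """One request payload per (voice, variant) pair."""
    jobs = []
    for voice, text in itertools.product(VOICES, VARIANTS):
        jobs.append({
            "model": "gpt-4o-mini-tts",
            "voice": voice,
            "input": text,
            "response_format": "wav",
        })
    return jobs

jobs = build_jobs()
print(len(jobs))  # 8 voices x 4 variants = 32 payloads
```

With the official Python SDK, each payload would then be sent as `client.audio.speech.create(**job)`; in practice you would loop this many times per voice to reach a sample count like the one described.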
Voice conversion
- I tried `freevc24`, `knnvc`, `openvoice_v1` and `openvoice_v2`.
- I ended up dropping `knnvc`.
- I kept the remaining three (`freevc24`, `openvoice_v1`, `openvoice_v2`) and generate the final samples with the three of them.
- The TTS output goes in as `source_wav` and the speaker samples as `target_wav`.
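The `source_wav`/`target_wav` naming matches the Coqui TTS voice-conversion API, so here is a sketch assuming that library. The exact model identifier strings are my guesses (confirm them with `TTS().list_models()`), and the pairing of clips to speakers is a hypothetical helper, not the author's exact procedure.

```python
import itertools
import random

# Coqui-style model identifiers -- the exact strings are an assumption.
VC_MODELS = [
    "voice_conversion_models/multilingual/vcc2020/freevc24",
    "voice_conversion_models/multilingual/multi-dataset/openvoice_v1",
    "voice_conversion_models/multilingual/multi-dataset/openvoice_v2",
]

def plan_conversions(tts_clips, speaker_refs, seed=0):
    """Pair every TTS clip with a random target speaker, once per VC model."""
    rng = random.Random(seed)
    return [
        (model, clip, rng.choice(speaker_refs))
        for model, clip in itertools.product(VC_MODELS, tts_clips)
    ]

def convert(model_name, source_wav, target_wav, out_path):
    """Run one conversion: the TTS clip is source_wav, the speaker sample target_wav."""
    from TTS.api import TTS  # deferred import: heavy optional dependency
    vc = TTS(model_name)
    vc.voice_conversion_to_file(source_wav=source_wav,
                                target_wav=target_wav,
                                file_path=out_path)

plan = plan_conversions(["tts_000.wav", "tts_001.wav"], ["spk_a.wav", "spk_b.wav"])
print(len(plan))  # 3 models x 2 clips = 6 jobs
```

Running `convert(*job, out_path)` over the plan would produce the final, speaker-diverse samples described above.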
Training
- Most of the generated samples went into `positive_train`, and 20k into `positive_test`.
- Negative samples were split the same way into `negative_train` and `negative_test`.
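A split like the one described could look like the sketch below. The directory names come from the post, but the helper itself is hypothetical; only the held-out count is taken from the text.

```python
import random
import shutil
from pathlib import Path

def split_samples(src_dir, dest_dir, test_count, seed=0):
    """Shuffle the wav clips in src_dir and copy them into
    <dest_dir>/positive_train and <dest_dir>/positive_test,
    holding out test_count clips for testing."""
    clips = sorted(Path(src_dir).glob("*.wav"))
    random.Random(seed).shuffle(clips)
    splits = {
        "positive_test": clips[:test_count],
        "positive_train": clips[test_count:],
    }
    for name, files in splits.items():
        out = Path(dest_dir) / name
        out.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, out / f.name)
    return {name: len(files) for name, files in splits.items()}
```

The same function renamed for `negative_train`/`negative_test` would cover the negative data; seeding the shuffle keeps the split reproducible across runs.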
Questions I'm left with:
- With `gpt-4o-mini-tts` supporting multiple speakers, is the voice conversion even necessary? Although generating 100k samples will get more costly, and the error rate is an issue.