Replies: 2 comments 8 replies
This is really fantastic work, thank you for both trying this experiment and explaining it so thoroughly! I know support for non-English languages has been a request for quite some time, but the lack of high-quality multi-speaker TTS models for many languages has also been a huge challenge. My assumption was that approaches like voice conversion or cross-lingual generation would lose too much detail/accuracy to be useful as training data, but I'm glad to see that this was not the case! I would be interested in adapting your approach to see how effective it can be with relatively small amounts of data, including both the number of speakers in the source language and the number of cloning target voice references. If these can be brought down enough to keep the cost and computation reasonably low, I think this would be another great example training notebook to include alongside the English-focused one (if you are interested in sharing the details of your code and approach, of course).
Hey @MzHub, thanks for your contribution! I'm currently building a personal assistant that is triggered by the word "Wil". The word has similar phonetics in both Spanish and English (not exactly the same, though), so it is good to know that people are looking into multi-language capabilities. We're facing the challenge of embedding a custom wake-word model for "Wil" into an ESP32. I came across the microWakeWord project, which looks promising, but it may fall short in terms of robustness for our use case. We're looking for someone who can help us train a reliable custom model and embed it efficiently into our microcontroller. If you or @dscripka are open to working together on this as freelancers, I'd love to connect and discuss further. Thanks again, and looking forward to hearing from you!
Disclaimer: I am not a machine learning expert nor a data scientist in any way.
The short version is the idea I had in my mind: generate samples of the wake word with a TTS model that speaks the target language, run them through voice conversion to get many different speakers, and train on the result.
I tried it, and the short version of the results is that it works. I got a good model in a non-English language. How good? I do not have numbers, as I have no idea what I'm doing, but it feels great and has not failed to detect any speaker yet.
I wanted to share this for a couple of reasons. First, if there are others as desperate for a non-English model as I am, you could try to replicate it. Second, if someone who knows what they're doing gets interested, you could try to formalize this a bit.
On to the long version of what I ended up doing:
TTS
- OpenAI offers `gpt-4o-mini-tts` in their API and a playground at openai.fm.
- `gpt-4o-mini-tts` supports multiple speakers, and I used them all, but I am unsure how much that affected the end result.
- `gpt-4o-mini-tts` has a few common failure modes though, or at least had at the time I used it.
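If it helps anyone replicate the TTS step, the generation can be sketched as a grid of request payloads. The voice names, placeholder wake word, and phrase variants below are my own assumptions, not settings from the post; each payload maps onto the documented parameters of OpenAI's `POST /v1/audio/speech` endpoint (`model`, `voice`, `input`, `response_format`).

```python
import itertools

# Voice names are an assumption -- check the current OpenAI audio API
# docs for the exact list available to gpt-4o-mini-tts.
VOICES = ["alloy", "ash", "coral", "echo", "fable", "onyx", "nova", "shimmer"]

WAKE_WORD = "hey computer"  # placeholder, not the author's actual wake word
# Light punctuation variation nudges the TTS toward different prosody.
VARIANTS = [WAKE_WORD, WAKE_WORD + ".", WAKE_WORD + "!", WAKE_WORD + "?"]

def build_jobs():
    """One request payload per (voice, variant) pair."""
    jobs = []
    for voice, text in itertools.product(VOICES, VARIANTS):
        jobs.append({
            "model": "gpt-4o-mini-tts",
            "voice": voice,
            "input": text,
            "response_format": "wav",
        })
    return jobs

jobs = build_jobs()
print(len(jobs))  # 8 voices x 4 variants = 32 payloads
```

With the official Python SDK, each payload would then be sent as `client.audio.speech.create(**job)`; in practice you would loop this many times per voice to reach a sample count like the one described.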
Voice conversion
- I tried `freevc24`, `knnvc`, `openvoice_v1` and `openvoice_v2`.
- I ended up dropping `knnvc`.
- I kept the remaining three (`freevc24`, `openvoice_v1`, `openvoice_v2`) and generate the final samples with the three of them.
- The TTS output goes in as `source_wav` and the speaker samples as `target_wav`.
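The `source_wav`/`target_wav` naming matches the Coqui TTS voice-conversion API, so here is a sketch assuming that library. The exact model identifier strings are my guesses (confirm them with `TTS().list_models()`), and the pairing of clips to speakers is a hypothetical helper, not the author's exact procedure.

```python
import itertools
import random

# Coqui-style model identifiers -- the exact strings are an assumption.
VC_MODELS = [
    "voice_conversion_models/multilingual/vcc2020/freevc24",
    "voice_conversion_models/multilingual/multi-dataset/openvoice_v1",
    "voice_conversion_models/multilingual/multi-dataset/openvoice_v2",
]

def plan_conversions(tts_clips, speaker_refs, seed=0):
    """Pair every TTS clip with a random target speaker, once per VC model."""
    rng = random.Random(seed)
    return [
        (model, clip, rng.choice(speaker_refs))
        for model, clip in itertools.product(VC_MODELS, tts_clips)
    ]

def convert(model_name, source_wav, target_wav, out_path):
    """Run one conversion: the TTS clip is source_wav, the speaker sample target_wav."""
    from TTS.api import TTS  # deferred import: heavy optional dependency
    vc = TTS(model_name)
    vc.voice_conversion_to_file(source_wav=source_wav,
                                target_wav=target_wav,
                                file_path=out_path)

plan = plan_conversions(["tts_000.wav", "tts_001.wav"], ["spk_a.wav", "spk_b.wav"])
print(len(plan))  # 3 models x 2 clips = 6 jobs
```

Running `convert(*job, out_path)` over the plan would produce the final, speaker-diverse samples described above.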
Training
- Most of the generated samples went into `positive_train`, and 20k into `positive_test`.
- Negative samples were split the same way into `negative_train` and `negative_test`.
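A split like the one described could look like the sketch below. The directory names come from the post, but the helper itself is hypothetical; only the held-out count is taken from the text.

```python
import random
import shutil
from pathlib import Path

def split_samples(src_dir, dest_dir, test_count, seed=0):
    """Shuffle the wav clips in src_dir and copy them into
    <dest_dir>/positive_train and <dest_dir>/positive_test,
    holding out test_count clips for testing."""
    clips = sorted(Path(src_dir).glob("*.wav"))
    random.Random(seed).shuffle(clips)
    splits = {
        "positive_test": clips[:test_count],
        "positive_train": clips[test_count:],
    }
    for name, files in splits.items():
        out = Path(dest_dir) / name
        out.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, out / f.name)
    return {name: len(files) for name, files in splits.items()}
```

The same function renamed for `negative_train`/`negative_test` would cover the negative data; seeding the shuffle keeps the split reproducible across runs.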
Questions I'm left with:
- With `gpt-4o-mini-tts` supporting multiple speakers, is the voice conversion even necessary? Although generating 100k samples will get more costly, and the error rate is an issue.