Replies: 1 comment
Appreciate the time you put into UltraSinger. I've been auditing my scripts for a potential PR and realized I built them too much for my own 'lab' and not enough for the public. I'm going through the code again to strip out the personal paths and generalize the logic. Coding is an art, and the 'polishing' phase always takes longer than people think, but I'm sticking with it until it's ready.
UltraSinger is a very promising project, but I have as yet failed to produce a usable output file. :( I do not have too much time to spend on this project, but I intend to allocate at least a bit of my spare time to it.
Here follows a list of what I would like to do, in the order I plan to implement it. If anyone has opinions on this, or (even better!) would like to help me work on it, please let me know.
I aim to do this first since it will help future development, and it feels like a fairly trivial thing to do: just making sure all intermediate data is actually saved in the cache, and making parts of the pipeline conditional on CLI arguments.
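To make that concrete, here is a rough sketch of the kind of helper I have in mind (the names cache_dir and run_pitch_detection are made up for illustration, not existing UltraSinger code):

```python
import json
from pathlib import Path

def cached_step(cache_dir: Path, name: str, compute):
    """Run compute() only if no cached result for `name` exists yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{name}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = compute()
    cache_file.write_text(json.dumps(result))
    return result

# Usage idea: skip the expensive pitch step on a second run.
# pitch_data = cached_step(Path("cache"), "pitch", lambda: run_pitch_detection(audio))
```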
Add an (optional) volume normalization step, once again as suggested by @RobUmf's findings. I'm harboring some healthy skepticism about how much this will actually help, since rescaling the volume will in theory at best keep quality intact and at worst lower it. But then again, the AI models might perform best when the volume is in a certain range, and correcting for this might compensate for any loss in quality. So it's definitely worth a try, and it should not be very hard to do.
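For a first experiment, plain peak normalization with numpy and soundfile should be enough to test whether the models care about the level at all (the 0.9 target is an arbitrary choice of mine):

```python
import numpy as np
import soundfile as sf

def normalize_peak(in_path: str, out_path: str, target_peak: float = 0.9):
    """Rescale the waveform so its absolute peak sits at target_peak."""
    data, rate = sf.read(in_path)
    peak = np.max(np.abs(data))
    if peak > 0:
        data = data * (target_peak / peak)
    sf.write(out_path, data, rate)
```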
Try to integrate crepe_notes, or a new algorithm inspired by crepe_notes, to improve pitching.
I think this step is the most likely to raise pitching quality. I'm convinced that the crepe_notes approach is sound. Their idea is to look for peaks in the derivative of the pitch to detect note changes (assuming a constant note is held between pitch changes), coupled with the inverse of the confidence score (basically, when crepe is not confident that a note is played at all, a note change is most likely happening). I have verified in stand-alone runs that crepe_notes does a much better job of generating a MIDI file whose notes correspond to the vocals than the current UltraSinger model does. I would not say it's perfect, but it's definitely as good as you could ever expect from an automated tool. (Some manual adjustments are always likely to be needed.)
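To illustrate how I read their approach (a toy sketch of the idea, not crepe_notes itself; the thresholds, the peak spacing, and the 10 Hz cents reference are my guesses):

```python
import numpy as np
from scipy.signal import find_peaks

def note_change_frames(pitch_hz, confidence, change_thresh=0.5):
    """Guess frames where a new note starts: peaks in the pitch
    derivative, boosted wherever crepe's confidence is low."""
    # Convert to cents so a semitone step has the same size everywhere.
    cents = 1200 * np.log2(np.maximum(pitch_hz, 1e-6) / 10.0)
    d = np.abs(np.gradient(cents))
    # Low confidence ("probably no note here") also signals a change.
    score = d / (d.max() + 1e-9) + (1.0 - confidence)
    peaks, _ = find_peaks(score, height=change_thresh, distance=5)
    return peaks
```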
However, this is also the hardest part for me. I don't really understand the current UltraSinger model, nor how I could plug crepe_notes into it. My understanding is that US currently looks at the output from the speech-to-text model to determine when words start and stop, and only looks at pitches in these intervals, presumably assuming that a single note is sung across the entire word unless proven otherwise?
At any rate, the words in the text and the MIDI information from pitching need to be aligned. I would guess that the notes output by crepe_notes will typically be shorter than the words output by the s2t model, since the pitch is probably not well defined at the start and end of a word. Something like this, I guess:
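```
s2t word:    |------------ word ------------|
pitch note:       |--------- C ---------|
```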
In this case, we should assume that the entire word is sung in C, and that the note should start and end with the word.
It might also be the case that the note starts or ends outside the word; then either the pitching or the speech-to-text is incorrect -- most likely the latter, because otherwise why would the pitcher pick up a note?
The upshot seems to be that we should align notes and words, and try to use the maximum extent of both.
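As a sketch, with word and note as (start, end) tuples in seconds (purely illustrative names, nothing from the current code base):

```python
def merge_word_and_note(word, note):
    """Take the union of a word interval and its overlapping note
    interval, i.e. the maximum extent of both."""
    (w_start, w_end), (n_start, n_end) = word, note
    return (min(w_start, n_start), max(w_end, n_end))

# merge_word_and_note((1.0, 2.0), (1.2, 2.3)) -> (1.0, 2.3)
```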
Then we have the issue of words where different syllables are sung at different notes, e.g. something like:
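```
s2t word:     |-------------- word --------------|
pitch notes:    |---- C ----|        |---- E ----|
```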
In this case I think we should assume that the note switch is immediate, and happens exactly in the middle of the gap where the first note ends and the second note begins. But I guess we also need to determine that this situation is actually at hand, and that the presence of an additional note within the word's frame is not just an artifact of bad timing. That is, the part of the note that overlaps with the word must be long enough to reasonably count as a separate note.
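A sketch of both rules (the 0.1 s minimum overlap is an arbitrary guess on my part, and would need tuning):

```python
def split_point(first_note_end, second_note_start):
    """Place the syllable boundary in the middle of the gap between
    the two detected notes."""
    return (first_note_end + second_note_start) / 2

def is_real_second_note(note, word, min_overlap=0.1):
    """Accept the extra note only if it overlaps the word long enough
    to plausibly be a separate note rather than a timing artifact."""
    (n_start, n_end), (w_start, w_end) = note, word
    return min(n_end, w_end) - max(n_start, w_start) >= min_overlap
```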
I have frankly been unable to understand how this splitting is done now. Does the speech-to-text model provide clues as to where the syllable boundaries are? Or is the word hyphenated according to some lexical rules?