Since CSM is about emotional intelligence, and the underlying Llama model generates the 32 audio codebooks based not just on the text to vocalize but also on the codebooks of the prior audio (to determine what prosody etc. is appropriate in the current context), doesn't it follow that fine-tuning should always include at least one prior turn as context?
The current lora.py script expects only a single speaker per entry, which seems to go against CSM's architecture. By fine-tuning it this way, aren't you making it less capable of sounding empathetic, the very thing it was built for?
If so, any thoughts on adding turn-taking to fine-tuning?
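To make the idea concrete, here is a minimal sketch of what a turn-aware training sample could look like. This is purely hypothetical and not the actual lora.py schema: the function name `build_sample`, the keys `speaker`/`text`/`audio`/`compute_loss`, and the idea of masking loss on the context turn are all assumptions about one possible design, where the prior turn supplies the codebook/prosody context and only the target turn contributes to the loss.

```python
# Hypothetical multi-turn sample builder (NOT the real lora.py format).
# The first turn is context only (loss masked), the second is the target.
def build_sample(context_turn, target_turn):
    """Pair a prior turn with the turn to be learned.

    Each turn is a dict with hypothetical keys: 'speaker' (int),
    'text' (str), and 'audio' (path to that turn's audio clip).
    """
    return {
        "turns": [
            {**context_turn, "compute_loss": False},  # prosody context only
            {**target_turn, "compute_loss": True},    # train on this turn
        ]
    }

sample = build_sample(
    {"speaker": 0, "text": "I'm so sorry to hear that.", "audio": "ctx.wav"},
    {"speaker": 1, "text": "Thanks, that means a lot.", "audio": "tgt.wav"},
)
```

Something along these lines would let the fine-tuned model keep conditioning on a prior speaker's audio, which seems closer to how CSM is used at inference time.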