Since CSM is about emotional intelligence, and the underlying Llama model generates the 32 audio codebooks based not just on the text to vocalize but also on the codebooks of the prior audio (to determine what prosody etc. is appropriate in the current context), doesn't it follow that fine-tuning should always include at least one prior turn as context?
The current lora.py script expects only a single speaker per entry, which seems to go against CSM's architecture. By fine-tuning it this way, aren't you making it less capable of sounding empathetic, the very thing it was built for?
If so, any thoughts on adding turn-taking to fine-tuning?
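To make the idea concrete, here is a minimal sketch of what a turn-aware training sample could look like. This is purely hypothetical and not the actual lora.py schema: the function name `build_sample`, the keys `speaker`/`text`/`audio`/`compute_loss`, and the idea of masking loss on the context turn are all assumptions about one possible design, where the prior turn supplies the codebook/prosody context and only the target turn contributes to the loss.

```python
# Hypothetical multi-turn sample builder (NOT the real lora.py format).
# The first turn is context only (loss masked), the second is the target.
def build_sample(context_turn, target_turn):
    """Pair a prior turn with the turn to be learned.

    Each turn is a dict with hypothetical keys: 'speaker' (int),
    'text' (str), and 'audio' (path to that turn's audio clip).
    """
    return {
        "turns": [
            {**context_turn, "compute_loss": False},  # prosody context only
            {**target_turn, "compute_loss": True},    # train on this turn
        ]
    }

sample = build_sample(
    {"speaker": 0, "text": "I'm so sorry to hear that.", "audio": "ctx.wav"},
    {"speaker": 1, "text": "Thanks, that means a lot.", "audio": "tgt.wav"},
)
```

Something along these lines would let the fine-tuned model keep conditioning on a prior speaker's audio, which seems closer to how CSM is used at inference time.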