The model is sensitive to the wider situation surrounding each utterance: it judges how a passage should sound from the text that precedes and follows it. This zoomed-out perspective lets it intonate longer fragments properly, applying a unifying emotional pattern to a train of thought that spans multiple sentences.
Here are a few tips for producing emotional output:
Context is vital for generating specific emotions: laughing or funny input text, for example, tends to yield a happy-sounding output. Setting the context is just as critical for anger, sadness, and other emotions.
Punctuation and voice settings influence how the output is delivered.
To add emphasis, put the relevant words or phrases in quotation marks.
For speech generated using a cloned voice, the output replicates the speaking style of the samples you upload for cloning. If the uploaded samples are monotone, the model will struggle to produce expressive output.
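The first three tips above are really text-preparation steps, so they can be automated before the text is sent to the model. The sketch below is a hypothetical helper (the function name and parameters are our own, not part of any real API): it prepends an optional context sentence to steer the emotion and wraps chosen phrases in quotation marks for emphasis.

```python
def build_emotive_prompt(text, context=None, emphasize=()):
    """Prepare text for TTS so the model infers the intended emotion.

    context:   optional sentence placed before the target text; since the
               model reads the surrounding text, a funny or angry lead-in
               biases the delivery accordingly.
    emphasize: words or phrases to wrap in quotation marks for stress.
    """
    # Wrap each emphasized phrase in quotation marks.
    for phrase in emphasize:
        text = text.replace(phrase, f'"{phrase}"')
    # Prepend the context sentence, if any, to set the emotional scene.
    return f"{context} {text}".strip() if context else text


prompt = build_emotive_prompt(
    "That was hilarious!",
    context="She burst out laughing.",
    emphasize=("hilarious",),
)
# prompt == 'She burst out laughing. That was "hilarious"!'
```

The resulting string would then be passed to the synthesis call as the input text; the context sentence is spoken too, so you may want to trim it from the generated audio afterwards.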
These tips improve the odds of producing the intended emotion, but they do not guarantee it. We will introduce features that allow emotions to be controlled directly within the text.