Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available under the following link: https://speechbot.github.io/emotion.
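To make the three-stage pipeline concrete, below is a minimal sketch in PyTorch. It is not the authors' implementation: all module interfaces (`UnitTranslator`, `ProsodyPredictor`, `Vocoder`), vocabulary size, and dimensions are hypothetical stand-ins, chosen only to illustrate the data flow of translating discrete content units to a target emotion, predicting prosody from the translated units, and vocoding the result.

```python
# Sketch of the pipeline described above, with hypothetical stand-in modules.
# Stage 1: translate phonetic-content units to the target emotion.
# Stage 2: predict prosodic features (here a per-unit F0 value) from those units.
# Stage 3: synthesize a waveform with a (stand-in) neural vocoder.
import torch
import torch.nn as nn

VOCAB = 100      # assumed size of the discrete content-unit vocabulary
N_EMOTIONS = 5   # assumed number of emotion classes

class UnitTranslator(nn.Module):
    """Hypothetical seq2seq stand-in: maps content units plus a target
    emotion to emotion-conditioned content units."""
    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.unit_emb = nn.Embedding(vocab, dim)
        self.emo_emb = nn.Embedding(N_EMOTIONS, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, units, emotion):
        x = self.unit_emb(units) + self.emo_emb(emotion)[:, None, :]
        h, _ = self.rnn(x)
        return self.out(h).argmax(-1)  # greedy decoding, for illustration only

class ProsodyPredictor(nn.Module):
    """Hypothetical predictor of an F0 contour from translated units."""
    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, 1)

    def forward(self, units):
        return self.proj(self.emb(units)).squeeze(-1)  # (batch, T)

class Vocoder(nn.Module):
    """Stand-in for a neural vocoder conditioned on units and F0;
    here it simply upsamples each unit to `hop` waveform samples."""
    def __init__(self, vocab=VOCAB, dim=64, hop=160):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.net = nn.Linear(dim + 1, hop)

    def forward(self, units, f0):
        x = torch.cat([self.emb(units), f0[..., None]], dim=-1)
        return self.net(x).flatten(1)  # (batch, T * hop) pseudo-waveform

# Usage: convert a dummy utterance of 50 content units to emotion class 3.
units = torch.randint(0, VOCAB, (1, 50))
target_emotion = torch.tensor([3])
translated = UnitTranslator()(units, target_emotion)
f0 = ProsodyPredictor()(translated)
waveform = Vocoder()(translated, f0)
print(waveform.shape)  # torch.Size([1, 8000])
```

Because the content-to-content translation step operates on discrete unit sequences rather than spectral frames, it can insert or delete units, which is what enables modeling non-verbal vocalizations such as laughter insertion or yawning removal.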