The way humans encode emotion into speech signals is complex. For instance, an angry speaker may raise their pitch and speaking rate and use impolite words. In this paper, we present a preliminary study of various emotional factors and investigate how each of them impacts modern emotion recognition systems. The key tool of our study is the recently proposed SpeechFlow model, which allows us to decompose speech signals into separate information factors (content, pitch, rhythm). Based on this decomposition, we carefully study the emotion recognition performance of each information component and of their combinations. We conduct the study on three different speech emotion corpora and choose an attention-based convolutional RNN as the emotion classifier. Our results show that rhythm is the most important component for emotional expression. Moreover, cross-corpus performance is very poor (even below chance), indicating that current speech emotion recognition models generalize weakly across corpora. Interestingly, removing one or several unimportant components improves the cross-corpus results, demonstrating the potential of the decomposition approach toward generalizable emotion recognition.
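For concreteness, the following is a minimal sketch of the kind of attention-based convolutional RNN classifier referred to above, written in PyTorch. All layer sizes, the 4-class output, and the input feature dimension are illustrative assumptions, not the configuration used in this paper; the decomposed factors from SpeechFlow would be supplied as the input features.

```python
# Minimal sketch of an attention-based convolutional RNN emotion classifier.
# Hyperparameters here are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class AttnConvRNN(nn.Module):
    def __init__(self, n_feats=80, n_classes=4, hidden=128):
        super().__init__()
        # Convolutional front end over the acoustic feature axis.
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Recurrent layer models temporal dynamics of the utterance.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Attention pooling: score each frame, then take the weighted sum.
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):  # x: (batch, time, n_feats)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
        h, _ = self.rnn(h)                                # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)            # (batch, time, 1)
        pooled = (w * h).sum(dim=1)                       # (batch, 2*hidden)
        return self.out(pooled)

# Usage: a batch of 2 utterances, 200 frames, 80-dim features.
logits = AttnConvRNN()(torch.randn(2, 200, 80))
print(logits.shape)  # torch.Size([2, 4])
```

Ablating an information factor then amounts to zeroing out or omitting the corresponding slice of the input features before training the classifier.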