Currently, the performance of Speech Emotion Recognition (SER) systems is mainly constrained by the absence of large-scale labelled corpora. Data augmentation is regarded as a promising approach: existing methods typically borrow techniques from Automatic Speech Recognition (ASR), for instance speed and pitch perturbation, or generate emotional speech with generative adversarial networks. In this paper, we propose EmoAug, a novel style transfer model for augmenting emotion expressions, in which a semantic encoder and a paralinguistic encoder represent verbal and non-verbal information, respectively. A decoder then reconstructs speech signals by conditioning on these two information flows in an unsupervised fashion. Once trained, EmoAug enriches the expression of emotional speech across prosodic attributes such as stress, rhythm and intensity by feeding different styles into the paralinguistic encoder. In addition, we can generate a similar number of samples for each class to tackle the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug successfully transfers different speaking styles while retaining speaker identity and semantic content. Furthermore, we train an SER model on data augmented by EmoAug and show that it not only surpasses state-of-the-art supervised and self-supervised methods but also mitigates the overfitting caused by data imbalance. Audio samples can be found on our demo website.
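To make the described pipeline concrete, the following is a minimal sketch of the two-encoder, one-decoder reconstruction setup outlined above, assuming PyTorch, mel-spectrogram inputs, and GRU-based modules; all names, dimensions and layer choices are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal, illustrative sketch of a semantic encoder + paralinguistic encoder
# + decoder reconstruction model. All modules and dimensions are assumptions
# for illustration only, not the authors' implementation.
import torch
import torch.nn as nn

class EmoAugSketch(nn.Module):
    def __init__(self, n_mels=80, sem_dim=256, para_dim=128):
        super().__init__()
        # Semantic encoder: intended to capture verbal (linguistic) content.
        self.semantic_encoder = nn.GRU(n_mels, sem_dim, batch_first=True)
        # Paralinguistic encoder: intended to summarise non-verbal style
        # (stress, rhythm, intensity) as one utterance-level vector.
        self.paralinguistic_encoder = nn.GRU(n_mels, para_dim, batch_first=True)
        # Decoder reconstructs the spectrogram conditioned on both streams.
        self.decoder = nn.GRU(sem_dim + para_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, content_mel, style_mel):
        # content_mel, style_mel: (batch, time, n_mels)
        sem, _ = self.semantic_encoder(content_mel)        # frame-level content
        _, style = self.paralinguistic_encoder(style_mel)  # (1, batch, para_dim)
        style = style[-1].unsqueeze(1).expand(-1, sem.size(1), -1)
        dec, _ = self.decoder(torch.cat([sem, style], dim=-1))
        return self.out(dec)                               # reconstructed mel

# Unsupervised training: content and style come from the same utterance and the
# model learns to reconstruct it. At augmentation time, style_mel is taken from
# a different utterance so its prosodic style is transferred onto the content.
model = EmoAugSketch()
mel = torch.randn(4, 120, 80)
recon = model(mel, mel)
loss = nn.functional.l1_loss(recon, mel)
```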