Representation learning for speech emotion recognition is challenging due to the sparsity of labeled data and the lack of gold-standard references. In addition, there is considerable variability in the input speech signals, in humans' subjective perception of those signals, and in the ambiguity of emotion labels. In this paper, we propose a machine learning framework that obtains speech emotion representations by limiting the effect of speaker variability in the speech signals. Specifically, we propose to disentangle speaker characteristics from emotion through an adversarial training network in order to better represent emotion. Our method combines the gradient reversal technique with an entropy loss function to remove such speaker information. We evaluate our approach on both the IEMOCAP and CMU-MOSEI datasets and show that it improves speech emotion classification and generalizes better to unseen speakers.
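To make the disentanglement mechanism concrete, below is a minimal PyTorch sketch of a gradient reversal layer combined with an entropy-based confusion term. The names and sizes (EmotionSpeakerModel, the 256-unit encoder, the 0.1 weight on the entropy term) are illustrative assumptions, not the authors' exact architecture or hyperparameters.

```python
# A minimal sketch, assuming a PyTorch implementation; the paper's exact
# architecture, loss weighting, and hyperparameters may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def speaker_entropy_loss(logits):
    """Negative entropy of the speaker posterior; minimizing it pushes the
    predictions toward a uniform, speaker-uninformative distribution."""
    log_p = F.log_softmax(logits, dim=-1)
    return (log_p.exp() * log_p).sum(dim=-1).mean()

class EmotionSpeakerModel(nn.Module):
    # Hypothetical sizes; the encoder stands in for any acoustic feature network.
    def __init__(self, feat_dim=40, n_emotions=4, n_speakers=10, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.emotion_head = nn.Linear(256, n_emotions)
        self.speaker_head = nn.Linear(256, n_speakers)

    def forward(self, x):
        z = self.encoder(x)
        emo_logits = self.emotion_head(z)
        # Adversarial branch: the speaker head learns to identify speakers,
        # while the reversed gradient makes the encoder unlearn them.
        spk_logits = self.speaker_head(GradientReversal.apply(z, self.lambd))
        # Entropy branch: same head with detached weights, so the entropy
        # term below updates only the encoder, not the speaker classifier.
        spk_logits_ent = F.linear(z, self.speaker_head.weight.detach(),
                                  self.speaker_head.bias.detach())
        return emo_logits, spk_logits, spk_logits_ent

# One training step under these assumptions (0.1 is a hypothetical weight):
model = EmotionSpeakerModel()
x = torch.randn(8, 40)                  # batch of acoustic feature vectors
emo_y = torch.randint(0, 4, (8,))
spk_y = torch.randint(0, 10, (8,))
emo_logits, spk_logits, spk_logits_ent = model(x)
loss = (F.cross_entropy(emo_logits, emo_y)
        + F.cross_entropy(spk_logits, spk_y)           # reversed into encoder
        + 0.1 * speaker_entropy_loss(spk_logits_ent))  # confusion term
loss.backward()
```

Detaching the speaker head's weights in the entropy branch is one way to keep the confusion signal from degrading the speaker classifier itself; the paper may combine the two terms differently.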