Speech emotion recognition (SER) is a key technology for enabling more natural human-machine communication. However, SER has long suffered from a lack of large-scale public labeled datasets. To address this problem, we investigate how unsupervised representation learning on unlabeled datasets can benefit SER. We show that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled datasets, which improves emotion recognition performance. In our experiments, this method achieved state-of-the-art concordance correlation coefficient (CCC) performance for all emotion primitives (activation, valence, and dominance) on IEMOCAP. Additionally, on the MSP-Podcast dataset, our method obtained considerable performance improvements over the baselines.
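For reference, CPC trains an encoder producing latents $z_t$ together with an autoregressive context model producing $c_t$, optimized with the InfoNCE objective; a standard formulation (following van den Oord et al., 2018, which may differ in details from the exact setup used here) is
\[
\mathcal{L}_N = -\,\mathbb{E}_{X}\!\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right],
\qquad f_k(x_{t+k}, c_t) = \exp\!\left(z_{t+k}^{\top} W_k\, c_t\right),
\]
where $W_k$ is a learned projection for prediction step $k$ and the set $X$ contains one positive sample $x_{t+k}$ and $N-1$ negative samples.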