When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal, and how to best quantify or categorize the noisy, subjective emotion labels. Self-supervised pre-trained representations can robustly capture information from speech, enabling state-of-the-art results in many downstream tasks, including emotion recognition. However, better ways of aggregating the information across time need to be considered, as the relevant emotion information is likely to appear piecewise rather than uniformly across the signal. For the labels, we need to take into account the substantial degree of noise that comes from the subjective human annotations. In this paper, we propose a novel approach to attentive pooling based on correlations between the representations' coefficients, combined with label smoothing, a method that reduces the classifier's confidence in the training labels. We evaluate our proposed approach on the benchmark dataset IEMOCAP and demonstrate performance surpassing previously reported results. The code to reproduce the results is available at github.com/skakouros/s3prl_attentive_correlation.
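The two ingredients named in the abstract can be illustrated in isolation. The sketch below is not the paper's implementation (see the linked repository for that): `correlation_attentive_pooling` is a hypothetical, simplified variant that weights each frame by the correlation of its coefficients with the utterance-level mean representation, while `label_smoothing` is the standard formulation that redistributes a fraction `eps` of the target mass uniformly over the classes.

```python
import numpy as np

def label_smoothing(one_hot, eps=0.1):
    """Standard label smoothing: move eps probability mass from the
    target class to a uniform distribution over all k classes."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

def correlation_attentive_pooling(frames):
    """Hypothetical sketch of correlation-based attentive pooling.

    frames: (T, D) array of frame-level representation coefficients.
    Each frame is scored by its Pearson correlation with the mean
    representation; softmax over the scores gives attention weights.
    """
    mean_vec = frames.mean(axis=0)
    f = frames - frames.mean(axis=1, keepdims=True)   # center each frame
    m = mean_vec - mean_vec.mean()                    # center the mean vector
    corr = (f @ m) / (np.linalg.norm(f, axis=1) * np.linalg.norm(m) + 1e-8)
    weights = np.exp(corr) / np.exp(corr).sum()       # softmax over frames
    return weights @ frames                           # (D,) pooled embedding
```

With `eps=0.1` and four classes, a one-hot target `[1, 0, 0, 0]` becomes `[0.925, 0.025, 0.025, 0.025]`, so the cross-entropy loss no longer rewards arbitrarily confident predictions on possibly mislabeled utterances.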