Emotions are subjective constructs. Recent end-to-end speech emotion recognition systems achieve state-of-the-art performance but are typically agnostic to the subjective nature of emotions. In this work, we introduce an end-to-end Bayesian neural network architecture to capture the inherent subjectivity in emotions. To the best of our knowledge, this work is the first to use Bayesian neural networks for speech emotion recognition. During training, the network learns a distribution over weights to capture the inherent uncertainty in subjective emotion annotations. To this end, we introduce a loss term that enables the model to be trained explicitly on a distribution of emotion annotations, rather than exclusively on mean or gold-standard labels. We evaluate the proposed approach on the AVEC'16 emotion recognition dataset. Qualitative and quantitative analyses of the results reveal that the proposed model can aptly capture the distribution of subjective emotion annotations, with a trade-off between the mean and standard deviation estimates.
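As an illustration of what training on a distribution of annotations could look like, the following is a minimal sketch and not the loss used in this work. It assumes that, at each time frame, both the annotators' ratings and the model's stochastic predictions (one per sampled set of weights) are summarized as univariate Gaussians, and it penalizes the closed-form Kullback-Leibler divergence between the two. The function names, array shapes, and the Gaussian assumption are hypothetical choices made for this example.

```python
import numpy as np


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q, eps=1e-6):
    """Closed-form KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)), element-wise."""
    # Clamp standard deviations to avoid division by zero and log(0).
    sigma_p = np.maximum(sigma_p, eps)
    sigma_q = np.maximum(sigma_q, eps)
    return (
        np.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


def annotation_distribution_loss(annotations, predictions):
    """Average per-frame KL between annotation and prediction Gaussians.

    annotations: array of shape (n_annotators, n_frames), hypothetical layout.
    predictions: array of shape (n_weight_samples, n_frames), one row per
        stochastic forward pass of the network (assumed, not from the paper).
    """
    mu_a, sd_a = annotations.mean(axis=0), annotations.std(axis=0)
    mu_p, sd_p = predictions.mean(axis=0), predictions.std(axis=0)
    return float(gaussian_kl(mu_a, sd_a, mu_p, sd_p).mean())


# Synthetic data only: 6 hypothetical annotators, 50 frames.
rng = np.random.default_rng(0)
annotations = rng.normal(size=(6, 50))
loss_matched = annotation_distribution_loss(annotations, annotations)
loss_shifted = annotation_distribution_loss(annotations, annotations + 1.0)
```

In this sketch the loss is zero when the predicted mean and standard deviation match those of the annotations at every frame, and it grows as either statistic drifts, which is one way a single term can weigh mean and standard deviation errors against each other.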