Emotions are subjective constructs. Recent end-to-end speech emotion recognition systems, despite their state-of-the-art performance, are typically agnostic to the subjective nature of emotions. In this work, we introduce an end-to-end Bayesian neural network architecture to capture the inherent subjectivity in the arousal dimension of emotional expressions. To the best of our knowledge, this work is the first to use Bayesian neural networks for speech emotion recognition. During training, the network learns a distribution over weights to capture the inherent uncertainty in subjective arousal annotations. To this end, we introduce a loss term that enables the model to be trained explicitly on a distribution of annotations, rather than exclusively on mean or gold-standard labels. We evaluate the proposed approach on the AVEC'16 dataset. Qualitative and quantitative analyses of the results reveal that the proposed model can aptly capture the distribution of subjective arousal annotations, achieving state-of-the-art results in mean and standard deviation estimation for uncertainty modeling.
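To make the idea of training on a distribution of annotations concrete, the following is a minimal sketch and not the architecture or loss proposed in this work. It assumes a single Bayesian linear layer with a Gaussian distribution over weights, Monte Carlo sampling to obtain a predictive mean and standard deviation, and a Gaussian KL divergence between the annotation distribution and the prediction as the loss term. All function names, shapes, and the choice of KL divergence are illustrative assumptions.

```python
import numpy as np


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Elementwise KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) )."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def mc_predict(x, w_mu, w_rho, n_samples, rng):
    """Monte Carlo passes through one Bayesian linear layer (hypothetical).

    x:     (T, d) frame-level features
    w_mu:  (d,) mean of the weight distribution
    w_rho: (d,) unconstrained scale; sigma = softplus(w_rho)
    Returns the per-frame predictive mean and standard deviation.
    """
    w_sigma = np.log1p(np.exp(w_rho))                  # softplus keeps sigma > 0
    eps = rng.standard_normal((n_samples,) + w_mu.shape)
    w = w_mu + w_sigma * eps                           # sampled weights, (n_samples, d)
    preds = w @ x.T                                    # (n_samples, T)
    return preds.mean(axis=0), preds.std(axis=0) + 1e-6


def annotation_distribution_loss(annotations, pred_mean, pred_std):
    """Assumed loss: KL from the annotators' per-frame Gaussian to the prediction.

    annotations: (n_annotators, T) arousal ratings from each annotator
    """
    ann_mean = annotations.mean(axis=0)
    ann_std = annotations.std(axis=0) + 1e-6
    return gaussian_kl(ann_mean, ann_std, pred_mean, pred_std).mean()


# Usage with synthetic data (sizes are arbitrary, not taken from AVEC'16).
rng = np.random.default_rng(0)
features = rng.standard_normal((5, 3))
ratings = rng.standard_normal((6, 5))
mean, std = mc_predict(features, np.zeros(3), np.zeros(3), n_samples=50, rng=rng)
loss = annotation_distribution_loss(ratings, mean, std)
```

Unlike a loss computed against the mean label alone, this term penalizes a mismatch in spread as well as in location, which is the property the abstract attributes to training on the annotation distribution.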