The goal of Speech Emotion Recognition (SER) is to enable computers to recognize the emotion category of a given utterance in the same way that humans do. The accuracy of SER strongly depends on the validity of the utterance-level representation obtained by the model. Nevertheless, the ``dark knowledge'' carried by non-target classes has largely been ignored by previous studies. In this paper, we propose a hierarchical network, called DKDFMH, which employs decoupled knowledge distillation in a deep convolutional neural network with a fused multi-head attention mechanism. Our approach applies logit distillation to obtain higher-level semantic features from different scales of attention sets and delves into the knowledge carried by non-target classes, thus guiding the model to focus more on the differences between emotional features. To validate the effectiveness of our model, we conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset and achieved competitive performance, with 79.1% weighted accuracy (WA) and 77.1% unweighted accuracy (UA). To the best of our knowledge, this is the first time since 2015 that logit distillation has returned to state-of-the-art performance.
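The decoupling that makes the non-target ``dark knowledge'' explicit follows decoupled knowledge distillation (Zhao et al., 2022), which splits the classical KD loss into a target-class term (TCKD) and a non-target-class term (NCKD) that can be weighted independently. Below is a minimal PyTorch sketch of that loss under the standard DKD formulation; the weights alpha and beta and the temperature are illustrative defaults, not the settings used in this paper.

```python
import torch
import torch.nn.functional as F

def dkd_loss(logits_s, logits_t, target, alpha=1.0, beta=8.0, temp=4.0):
    """Sketch of decoupled knowledge distillation (Zhao et al., 2022).

    Splits classical logit distillation into a target-class term (TCKD)
    and a non-target-class term (NCKD), so the 'dark knowledge' among
    non-target classes can be emphasized via beta.
    Hyperparameters here are illustrative, not this paper's settings.
    """
    num_classes = logits_s.size(1)
    one_hot = F.one_hot(target, num_classes).float()

    p_s = F.softmax(logits_s / temp, dim=1)
    p_t = F.softmax(logits_t / temp, dim=1)

    # TCKD: KL between binary (target vs. all non-target) distributions.
    b_s = torch.stack([(p_s * one_hot).sum(1), (p_s * (1 - one_hot)).sum(1)], 1)
    b_t = torch.stack([(p_t * one_hot).sum(1), (p_t * (1 - one_hot)).sum(1)], 1)
    tckd = F.kl_div(torch.log(b_s + 1e-8), b_t, reduction="batchmean") * temp ** 2

    # NCKD: KL over non-target classes only (target logit masked out
    # before the softmax so it receives negligible probability).
    nt_s = F.log_softmax(logits_s / temp - 1000.0 * one_hot, dim=1)
    nt_t = F.softmax(logits_t / temp - 1000.0 * one_hot, dim=1)
    nckd = F.kl_div(nt_s, nt_t, reduction="batchmean") * temp ** 2

    return alpha * tckd + beta * nckd
```

Setting beta above alpha is what lets the student attend to the relative probabilities the teacher assigns to the non-target emotion classes, rather than only matching its confidence in the target class.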