Speech emotion recognition (SER) is a challenging and important research topic that plays a critical role in human-computer interaction. Multimodal inputs can improve performance, since more emotional information becomes available for recognition. However, existing studies learn from all the information in a sample, although only a small portion of it relates to emotion. Moreover, under the multimodal framework, the interaction between different modalities is shallow and insufficient. In this paper, a key-sparse Transformer is proposed for efficient SER by focusing only on emotion-related information. Furthermore, a cascaded cross-attention block, specially designed for the multimodal framework, is introduced to achieve deep interaction between different modalities. The proposed method is evaluated on the IEMOCAP corpus, and the experimental results show that it outperforms state-of-the-art approaches.
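To make the cross-modal interaction concrete, the following is a minimal sketch of one cross-attention step between two modalities, where queries come from one modality and keys/values from the other, applied in a cascaded fashion so the second step attends to already-updated features. The dimensions, the use of `nn.MultiheadAttention`, and the speech/text naming are all assumptions for illustration; the abstract does not specify the paper's exact block design.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Hypothetical cross-attention step between two modalities.

    One modality provides the queries, the other provides keys and
    values, so each modality attends directly to the other's features.
    This is a sketch, not the paper's actual architecture.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor,
                context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, len_q,  dim), e.g. speech frames
        # context_feats: (batch, len_kv, dim), e.g. text tokens
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual + layer norm

# Toy usage: speech attends to text, then text attends to the
# updated speech features (the "cascaded" aspect of the interaction).
speech = torch.randn(2, 100, 256)  # (batch, frames, dim), dummy features
text = torch.randn(2, 20, 256)     # (batch, tokens, dim), dummy features
block_s, block_t = CrossAttentionBlock(), CrossAttentionBlock()
speech = block_s(speech, text)
text = block_t(text, speech)       # second step sees the updated speech
```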