Speech emotion recognition is a challenging research topic that plays a critical role in human-computer interaction. Multimodal inputs further improve performance because more emotional information is available. However, existing methods learn from all the information in a sample, although only a small portion of it relates to emotion. The redundant information acts as noise and limits system performance. In this paper, a key-sparse Transformer is proposed for efficient emotion recognition by focusing more on emotion-related information. The proposed method is evaluated on the IEMOCAP and LSSED datasets. Experimental results show that the proposed method achieves better performance than state-of-the-art approaches.