Multimodal emotion recognition (MER) is a fundamentally complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities. Audio and text modalities are particularly important for a human participant in understanding emotions. Although many successful attempts have been made to design multimodal representations for MER, multiple challenges remain to be addressed: 1) bridging the heterogeneity gap between multimodal features and modelling the inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modelling the contextual dynamics of the conversation sequence. In this paper, we propose the Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and its corresponding transcripts. As the core unit of CM-RoBERTa, parallel self- and cross-attention is designed to dynamically capture inter- and intra-modal interactions between audio and text. Specifically, mid-level fusion and a residual module are employed to model long-term contextual dependencies and learn modality-specific patterns. We evaluate the approach on the MELD dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
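To illustrate the core idea of parallel self- and cross-attention with residual fusion, the following is a minimal PyTorch sketch. The module names, dimensions, and exact residual wiring are illustrative assumptions and are not the paper's exact implementation.

```python
# A minimal sketch of a parallel self- and cross-attention block for fusing
# audio and text features. Hyperparameters and layer wiring are assumptions.
import torch
import torch.nn as nn


class ParallelSelfCrossAttention(nn.Module):
    """Runs self-attention within one modality and cross-attention to the other
    modality in parallel, then fuses both views with a residual connection."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Intra-modal interactions: x attends to itself.
        self_out, _ = self.self_attn(x, x, x)
        # Inter-modal interactions: x queries the other modality.
        cross_out, _ = self.cross_attn(x, other, other)
        # Residual connection preserves modality-specific patterns of x.
        return self.norm(x + self_out + cross_out)


# Example usage: fuse text features (e.g. from RoBERTa) with audio features.
text_feats = torch.randn(2, 50, 768)    # (batch, text_len, dim)
audio_feats = torch.randn(2, 120, 768)  # (batch, audio_len, dim)
block = ParallelSelfCrossAttention()
fused_text = block(text_feats, audio_feats)   # text enriched by audio context
fused_audio = block(audio_feats, text_feats)  # audio enriched by text context
print(fused_text.shape, fused_audio.shape)
```

In this sketch, the self-attention branch captures intra-modal dependencies while the cross-attention branch captures inter-modal ones; summing them with the unattended input is one simple way to realise the residual, modality-preserving fusion described above.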