Humans are sophisticated at reading interlocutors' emotions from multimodal signals, such as speech content, voice tone, and facial expressions. Machines, however, struggle to recognize diverse emotions because it is difficult to effectively decode them from the complex interactions among multimodal signals. In this paper, we propose a multimodal emotion analysis framework, InterMulti, which captures complex multimodal interactions from different views and identifies emotions from multimodal signals. The proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations: a modality-full interaction representation, a modality-shared interaction representation, and three modality-specific interaction representations. In addition, to balance the contributions of different modalities and learn a more informative latent interaction representation, we develop a novel Text-dominated Hierarchical High-order Fusion (THHF) module, which integrates the three kinds of representations into a comprehensive multimodal interaction representation. Extensive experiments on widely used datasets, i.e., MOSEI, MOSI, and IEMOCAP, demonstrate that our method outperforms state-of-the-art approaches.
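To make the decomposition-and-fusion idea above concrete, the following is a minimal, hedged sketch, not the authors' implementation: the encoder layers, the gating mechanism standing in for THHF, and all names (e.g. `InterMultiSketch`, `full_enc`, `gate`) are hypothetical stand-ins, assuming each modality (text, audio, vision) is already encoded as a fixed-size feature vector.

```python
# A minimal, illustrative sketch of the decomposition-and-fusion idea described
# in the abstract. NOT the authors' implementation: the encoders and the
# text-conditioned gating are assumed stand-ins for the actual THHF module.
import torch
import torch.nn as nn


class InterMultiSketch(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Hypothetical encoders producing the three kinds of interaction
        # representations named in the abstract.
        self.full_enc = nn.Linear(3 * dim, dim)    # modality-full interaction
        self.shared_enc = nn.Linear(3 * dim, dim)  # modality-shared interaction
        self.spec_enc = nn.ModuleDict(             # three modality-specific interactions
            {m: nn.Linear(dim, dim) for m in ("t", "a", "v")}
        )
        # Hypothetical text-dominated fusion: the text feature produces the
        # weights that combine all five representations.
        self.gate = nn.Linear(dim, 5)
        self.out = nn.Linear(dim, dim)

    def forward(self, t: torch.Tensor, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        concat = torch.cat([t, a, v], dim=-1)
        reps = [
            self.full_enc(concat),
            self.shared_enc(concat),
            self.spec_enc["t"](t),
            self.spec_enc["a"](a),
            self.spec_enc["v"](v),
        ]
        weights = torch.softmax(self.gate(t), dim=-1)  # text-dominated weighting
        fused = sum(w.unsqueeze(-1) * r for w, r in zip(weights.unbind(-1), reps))
        return self.out(fused)  # comprehensive multimodal interaction representation


if __name__ == "__main__":
    t, a, v = (torch.randn(2, 128) for _ in range(3))
    print(InterMultiSketch()(t, a, v).shape)  # torch.Size([2, 128])
```

The emotion classifier head and the hierarchical, high-order structure of the real THHF module are omitted here; the sketch only illustrates how a full, a shared, and three specific interaction representations could be combined under text-dominated weights.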