Learning modality-fused representations and processing unaligned multimodal sequences are important and challenging tasks in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse the language, visual, and audio modalities. However, these approaches introduce information redundancy when fusing features and are inefficient because they do not consider the complementarity of modalities. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction on the three modalities separately to capture the local structure of each sequence. Then, we design a novel transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities, organized into local temporal learning, cross-modal feature fusion, and global self-attention representations. In addition, we splice the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results show the superiority and efficiency of our proposed method in both settings. Compared with mainstream methods, our approach achieves state-of-the-art performance with the smallest number of parameters.
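To make the described pipeline concrete, the following is a minimal sketch, assuming PyTorch; the module names, feature dimensions, the choice of language as the query stream with visual and audio as context, and the mean-pooling readout are all illustrative assumptions and not the exact LMR-CBT architecture.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One illustrative cross-modal block: local temporal learning (Conv1d),
    cross-modal feature fusion (attention over the other modalities), and
    global self-attention over the fused sequence."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, context):
        # x, context: (batch, seq_len, dim); sequence lengths may differ (unaligned)
        h = self.local(x.transpose(1, 2)).transpose(1, 2)   # local temporal learning
        f, _ = self.cross_attn(h, context, context)          # fuse with other modalities
        f = self.norm1(h + f)
        g, _ = self.self_attn(f, f, f)                       # global self-attention
        return self.norm2(f + g)

class LMRCBTSketch(nn.Module):
    """Hypothetical end-to-end sketch: per-modality feature extraction,
    one cross-modal block, and classification on spliced features."""
    def __init__(self, dims=(300, 35, 74), dim=40, num_classes=4):
        super().__init__()
        # per-modality 1D convolutions extract the local structure of each sequence
        self.proj = nn.ModuleList(
            [nn.Conv1d(d, dim, kernel_size=3, padding=1) for d in dims])
        self.block = CrossModalBlock(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, lang, vis, aud):
        # inputs: (batch, seq_len_m, dim_m) with possibly unaligned lengths
        feats = [p(x.transpose(1, 2)).transpose(1, 2)
                 for p, x in zip(self.proj, (lang, vis, aud))]
        context = torch.cat(feats[1:], dim=1)      # visual + audio as cross-modal context
        fused = self.block(feats[0], context)      # language fused with the context
        # splice the fused features with the original (projected) language features
        pooled = torch.cat([fused.mean(dim=1), feats[0].mean(dim=1)], dim=-1)
        return self.classifier(pooled)

# usage with toy unaligned sequences (feature sizes here are only placeholders)
logits = LMRCBTSketch()(torch.randn(2, 50, 300),
                        torch.randn(2, 60, 35),
                        torch.randn(2, 70, 74))

The sketch keeps a single fused stream rather than pairwise attention between every modality pair, which is the efficiency argument the abstract makes; the real model's block ordering and parameterization should be taken from the paper itself.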