Multimodal sentiment analysis in videos is a key task in many real-world applications, and it usually requires integrating multimodal streams including visual, verbal, and acoustic behaviors. To improve the robustness of multimodal fusion, some existing methods let different modalities communicate with each other and model the crossmodal interaction via transformers. However, these methods only use single-scale representations during the interaction and fail to exploit multi-scale representations that contain different levels of semantic information. As a result, the representations learned by transformers can be biased, especially for unaligned multimodal data. In this paper, we propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis. On the whole, the "multi-scale" mechanism exploits the different levels of semantic information of each modality for fine-grained crossmodal interactions. Meanwhile, each modality learns its feature hierarchies by integrating the crossmodal interactions with multiple levels of features from its source modality. In this way, each pair of modalities progressively builds its feature hierarchies in a cooperative manner. Empirical results show that our MCMulT model not only outperforms existing approaches on unaligned multimodal sequences but also has strong performance on aligned multimodal sequences.
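
To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of one multi-scale crossmodal step: a target modality attends to several levels of a source modality's feature hierarchy and fuses the results. All class, variable, and hyperparameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiScaleCrossmodalBlock(nn.Module):
    """Target-modality features attend to every level of the source hierarchy."""

    def __init__(self, dim: int, num_heads: int = 4, num_levels: int = 3):
        super().__init__()
        # One crossmodal attention module per source-feature level (scale).
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        )
        self.fuse = nn.Linear(num_levels * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source_levels):
        # target:        (batch, len_t, dim), e.g. text features
        # source_levels: list of (batch, len_s, dim) tensors, one per scale;
        #                len_t and len_s may differ (unaligned sequences)
        outs = [
            attn(target, src, src)[0]  # query = target, key/value = one source level
            for attn, src in zip(self.attns, source_levels)
        ]
        fused = self.fuse(torch.cat(outs, dim=-1))  # combine all scales
        return self.norm(target + fused)            # residual connection


if __name__ == "__main__":
    block = MultiScaleCrossmodalBlock(dim=64)
    text = torch.randn(2, 20, 64)
    audio_levels = [torch.randn(2, 50, 64) for _ in range(3)]
    print(block(text, audio_levels).shape)  # torch.Size([2, 20, 64])
```

In the full architecture, such blocks would be applied in both directions for each modality pair, with each modality's hierarchy refined level by level, which is the cooperative behavior described above.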