With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently. Despite significant progress, two major challenges remain on the way towards robust MSA: 1) inefficiency when modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to the random missing of modality features, which typically occurs in realistic settings. In this paper, we propose a generic and unified framework to address both, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs utterance-level representations from each modality as the global multimodal context, which interacts with local unimodal features so that the two mutually reinforce each other. This not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also leads to better performance. To improve model robustness in the incomplete-modality setting, on the one hand, DLFR performs low-level feature reconstruction to implicitly encourage the model to learn semantic information from incomplete data. On the other hand, it innovatively regards complete and incomplete data as two different views of one sample and utilizes siamese representation learning to explicitly attract their high-level representations. Comprehensive experiments on three popular datasets demonstrate that our method achieves superior performance in both complete and incomplete modality settings.
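The dual-level restoration objective described above can be illustrated with a minimal sketch: a low-level reconstruction term (here mean squared error between reconstructed and original features) plus a high-level siamese attraction term (here negative cosine similarity between the representations of the complete and incomplete views). All function names and the loss weights `alpha`/`beta` are hypothetical stand-ins; the actual EMT-DLFR formulation (masking strategy, projection heads, stop-gradient, etc.) is given in the paper.

```python
import math

def reconstruction_loss(reconstructed, original):
    # Low-level term: MSE between reconstructed and original modality features
    # (illustrative; the paper defines its own reconstruction objective).
    return sum((r - o) ** 2 for r, o in zip(reconstructed, original)) / len(original)

def siamese_attraction_loss(h_complete, h_incomplete):
    # High-level term: negative cosine similarity pulls the representations
    # of the complete and incomplete views of one sample together.
    dot = sum(a * b for a, b in zip(h_complete, h_incomplete))
    n1 = math.sqrt(sum(a * a for a in h_complete))
    n2 = math.sqrt(sum(b * b for b in h_incomplete))
    return -dot / (n1 * n2)

def dlfr_loss(reconstructed, original, h_complete, h_incomplete,
              alpha=1.0, beta=1.0):
    # Hypothetical combined objective: alpha and beta are assumed
    # trade-off weights, not values from the paper.
    return (alpha * reconstruction_loss(reconstructed, original)
            + beta * siamese_attraction_loss(h_complete, h_incomplete))
```

For a perfectly reconstructed input and identical high-level views, the combined loss reaches its minimum of `-beta`, since the MSE term vanishes and the cosine term equals -1.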