The wide adoption of smart devices has made multimodal data readily available for many tasks. In the field of multimodal sentiment analysis (MSA), most previous works focus on exploring intra- and inter-modal interactions. However, training a network on cross-modal information (language, visual, audio) remains challenging due to the modality gap, and existing methods cannot guarantee that intra- and inter-modal dynamics are sufficiently learned. Moreover, while learning dynamics within each sample has drawn great attention, the learning of inter-class relationships is neglected, and the limited size of existing datasets further constrains the generalization ability of current methods. To address these issues, we propose HyCon, a novel framework for hybrid contrastive learning of tri-modal representations. Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning (hence the name hybrid contrastive learning), with which the model can fully explore cross-modal interactions, preserve inter-class relationships, and reduce the modality gap. In addition, a refinement term is devised to prevent the model from falling into a sub-optimal solution. Furthermore, HyCon naturally generates a large number of training pairs, which improves generalization and mitigates the negative effect of limited datasets. Extensive experiments on public datasets demonstrate that our proposed method outperforms existing works.
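To make the inter-modal contrastive component concrete, the following is a minimal PyTorch sketch, not the paper's exact loss: it assumes three hypothetical unimodal encoders that map language, visual, and audio inputs into a shared embedding space, and applies an InfoNCE-style objective that treats the modalities of the same sample as positives and in-batch samples as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss between two batches of embeddings.

    Row i of `anchor` is pulled toward row i of `positive` and pushed away
    from every other row, which serves as an in-batch negative.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (B, B) similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

# Stand-ins for embeddings produced by hypothetical unimodal encoders
# with a shared embedding dimension D.
B, D = 32, 128
z_lang, z_vis, z_aud = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)

# Inter-modal terms: align all modality pairs of the same sample.
inter_modal_loss = (info_nce(z_lang, z_vis)
                    + info_nce(z_lang, z_aud)
                    + info_nce(z_vis, z_aud))

# An intra-modal term would contrast augmented views within one modality,
# e.g. info_nce(z_lang, z_lang_aug); omitted here for brevity.
```

Note that every pairing of in-batch samples yields an additional training pair, which is one way such a framework can enlarge the effective training set beyond the raw dataset size.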