Multimodal representation learning is a challenging task in which previous work has mostly focused on either uni-modality pre-training or cross-modality fusion. We regard modeling multimodal representation as building a skyscraper, where laying a stable foundation and designing the main structure are equally essential. The former corresponds to encoding robust uni-modal representations, while the latter corresponds to integrating interactive information among different modalities; both are critical to learning an effective multimodal representation. Recently, contrastive learning has been successfully applied to representation learning; it can serve as the pillar of the skyscraper and help the model extract the most important features contained in multimodal data. In this paper, we propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation that captures intra- and inter-modality dynamics simultaneously. Specifically, we devise uni-modal contrastive coding with an efficient uni-modal feature augmentation strategy to filter the inherent noise in the acoustic and visual modalities and acquire more robust uni-modality representations. In addition, a pseudo-siamese network is presented to predict representations across different modalities, which successfully captures cross-modal dynamics. Moreover, we design two contrastive learning tasks, instance-based and sentiment-based contrastive learning, to promote the prediction process and learn more interactive information related to sentiment. Extensive experiments conducted on two public datasets demonstrate that our method surpasses state-of-the-art methods.
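As an illustration of the instance-based contrastive idea mentioned above, the following is a minimal sketch of an InfoNCE-style loss over paired embeddings from two modalities. The abstract does not state which loss MMCL uses, so the InfoNCE form, the temperature value, and the names `info_nce` and `_normalize` are assumptions made for this example only, not the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss for a batch of paired embeddings.

    anchors[i] and positives[i] are embeddings of the same sample taken
    from two different modalities; every other positives[j] in the batch
    acts as a negative for anchors[i]. The loss form and the default
    temperature are assumptions for illustration.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    losses = []
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability assigned to the true pair (index i).
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

Under these assumptions, the loss is small when each embedding is most similar to its own counterpart in the other modality, and grows when the pairing is broken, which is the behavior an instance-level contrastive objective is meant to reward.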