Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in NLP pre-training, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to reconstruct the masked tokens from the visible context, thereby learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, which limits the cross-modal alignment ability of the global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task completes the missing semantics of the masked data by capturing the corresponding information from the other modality, promoting the learning of more representative global features, which strongly affect the performance of downstream tasks. Moreover, we present a flexible vision encoder that enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method achieves state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
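The contrast between local-to-local reconstruction and the proposed global-to-local alignment can be illustrated with a toy sketch. This is not the paper's architecture: the helpers `mask_tokens` and `global_feature` and the mean-pooled embedding "encoder" are purely illustrative assumptions, standing in for real transformer encoders.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio=0.3, mask_id=0, seed=0):
    """Randomly replace a fraction of token ids with a [MASK] id (illustrative)."""
    rng = np.random.default_rng(seed)
    tokens = tokens.copy()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    idx = rng.choice(len(tokens), size=n_mask, replace=False)
    tokens[idx] = mask_id
    return tokens, idx

def global_feature(tokens, table):
    """Toy 'encoder': mean-pooled embeddings from a lookup table (assumption)."""
    return table[tokens].mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab, dim = 10, 4
rng = np.random.default_rng(1)
table = rng.normal(size=(vocab, dim))

text = np.array([3, 5, 7, 2, 8])
masked, idx = mask_tokens(text)

# Local-to-local (masked modeling): the reconstruction targets are the
# original ids at the masked positions, predicted from the visible context.
targets = text[idx]

# Global-to-local (SCL-style): additionally pull the global feature of the
# masked input toward the global feature of its paired counterpart, so the
# missing semantics are completed from the other modality.
g_masked = global_feature(masked, table)
g_pair = global_feature(text, table)
alignment = cosine(g_masked, g_pair)  # a training objective would maximize this
```

In a real VLP model, `g_pair` would come from the other modality's encoder (e.g. the image branch for masked text), and both the reconstruction loss over `targets` and the global alignment term would be optimized jointly.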