Existing vision-language pre-training (VLP) methods primarily rely on paired image-text datasets, which are either annotated with enormous human labor or crawled from the internet and then cleaned with elaborate procedures. To reduce the dependency on well-aligned image-text pairs, it is promising to directly leverage large-scale text-only and image-only corpora. This paper proposes a data augmentation method, namely cross-modal CutMix (CMC), for implicit cross-modal alignment learning in unpaired VLP. Specifically, CMC transforms natural sentences from the textual view into a multi-modal view, where visually-grounded words in a sentence are randomly replaced by diverse image patches with similar semantics. The proposed CMC has several appealing properties. First, it enhances data diversity while keeping the semantic meaning intact, which helps in settings where aligned data are scarce. Second, by attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising. Furthermore, we present a new unpaired VLP method, dubbed VLMixer, that integrates CMC with contrastive learning to pull together the uni-modal and multi-modal views for better instance-level alignment across modalities. Extensive experiments on five downstream tasks show that VLMixer surpasses previous state-of-the-art unpaired VLP methods.
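To illustrate the token-replacement idea behind CMC, the following is a minimal, hypothetical sketch: visually-grounded words are randomly swapped for image-patch tokens drawn from a semantically matched bank. The names `patch_bank`, `replace_prob`, and the `<patch:...>` identifiers are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of cross-modal CutMix (CMC): replace visually-grounded
# words in a text-only sentence with semantically similar image patches.
import random
from typing import Dict, List

def cross_modal_cutmix(
    tokens: List[str],
    patch_bank: Dict[str, List[str]],   # word -> candidate image-patch ids with similar semantics (assumed lookup)
    replace_prob: float = 0.25,
) -> List[str]:
    """Turn a textual view into a multi-modal view by random word-to-patch substitution."""
    mixed = []
    for tok in tokens:
        candidates = patch_bank.get(tok)
        if candidates and random.random() < replace_prob:
            # Token-level cross-modal noise: swap the word for one of its grounded image patches.
            mixed.append(random.choice(candidates))
        else:
            mixed.append(tok)
    return mixed

# Usage: "a dog runs on the grass" might become
# ["a", "<patch:dog_17>", "runs", "on", "the", "<patch:grass_03>"]
sentence = "a dog runs on the grass".split()
bank = {"dog": ["<patch:dog_17>", "<patch:dog_42>"], "grass": ["<patch:grass_03>"]}
print(cross_modal_cutmix(sentence, bank, replace_prob=0.5))
```

The random substitution keeps the sentence semantics while injecting cross-modal tokens, which is what allows contrastive learning in VLMixer to pull the uni-modal and multi-modal views together.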