Previous vision-language pre-training models mainly construct multi-modal inputs from tokens and objects (or pixels) and then perform cross-modality interaction between them. We argue that inputs consisting only of tokens and object features limit high-level semantic alignment such as phrase-to-region grounding. Meanwhile, alignments at multiple levels are inherently consistent and can synergistically facilitate representation learning. Therefore, in this paper, we propose to learn Multi-level semantic alignment for Vision-language Pre-TRaining (MVPTR). In MVPTR, we follow the nested structure of both modalities to introduce concepts as high-level semantics. To ease learning from multi-modal, multi-level inputs, our framework is split into two stages: the first stage focuses on intra-modality multi-level representation learning, while the second enforces interactions across modalities via both coarse-grained and fine-grained semantic alignment tasks. In addition to the commonly used image-text matching and masked language modeling tasks, we introduce a masked concept recovering task in the first stage to enhance concept representation learning, and two more tasks in the second stage to explicitly encourage multi-level alignment across modalities. Our code is available at https://github.com/Junction4Nako/mvp_pytorch.
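To make the two-stage design concrete, the following is a minimal PyTorch-style sketch of how the described pre-training objectives could be combined in one forward pass. All module and head names (e.g. `txt_encoder`, `cross_encoder`, the `"mcr"` head) are hypothetical placeholders for illustration and are not taken from the released code.

```python
import torch.nn as nn


class MVPTRSketch(nn.Module):
    """Hypothetical sketch of a two-stage, multi-level alignment model.

    Stage 1: intra-modality encoders over multi-level inputs
             (tokens plus concepts on the text side, object/region
             features plus concepts on the vision side).
    Stage 2: a cross-modal encoder trained with coarse-grained and
             fine-grained semantic alignment objectives.
    """

    def __init__(self, txt_encoder, img_encoder, cross_encoder, heads):
        super().__init__()
        self.txt_encoder = txt_encoder      # stage 1, language side
        self.img_encoder = img_encoder      # stage 1, vision side
        self.cross_encoder = cross_encoder  # stage 2, cross-modal interaction
        self.heads = nn.ModuleDict(heads)   # task-specific heads

    def forward(self, tokens, txt_concepts, regions, img_concepts, labels):
        # Stage 1: intra-modality multi-level representations.
        txt_feats = self.txt_encoder(tokens, txt_concepts)
        img_feats = self.img_encoder(regions, img_concepts)

        # Stage 1 objectives: masked language modeling (MLM) and
        # masked concept recovering (MCR).
        loss = self.heads["mlm"](txt_feats, labels["mlm"])
        loss = loss + self.heads["mcr"](txt_feats, labels["mcr"])

        # Stage 2: cross-modal interaction with coarse-grained
        # image-text matching (ITM) and a fine-grained alignment task
        # such as phrase-to-region grounding.
        joint = self.cross_encoder(txt_feats, img_feats)
        loss = loss + self.heads["itm"](joint, labels["itm"])
        loss = loss + self.heads["grounding"](joint, labels["grounding"])
        return loss
```

This sketch only illustrates how the multi-level objectives might be summed into a single training loss; the actual task heads, concept vocabularies, and stage-wise optimization schedule are defined in the repository linked above.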