Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single-stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked tokens, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn the semantic concepts essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, and visual question answering/reasoning against larger models and models trained on more data. Code and models are available at zaidkhan.me/SIMLA.
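The two pretraining tasks can be summarized loosely as follows. This is a minimal PyTorch-style sketch of the losses as described above, not the authors' implementation; all module and variable names (single-stream `encoder`, `visual_encoder`, `momentum_encoder`, masking ratio, etc.) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def xmm_loss(encoder, image_patches, text_tokens, mask_token_id, mask_prob=0.15):
    """Symmetric cross-modality reconstruction (XMM), text direction:
    mask a subset of text tokens and reconstruct them using image context.
    The symmetric image direction (mask patches, reconstruct from text)
    would mirror this with a patch-level reconstruction target."""
    masked_text = text_tokens.clone()
    mask = torch.rand_like(text_tokens, dtype=torch.float) < mask_prob
    masked_text[mask] = mask_token_id
    # Single-stream encoder jointly attends over patches and tokens,
    # so masked-token prediction can draw on cross-modal information.
    logits = encoder(image_patches, masked_text)              # (B, L, vocab)
    return F.cross_entropy(logits[mask], text_tokens[mask])

def psl_loss(visual_encoder, momentum_encoder, image_patches,
             caption_keyword_ids):
    """Pseudo-labeled keyword prediction (PSL): predict, from the image,
    keywords selected from the caption plus extra keywords recommended
    by a momentum encoder (soft pseudo-labels over the keyword vocab)."""
    with torch.no_grad():
        pseudo = torch.sigmoid(momentum_encoder(image_patches))  # (B, vocab)
    targets = pseudo.clone()
    # Keywords actually present in the caption become hard positives.
    targets.scatter_(1, caption_keyword_ids, 1.0)
    logits = visual_encoder(image_patches)                       # (B, vocab)
    return F.binary_cross_entropy_with_logits(logits, targets)
```

In a training loop, the two losses would simply be added (possibly weighted) to the global contrastive objective; the weighting and keyword-selection details here are assumptions, not taken from the paper.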