Contrastive learning is a form of distance metric learning that aims to learn invariant features from two related representations. In this paper, we explore the bold hypothesis that an image and its caption can simply be regarded as two different views of the underlying mutual information, and we train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify the difficulties of learning a generic one-tower model for vision-language pretraining (VLP), and propose OneR as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from previous works that learn modality-specific representation spaces, such as zero-shot object localization, text-guided visual reasoning, and multi-modal retrieval, and we present analyses that provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified, modality-agnostic VLP framework.
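The abstract does not spell out the implementation, but the core idea it describes, a single shared ("one-tower") encoder that maps both image patches and caption tokens into one representation space trained with a contrastive objective, can be illustrated with a rough sketch. The sketch below is not the authors' code: the module names, dimensions, and the symmetric InfoNCE loss are assumptions used purely for illustration of a modality-agnostic contrastive setup.

```python
# Minimal sketch (assumed, not the OneR implementation) of a single-tower,
# modality-agnostic contrastive objective: one shared Transformer encodes both
# image patches and text tokens, and a symmetric InfoNCE loss pulls matched
# image-caption pairs together in the unified space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneTowerEncoder(nn.Module):
    def __init__(self, dim=256, vocab_size=30522, patch_dim=3 * 16 * 16,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Modality-specific input embeddings only; the Transformer body is shared.
        self.patch_embed = nn.Linear(patch_dim, dim)    # flattened 16x16 RGB patches
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(dim, dim)                 # shared projection head

    def _encode(self, x):
        cls = self.cls.expand(x.size(0), -1, -1)
        h = self.shared_encoder(torch.cat([cls, x], dim=1))
        return F.normalize(self.proj(h[:, 0]), dim=-1)  # unit-norm [CLS] embedding

    def encode_image(self, patches):    # patches: (B, N_patches, patch_dim)
        return self._encode(self.patch_embed(patches))

    def encode_text(self, token_ids):   # token_ids: (B, N_tokens)
        return self._encode(self.token_embed(token_ids))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE: matched (image, caption) pairs are positives,
    # all other pairs in the batch serve as negatives.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    model = OneTowerEncoder()
    patches = torch.randn(8, 196, 3 * 16 * 16)       # 8 images, 14x14 patches
    tokens = torch.randint(0, 30522, (8, 32))        # 8 captions, 32 tokens each
    loss = contrastive_loss(model.encode_image(patches), model.encode_text(tokens))
    print(loss.item())
```

The key design point this sketch tries to capture is that, unlike two-tower VLP models, both modalities pass through the same encoder and projection head, so the learned space is shared rather than modality-specific.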