Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches.
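To make the sampling idea concrete, the following is a minimal sketch of language-guided pair sampling under some illustrative assumptions: captions are embedded with a frozen pre-trained sentence encoder, each image's most language-similar neighbor becomes its positive view, and a standard InfoNCE loss trains the image encoder on those pairs. The toy captions, the "all-MiniLM-L6-v2" checkpoint, and the SimCLR-style loss are assumptions for illustration, not the exact training setup.

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Toy captions standing in for a captioned image dataset (one per image).
captions = [
    "a brown dog running on the beach",
    "a dog sprints along the shoreline",
    "a red sports car parked downtown",
    "a crimson coupe on a city street",
]

# 1) Embed captions with a frozen pre-trained sentence encoder
#    (the "all-MiniLM-L6-v2" checkpoint is an illustrative assumption).
lm = SentenceTransformer("all-MiniLM-L6-v2")
z = F.normalize(torch.tensor(lm.encode(captions)), dim=-1)  # (N, d)

# 2) Pick each image's most language-similar *other* image as its positive
#    view, in place of hand-crafted augmentations or learned clusters.
sim = z @ z.T                       # cosine similarities between captions
sim.fill_diagonal_(float("-inf"))   # never pair an image with itself
positives = sim.argmax(dim=-1)      # index of each image's sampled pair

# 3) Train the image encoder on the sampled pairs with an InfoNCE loss.
def info_nce(feats_a, feats_b, temperature=0.1):
    """SimCLR-style loss: row i of feats_a and feats_b is a positive pair."""
    feats_a = F.normalize(feats_a, dim=-1)
    feats_b = F.normalize(feats_b, dim=-1)
    logits = feats_a @ feats_b.T / temperature  # (B, B) pairwise logits
    labels = torch.arange(len(feats_a))         # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

Note that in this sketch language enters only through pair sampling; the loss itself is computed purely between visual features, which is what distinguishes the setup from directly minimizing a cross-modal image-text loss.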