Although an object may appear in numerous contexts, we often describe it in a limited number of ways. This happens because language abstracts away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach deviates from image-based contrastive learning by using language, rather than hand-crafted augmentations or learned clusters, to sample pairs. It also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than minimizing a cross-modal similarity. Through a series of experiments, we show that language-guided learning yields better features than both image-image and image-text representation learning approaches.
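To make the idea concrete, the following minimal PyTorch sketch illustrates language-guided sampling under stated assumptions; it is not the paper's implementation. Random tensors stand in for caption embeddings (which in practice would come from a pre-trained language model) and for image-encoder features; the helper names `language_guided_pairs` and `infonce` are hypothetical. Captions that are nearest neighbors in language-embedding space define positive image pairs, which are then scored with a standard InfoNCE contrastive loss.

```python
import torch
import torch.nn.functional as F

def language_guided_pairs(caption_emb: torch.Tensor) -> torch.Tensor:
    """For each caption embedding, return the index of its nearest
    neighbor (excluding itself) under cosine similarity. Images whose
    captions are nearest neighbors form a positive pair."""
    z = F.normalize(caption_emb, dim=-1)
    sim = z @ z.T                        # pairwise cosine similarity
    sim.fill_diagonal_(-float("inf"))    # exclude self-matches
    return sim.argmax(dim=-1)

def infonce(q: torch.Tensor, k: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE loss: q[i] and k[i] are a positive pair;
    all other k[j] in the batch act as negatives."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.T / tau
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    B, D_text, D_img = 256, 384, 128
    caption_emb = torch.randn(B, D_text)  # stand-in for pre-trained LM embeddings
    img_feat = torch.randn(B, D_img)      # stand-in for image-encoder features
    nn_idx = language_guided_pairs(caption_emb)
    loss = infonce(img_feat, img_feat[nn_idx])
    print(loss.item())
```

Note that, consistent with the abstract, the language model only selects which image pairs to contrast; the loss itself is computed purely between image features, with no cross-modal similarity term.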