Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP trains both an image and a text encoder, while LiT manages to train just the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a comparatively modest number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here the model is defined by the embedded pairs of all the entries in the multimodal dataset, in addition to the parameters of the two encoders. Experiments on standard zero-shot visual benchmarks demonstrate the transfer ability typical of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.
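To make the idea concrete, the sketch below illustrates ASIF-style zero-shot classification under simplifying assumptions: `img_anchors` and `txt_anchors` stand for the image and text embeddings of the multimodal dataset (computed once with the frozen single-domain encoders), sparsification is a plain top-k with renormalization, and the function names and the value of k are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def relative_rep(z, anchors, k=16):
    """Sparse relative representation of embedding z: cosine similarities to
    the anchor embeddings, keeping only the top-k entries (others zeroed),
    then renormalized to unit length. (Illustrative sparsification choice.)"""
    sims = anchors @ z / (np.linalg.norm(anchors, axis=1) * np.linalg.norm(z) + 1e-8)
    out = np.zeros_like(sims)
    top = np.argsort(sims)[-k:]
    out[top] = sims[top]
    return out / (np.linalg.norm(out) + 1e-8)

def asif_zero_shot(image_emb, label_embs, img_anchors, txt_anchors, k=16):
    """Zero-shot classification without any multimodal training: compare the
    image's relative representation (w.r.t. the image anchors) with the
    relative representations of candidate label captions (w.r.t. the
    paired text anchors). Anchors come from the image-text pair dataset."""
    r_img = relative_rep(image_emb, img_anchors, k)
    r_txts = np.stack([relative_rep(t, txt_anchors, k) for t in label_embs])
    scores = r_txts @ r_img          # one score per candidate caption
    return int(np.argmax(scores)), scores
```

Because the anchors are just stored embeddings of the image-text pairs, updating the "model" amounts to adding or removing rows from `img_anchors` and `txt_anchors`; no gradient step is involved, which is the memory-versus-processing separation the abstract refers to.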