CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but it required training image and text encoders from scratch on a huge dataset. LiT improved on this by training only the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable, as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions about their data efficiency and about the role of retrieval in machine learning.
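To make the idea of a training-free common space concrete, here is a minimal sketch of the construction described above: each input is represented by its similarities to a small set of paired image-text anchors, so that an image and a caption become directly comparable. The random arrays standing in for encoder outputs, the number of anchors and classes, and the top-k sparsification step are illustrative assumptions for this sketch, not the exact procedure or hyperparameters of the paper.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize rows to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def relative_rep(queries, anchors, k=8):
    """Map embeddings into the common space: one coordinate per anchor,
    equal to the cosine similarity with that anchor. Keeping only the
    top-k similarities per row is an optional sparsification step
    (assumed here for illustration)."""
    sims = l2norm(queries) @ l2norm(anchors).T        # (n_queries, n_anchors)
    if k is not None:
        thresh = np.sort(sims, axis=1)[:, [-k]]       # k-th largest per row
        sims = np.where(sims >= thresh, sims, 0.0)    # zero out the rest
    return l2norm(sims)

# Stand-ins for the outputs of two frozen unimodal encoders on N paired
# samples (the small multimodal dataset) -- hypothetical random data.
rng = np.random.default_rng(0)
N = 1000
anchor_img_emb = rng.standard_normal((N, 512))   # frozen image encoder outputs
anchor_txt_emb = rng.standard_normal((N, 384))   # frozen text encoder outputs

# Zero-shot classification: embed class prompts and a test image, then
# compare them in the shared anchor-similarity space.
class_txt_emb = rng.standard_normal((10, 384))   # e.g. prompts "a photo of a {class}"
test_img_emb  = rng.standard_normal((1, 512))

txt_rel = relative_rep(class_txt_emb, anchor_txt_emb)   # (10, N)
img_rel = relative_rep(test_img_emb, anchor_img_emb)    # (1, N)

scores = img_rel @ txt_rel.T                            # (1, 10)
print("predicted class:", int(scores.argmax()))
```

Note how this sketch also reflects the deployment claim: updating the model with new training samples only means recomputing the anchor embeddings, with no gradient-based training involved, and each coordinate of the common space can be read as "how similar is this input to anchor i".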