Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate descriptive text for a given image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic, in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.
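To make the idea of combining a frozen visual-semantic model with a frozen language model more concrete, below is a minimal sketch. It is not the paper's actual inference procedure (which steers the language model's generation directly with the matching score); instead it shows a simplified variant under the assumption that the visual-semantic model is CLIP and the language model is GPT-2: candidate captions are sampled from the language model and the one that CLIP scores highest against the image is kept, with no training involved. The prompt string, candidate count, and helper name are illustrative choices, not taken from the paper.

```python
# Simplified zero-shot captioning sketch: sample candidate captions from a
# pretrained LM and rank them with a pretrained text-image matching score.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the two frozen, web-scale pretrained models (no further training or tuning).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
lm_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()


def zero_shot_caption(image_path: str, prompt: str = "Image of a",
                      n_candidates: int = 32) -> str:
    """Sample candidate captions from the LM and keep the one CLIP matches best."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

    # Sample several candidate continuations of the prompt from the language model.
    inputs = lm_tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = lm.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=12,
            num_return_sequences=n_candidates,
            pad_token_id=lm_tokenizer.eos_token_id,
        )
    candidates = [lm_tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

    # Score every candidate against the image with the contrastive matching model
    # and return the best-matching sentence as the caption.
    with torch.no_grad():
        image_feat = clip_model.encode_image(image)
        text_feat = clip_model.encode_text(
            clip.tokenize(candidates, truncate=True).to(device))
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        scores = (image_feat @ text_feat.T).squeeze(0)
    return candidates[scores.argmax().item()]
```

The same scoring mechanism extends naturally to the arithmetic setting described above: since both images and sentences map into the shared embedding space, a target embedding can be formed by adding and subtracting image or text embeddings, and candidate sentences can then be ranked against that combined target rather than against a single image.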