We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image-and-text data. Our method leverages the abilities of language models learnt from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
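To make the architecture concrete, the sketch below illustrates the general idea of a frozen language model bridged to the visual domain by two trainable linear layers: one projecting image features into the language model's input embedding space, and one projecting language-model hidden states into a retrieval embedding space used to score candidate images. This is a minimal, hypothetical sketch, not the authors' implementation; it assumes a frozen visual encoder supplies image features, and names such as `image_dim`, `lm_dim`, and `retrieval_dim` are illustrative placeholders.

```python
import torch
import torch.nn as nn


class GroundedLM(nn.Module):
    """Sketch: frozen LM + frozen visual encoder, trainable linear bridges."""

    def __init__(self, language_model: nn.Module, visual_encoder: nn.Module,
                 image_dim: int, lm_dim: int, retrieval_dim: int):
        super().__init__()
        # Both pretrained backbones stay frozen; only the linear layers train.
        self.lm = language_model.eval()
        self.visual_encoder = visual_encoder.eval()
        for p in self.lm.parameters():
            p.requires_grad = False
        for p in self.visual_encoder.parameters():
            p.requires_grad = False

        # Input projection: image features -> LM token-embedding space,
        # so images can be spliced into interleaved image-and-text inputs.
        self.input_proj = nn.Linear(image_dim, lm_dim)
        # Output projection: LM hidden states -> retrieval embedding space,
        # used to score candidate images for contextual image retrieval.
        self.output_proj = nn.Linear(lm_dim, retrieval_dim)

    def embed_images(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Map images into "visual tokens" the frozen LM can consume.
        with torch.no_grad():
            feats = self.visual_encoder(pixel_values)
        return self.input_proj(feats)

    def retrieval_embedding(self, lm_hidden_state: torch.Tensor) -> torch.Tensor:
        # Map an LM hidden state (e.g. at a designated retrieval position)
        # into the space where it is compared against candidate image embeddings.
        return self.output_proj(lm_hidden_state)
```

At inference, the retrieval embedding would be compared (for example, by cosine similarity) against precomputed embeddings of a candidate image set, allowing generated text to be interleaved with retrieved images.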