Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant knowledge fetched by a retriever from external memory (e.g., multimodal documents on the web). Specifically, we implement a retriever using the pretrained CLIP model and a generator using the CM3 Transformer architecture, and train this model using the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate mixtures of text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities such as knowledge-intensive image generation and multimodal in-context learning.
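To make the retriever-generator interplay concrete, here is a minimal sketch of the retrieval step using a frozen pretrained CLIP model, in the spirit of the architecture described above. The toy external memory, the top-k value, and the final prompt construction are illustrative assumptions; the actual RA-CM3 generator is a CM3 Transformer trained on LAION and is not reproduced here.

```python
# Sketch: CLIP-based retrieval over a small in-memory "external memory",
# with retrieved documents prepended to the query before generation.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Toy external memory; in practice this would be multimodal web documents
# (e.g., LAION image-text pairs), encoded once and stored in a vector index.
memory_docs = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "A golden retriever playing fetch in a park.",
    "The Great Wall of China stretches across northern China.",
]

def encode_text(texts):
    # Encode texts with CLIP and unit-normalize so dot products are cosine similarities.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

memory_embs = encode_text(memory_docs)

def retrieve(query, k=2):
    # Score every memory document against the query and return the top-k documents.
    query_emb = encode_text([query])
    scores = (query_emb @ memory_embs.T).squeeze(0)
    top = scores.topk(k).indices.tolist()
    return [memory_docs[i] for i in top]

query = "a photo of the Eiffel Tower at night"
retrieved = retrieve(query)

# The generator would condition on the retrieved documents followed by the query,
# analogous to prepending retrieved multimodal documents to the CM3 input sequence.
prompt = " ".join(retrieved) + " " + query
print(prompt)
```

In the full model, the memory holds multimodal documents (text plus images) rather than captions alone, and retrieval scores are computed with CLIP's joint text-image embedding space instead of text-only embeddings.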