While language models store a massive amount of world knowledge implicitly in their parameters, even very large models often fail to encode information about rare entities and events, while incurring huge computational costs. Recently, retrieval-augmented models, such as REALM, RAG, and RETRO, have incorporated world knowledge into language generation by leveraging an external non-parametric index and have demonstrated impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images -- much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language generation. MuRAG is pre-trained with a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. We perform experiments on two different datasets that require retrieving and reasoning over both images and text to answer a given query: WebQA and MultimodalQA. Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20\% absolute on both datasets and under both distractor and full-wiki settings.
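To make the training objective concrete, the following is a minimal sketch of what a joint contrastive and generative loss for retrieval-augmented generation over a multimodal memory could look like. It is not the authors' implementation; the function name `joint_loss`, the tensor shapes, and the assumption that the two terms are summed with equal weight are all illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): a joint contrastive + generative
# objective for a retriever-reader model over a multimodal memory.
import torch
import torch.nn.functional as F

def joint_loss(query_emb, memory_emb, pos_idx, gen_logits, target_ids, pad_id=0):
    """
    query_emb:  (B, D)    query encodings (question text plus any query image)
    memory_emb: (M, D)    encodings of multimodal memory entries (image-text pairs, passages)
    pos_idx:    (B,)      index of the gold memory entry for each query
    gen_logits: (B, T, V) decoder logits produced after reading the retrieved entries
    target_ids: (B, T)    gold answer token ids
    """
    # Contrastive retrieval term: score each query against every memory entry
    # and treat the gold entry as the positive class in a softmax.
    scores = query_emb @ memory_emb.t()              # (B, M)
    contrastive = F.cross_entropy(scores, pos_idx)

    # Generative term: standard token-level cross-entropy on the answer,
    # ignoring padding positions.
    generative = F.cross_entropy(
        gen_logits.reshape(-1, gen_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )
    return contrastive + generative
```

In this sketch the contrastive term trains the encoder to rank the relevant memory entry (image or text) above the rest, while the generative term trains the decoder to produce the answer conditioned on whatever was retrieved; how the paper weights or batches the two terms is not specified here.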