Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the additional textual evidence provided by the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, rather than solely for standard classification tasks.
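To make the described encoder-decoder setup concrete, the following is a minimal, hypothetical PyTorch sketch of a retrieval-augmented captioner. It is not the paper's implementation: the pretrained V&L BERT is stood in for by a plain Transformer encoder over concatenated image-region features and retrieved-caption token embeddings, the decoder cross-attends to its output, and all names (e.g. `RetrievalAugmentedCaptioner`), dimensions, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of retrieval-augmented captioning: a joint encoder over
# image regions plus retrieved captions, and a decoder that cross-attends to it.
import torch
import torch.nn as nn


class RetrievalAugmentedCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, image_feat_dim=2048):
        super().__init__()
        # Project precomputed image-region features into the model dimension.
        self.image_proj = nn.Linear(image_feat_dim, d_model)
        # Shared token embedding for retrieved captions and the target caption.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Stand-in for a pretrained V&L BERT: jointly encodes image regions
        # and retrieved-caption tokens as one multimodal sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Decoder attends to the multimodal encoder representations.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, retrieved_tokens, target_tokens):
        # image_feats: (B, R, image_feat_dim) image-region features
        # retrieved_tokens: (B, L_r) token ids of the k retrieved captions, concatenated
        # target_tokens: (B, L_t) token ids of the caption being generated
        visual = self.image_proj(image_feats)          # (B, R, d_model)
        textual = self.token_emb(retrieved_tokens)     # (B, L_r, d_model)
        memory = self.encoder(torch.cat([visual, textual], dim=1))
        tgt = self.token_emb(target_tokens)
        # Causal mask so each target position only attends to earlier positions.
        L = target_tokens.size(1)
        causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)                    # (B, L_t, vocab_size)


if __name__ == "__main__":
    model = RetrievalAugmentedCaptioner()
    logits = model(
        image_feats=torch.randn(2, 36, 2048),                    # 36 regions per image
        retrieved_tokens=torch.randint(0, 10000, (2, 5 * 20)),   # k=5 captions, ~20 tokens each
        target_tokens=torch.randint(0, 10000, (2, 20)),
    )
    print(logits.shape)  # torch.Size([2, 20, 10000])
```

In this sketch the retrieved captions simply extend the encoder's input sequence, so the decoder's cross-attention can draw on both visual and retrieved textual evidence; swapping the stand-in encoder for an actual pretrained V&L BERT would follow the same interface.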