Image captioning models aim to connect Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models, either by advancing visual feature extraction or by better modeling multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, through which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer that predicts tokens based on the past context and on text retrieved from the external memory. Experimental results on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scales.
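To make the two ingredients named above concrete, here is a minimal sketch of how a visual-similarity kNN retriever could feed a kNN-augmented attention layer. This is not the paper's exact formulation: the cosine-similarity retrieval, the gated mixing of a context stream and a memory stream, and all names (knn_retrieve, KNNAugmentedAttention, gate) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def knn_retrieve(query_feat, memory_keys, memory_values, k=4):
    """Retrieve the k memory entries whose visual keys are most similar
    to the query image feature (cosine similarity, an assumed metric)."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), memory_keys, dim=-1)
    topk = sims.topk(k).indices
    return memory_values[topk]  # encoded caption tokens of the k neighbors

class KNNAugmentedAttention(torch.nn.Module):
    """Hypothetical layer: attends over the past context and over tokens
    retrieved from the external memory, then mixes the two streams with
    a learned gate. The paper's actual layer may differ."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.self_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = torch.nn.Linear(d_model, 1)

    def forward(self, x, retrieved):
        # Attention over the past context (causal mask omitted for brevity).
        ctx, _ = self.self_attn(x, x, x)
        # Attention over the retrieved caption tokens.
        mem, _ = self.mem_attn(x, retrieved, retrieved)
        # Token-wise gate decides how much to rely on the external memory.
        g = torch.sigmoid(self.gate(x))
        return g * ctx + (1 - g) * mem

# Toy usage with random tensors (all sizes are arbitrary).
d = 64
mem_keys = torch.randn(100, d)       # visual keys of the external corpus
mem_vals = torch.randn(100, 12, d)   # encoded caption tokens per entry
img_feat = torch.randn(d)            # global feature of the input image

retrieved = knn_retrieve(img_feat, mem_keys, mem_vals, k=4)  # (4, 12, d)
retrieved = retrieved.reshape(1, -1, d)  # flatten neighbors into one memory sequence

layer = KNNAugmentedAttention(d)
x = torch.randn(1, 5, d)             # decoder states for the past context
out = layer(x, retrieved)            # (1, 5, d)
```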