Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in the scene and their relations. Humans, by contrast, interpret images by integrating several sources of prior knowledge of the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. To this end, we focus on the captioning of images used to illustrate news articles. We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image. Our model is able to selectively draw information from the article guided by visual cues, and to dynamically extend the output dictionary to out-of-vocabulary named entities that appear in the context source. Furthermore, we introduce `GoodNews', the largest news image captioning dataset in the literature, and demonstrate state-of-the-art results.
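The abstract does not spell out the mechanism behind "dynamically extend the output dictionary", so the following is only a minimal sketch of one generic way such an extension can work: a pointer-generator-style decoding step that blends the decoder's softmax over a fixed base vocabulary with an attention-based copy distribution over the article's tokens, letting out-of-vocabulary named entities from the article be emitted directly. All names, vocabularies, and probability values here are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def extended_vocab_step(p_vocab, attn, article_tokens, base_vocab, p_gen):
    """One pointer-generator-style decoding step (illustrative sketch).

    Blends a base-vocabulary distribution with an attention-weighted copy
    distribution over article tokens, so out-of-vocabulary named entities
    appearing in the article can be produced in the caption.
    """
    # Extended vocabulary = base vocab + article-only (OOV) tokens.
    oov = [t for t in dict.fromkeys(article_tokens) if t not in base_vocab]
    ext_vocab = list(base_vocab) + oov
    idx = {t: i for i, t in enumerate(ext_vocab)}

    p_ext = np.zeros(len(ext_vocab))
    p_ext[: len(base_vocab)] = p_gen * p_vocab        # generate from base vocab
    for a, tok in zip(attn, article_tokens):          # copy from the article
        p_ext[idx[tok]] += (1.0 - p_gen) * a
    return ext_vocab, p_ext

# Hypothetical inputs: a tiny base vocabulary and a five-token article span.
base_vocab = ["<unk>", "the", "senator", "speaks", "in"]
article = ["Senator", "Warren", "speaks", "in", "Boston"]
p_vocab = np.array([0.05, 0.30, 0.25, 0.25, 0.15])    # decoder softmax (sums to 1)
attn = np.array([0.05, 0.55, 0.05, 0.05, 0.30])       # attention over article (sums to 1)

vocab, p = extended_vocab_step(p_vocab, attn, article, base_vocab, p_gen=0.6)
print(vocab[int(np.argmax(p))])  # emits the OOV entity "Warren" copied from the article
```

Under this reading, the visual cues would drive the attention weights over the article, and the generation/copy trade-off `p_gen` would control when the model falls back to named entities from the context source.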