Modern image captioning systems rely heavily on extracting knowledge from images to capture the concept of a static story. In this paper, we propose a textual visual context dataset for captioning, in which the publicly available dataset COCO Captions (Lin et al., 2014) has been extended with information about the scene (such as objects in the image). Since this information has a textual form, it can be used to incorporate any NLP task, such as text similarity or semantic relatedness methods, into captioning systems, either as an end-to-end training strategy or as a post-processing based approach.
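As a minimal sketch of the post-processing idea, the snippet below re-ranks candidate captions by blending the captioner's own score with a simple lexical-overlap similarity against the textual visual context (e.g. detected object labels). All function names, field names, and the scoring formula here are illustrative assumptions, not the paper's actual method; a real system would use a stronger text-similarity or semantic-relatedness model.

```python
# Hypothetical post-processing sketch: re-rank candidate captions using
# the textual visual context (e.g. object labels detected in the image).
# Names and the blending formula are illustrative, not from the paper.

def context_similarity(caption, context_terms):
    """Fraction of context terms appearing in the caption: a crude stand-in
    for a text-similarity or semantic-relatedness score."""
    tokens = set(caption.lower().split())
    if not context_terms:
        return 0.0
    return sum(term in tokens for term in context_terms) / len(context_terms)

def rerank(candidates, context_terms, alpha=0.5):
    """Blend the caption model's confidence with context similarity
    (alpha weights the model score vs. the visual-context score)."""
    return sorted(
        candidates,
        key=lambda c: alpha * c["score"]
        + (1 - alpha) * context_similarity(c["text"], context_terms),
        reverse=True,
    )

# Toy example: the context favors the caption mentioning the detected objects.
candidates = [
    {"text": "a man riding a horse", "score": 0.60},
    {"text": "a person standing in a field", "score": 0.62},
]
context = ["horse", "man"]  # textual visual context for this image
best = rerank(candidates, context)[0]["text"]
```

With the context terms "horse" and "man", the first caption overlaps fully and overtakes the slightly higher-scoring second candidate, illustrating how textual scene information can correct the captioner's ranking.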