Entity-aware image captioning aims to describe the named entities and events related to an image by leveraging the background knowledge in the associated article. The task remains challenging because the long-tail distribution of named entities makes it difficult to learn associations between entities and visual cues, and the complexity of the article makes it hard to extract the fine-grained relationships between entities needed to generate informative event descriptions. To tackle these challenges, we propose a novel approach that constructs a multi-modal knowledge graph to associate visual objects with named entities while simultaneously capturing the relationships between entities, with the help of external knowledge collected from the web. Specifically, we build a text sub-graph by extracting named entities and their relationships from the article, and an image sub-graph by detecting the objects in the image. To connect the two sub-graphs, we propose a cross-modal entity matching module trained on a knowledge base that pairs Wikipedia entries with their corresponding images. Finally, the multi-modal knowledge graph is integrated into the captioning model via a graph attention mechanism. Extensive experiments on the GoodNews and NYTimes800k datasets demonstrate the effectiveness of our method.
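To make the described pipeline concrete, below is a minimal sketch of constructing such a multi-modal knowledge graph: a text sub-graph of entities and relations, an image sub-graph of detected objects, and cross-modal links between them. The entity and object lists, the random stand-in embeddings, the `networkx` representation, and the 0.5 matching threshold are all illustrative assumptions; the paper instead learns the matching from a Wikipedia image-entity knowledge base.

```python
# Sketch only: toy inputs and a cosine-similarity matcher stand in for the
# paper's extractors and learned cross-modal entity matching module.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Text sub-graph: named entities and their pairwise relations from the article
# (hypothetical examples; the paper extracts these with NLP tools).
entities = ["Angela Merkel", "Germany", "Berlin"]
relations = [("Angela Merkel", "Germany", "chancellor_of"),
             ("Berlin", "Germany", "capital_of")]

# Image sub-graph: objects detected in the image (hypothetical detector labels).
objects = ["person_0", "building_1"]

G = nx.Graph()
for e in entities:
    G.add_node(e, modality="text")
for head, tail, rel in relations:
    G.add_edge(head, tail, relation=rel)
for o in objects:
    G.add_node(o, modality="image")

# Cross-modal entity matching: link an object to an entity when their feature
# embeddings are similar enough. Random vectors stand in for features that the
# paper learns from Wikipedia entries paired with images.
emb = {n: rng.standard_normal(128) for n in G.nodes}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

THRESHOLD = 0.5  # assumed cut-off; the actual module is trained, not thresholded
for o in objects:
    for e in entities:
        score = cosine(emb[o], emb[e])
        if score > THRESHOLD:
            G.add_edge(o, e, relation="cross_modal_match", score=score)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

The resulting graph would then be fed to the captioning model, where a graph attention mechanism weighs each entity and object node when generating the caption.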