Most current image captioning systems focus on describing general image content and lack the background knowledge needed to understand images in depth, such as exact named entities or concrete events. In this work, we focus on the entity-aware news image captioning task, which aims to generate informative captions by leveraging the associated news article to provide background knowledge about the target image. However, due to the length of news articles, previous works only employ them at the coarse article or sentence level, which is not fine-grained enough to capture relevant events and select named entities accurately. To overcome these limitations, we propose the Information Concentrated Entity-aware news image CAPtioning (ICECAP) model, which progressively concentrates on relevant textual information within the corresponding news article, from the sentence level down to the word level. Our model first performs coarse concentration on relevant sentences using a cross-modality retrieval model and then generates captions by further concentrating on relevant words within those sentences. Extensive experiments on both the BreakingNews and GoodNews datasets demonstrate the effectiveness of our proposed method, which outperforms other state-of-the-art approaches. The code of ICECAP is publicly available at https://github.com/HAWLYQ/ICECAP.
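To illustrate the coarse-to-fine concentration described above, the following is a minimal sketch of the two-stage pipeline: cross-modal retrieval over article sentences, followed by word-level attention during caption generation. All function names, feature shapes, and the similarity/attention formulations here are illustrative assumptions, not the authors' actual ICECAP implementation.

```python
# Hedged sketch of sentence-level then word-level concentration.
# Names and shapes are hypothetical stand-ins, not the released ICECAP code.
import numpy as np

def retrieve_relevant_sentences(image_feat, sentence_feats, top_k=3):
    """Coarse concentration: rank article sentences by cosine similarity
    to the image feature and keep the top-k (stand-in for the
    cross-modality retrieval model)."""
    sims = sentence_feats @ image_feat / (
        np.linalg.norm(sentence_feats, axis=1) * np.linalg.norm(image_feat) + 1e-8
    )
    return np.argsort(-sims)[:top_k]

def word_attention(decoder_state, word_feats):
    """Fine concentration: softmax attention over the words of the retrieved
    sentences, computed at each decoding step (stand-in for the captioner)."""
    scores = word_feats @ decoder_state
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feat = rng.normal(size=512)            # global image feature
    sentence_feats = rng.normal(size=(30, 512))  # one vector per article sentence
    picked = retrieve_relevant_sentences(image_feat, sentence_feats)
    word_feats = rng.normal(size=(40, 512))      # words within the picked sentences
    attn = word_attention(rng.normal(size=512), word_feats)
    print(picked, attn.shape)
```

In this sketch the retrieved sentence indices would select which words feed the decoder, and the attention weights would bias the decoder toward named entities in those sentences at each generation step.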