We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events are of paramount importance. Our proposed method effectively combines visual and textual features to generate captions with richer information, such as events and entities. More specifically, built upon the Transformer architecture, our model is further equipped with novel multi-modal feature fusion techniques and attention mechanisms designed to generate named entities more accurately. Our method uses far fewer parameters while achieving slightly better prediction results than competing methods. Our larger and more diverse Visual News dataset further highlights the remaining challenges in captioning news images.
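The abstract does not specify the fusion mechanism in detail; as an illustration only, the snippet below sketches one common way to combine visual and textual features with cross-attention, where text tokens attend over image-region features and the attended visual context is concatenated back onto each token. All function and variable names here are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_features(textual, visual, w_q, w_k, w_v):
    """Hypothetical cross-attention fusion sketch (not the paper's exact model).

    textual: (n_tokens, d) text-token features
    visual:  (n_regions, d) image-region features
    w_q, w_k, w_v: (d, d) learned projection matrices
    Returns (n_tokens, 2*d): each token concatenated with its visual context.
    """
    q = textual @ w_q                                # queries from text
    k = visual @ w_k                                 # keys from image regions
    v = visual @ w_v                                 # values from image regions
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_tokens, n_regions)
    context = attn @ v                               # attended visual context
    return np.concatenate([textual, context], axis=-1)

# Toy usage with random features
rng = np.random.default_rng(0)
d, n_tokens, n_regions = 8, 5, 10
fused = fuse_features(
    rng.standard_normal((n_tokens, d)),
    rng.standard_normal((n_regions, d)),
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
)
```

In a full captioning model, the fused features would feed a Transformer decoder that generates the caption token by token.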