Current image captioning approaches generate descriptions that lack specific information, such as the named entities involved in the images. In this paper we propose a new task that aims to generate informative image captions given images and hashtags as input. We propose a simple but effective approach: we first train a CNN-LSTM model to generate a template caption from the input image, then use a knowledge-graph-based collective inference algorithm to fill in the template with specific named entities retrieved via the hashtags. Experiments on a new benchmark dataset collected from Flickr show that our model generates news-style image descriptions with much richer information. On this benchmark, the METEOR score of our model nearly triples that of the baseline image captioning model, from 4.8 to 13.6.
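To make the second stage concrete, the sketch below shows one simple way a template caption's typed entity slots could be filled with entities retrieved via hashtags. The slot syntax, candidate lists, scores, and per-slot greedy selection are all illustrative assumptions; the paper's actual method performs collective inference over a knowledge graph rather than scoring each slot independently.

```python
import re

def fill_template(template, candidates):
    """Replace each <TYPE> slot with the highest-scoring candidate
    of that type. A collective-inference system would instead score
    entity combinations jointly using knowledge-graph relations."""
    def best(match):
        etype = match.group(1)
        pool = candidates.get(etype, [])
        if not pool:
            return match.group(0)  # leave the slot unfilled
        # Pick the candidate with the highest relevance score.
        return max(pool, key=lambda e: e[1])[0]
    return re.sub(r"<([A-Z]+)>", best, template)

# Toy candidates retrieved via hashtags, with made-up relevance scores.
candidates = {
    "PERSON": [("Angela Merkel", 0.9), ("a tourist", 0.2)],
    "LOCATION": [("Berlin", 0.8)],
}
caption = fill_template("<PERSON> speaks at a rally in <LOCATION>.", candidates)
print(caption)  # → Angela Merkel speaks at a rally in Berlin.
```

A jointly scored version would additionally reward entity pairs that are linked in the knowledge graph (e.g. a person and the city where an event involving them took place), which is what makes the inference "collective."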