Image captioning has recently been gaining a lot of attention thanks to the impressive achievements of deep captioning architectures, which combine Convolutional Neural Networks, to extract image representations, with Recurrent Neural Networks, to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Even though saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, research is still struggling to incorporate these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We show, through extensive quantitative and qualitative experiments on large-scale datasets, that our model achieves superior performance with respect to captioning baselines, both with and without saliency, and to different state-of-the-art approaches combining saliency and captioning.
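To make the core idea concrete, the following is a minimal PyTorch sketch of one plausible way to condition a decoder's attention on an external saliency map, as the abstract describes; it is not the authors' actual architecture, and the module name, dimensions, and the log-bias trick for splitting attention into salient and contextual paths are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyContextAttention(nn.Module):
    """Hypothetical sketch: the decoder attends separately to salient and
    contextual image regions, using per-region saliency scores produced
    by an external saliency prediction model."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def _attend(self, feats, hidden, region_weights):
        # feats: (B, N, feat_dim) CNN region features
        # hidden: (B, hidden_dim) current RNN decoder state
        # region_weights: (B, N) saliency (or 1 - saliency) per region
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                   # (B, N) raw scores
        # Bias the attention distribution toward the chosen region set
        # by adding log-saliency before the softmax (an assumed mechanism).
        alpha = F.softmax(e + torch.log(region_weights + 1e-8), dim=1)
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)

    def forward(self, feats, hidden, saliency):
        # saliency: (B, N) scores in [0, 1] from the saliency predictor
        salient_ctx = self._attend(feats, hidden, saliency)
        context_ctx = self._attend(feats, hidden, 1.0 - saliency)
        # Both context vectors are fed to the language model at each step.
        return torch.cat([salient_ctx, context_ctx], dim=-1)
```

Under this reading, the saliency predictor softly partitions the image grid into salient and contextual regions, and the caption generator receives a separate attended context vector for each at every decoding step.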