The use of attention models for automated image captioning has enabled many systems to produce accurate and meaningful descriptions of images. Over the years, many novel approaches have been proposed to enhance the attention process using different feature representations. In this paper, we extend this line of work with a guided attention network mechanism that exploits the relationship between the visual scene and text descriptions using spatial features from the image, high-level information from the topics, and temporal context from caption generation, all embedded together in an ordered embedding space. This embedding space is trained with a pairwise ranking objective, which encourages related images, topics, and captions in the shared semantic space to maintain a partial order in the visual-semantic hierarchy and thus helps the model produce more visually accurate captions. Experimental results on the MSCOCO dataset show that our approach is competitive with many state-of-the-art models on various evaluation metrics.
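For concreteness, the following is a minimal sketch of the order-violation penalty and max-margin pairwise ranking objective commonly used to train such visual-semantic order embeddings; the notation (non-negative embeddings $u, v \in \mathbb{R}^{N}_{\geq 0}$, a set of ground-truth pairs $P$, a set of contrastive negative pairs $N$, and a margin $\alpha$) is assumed here for illustration, and the exact loss used in this work may differ:
\[
E(u, v) = \big\lVert \max(0,\, v - u) \big\rVert^{2},
\]
\[
\mathcal{L} = \sum_{(u,v)\in P} E(u, v) \;+\; \sum_{(u',v')\in N} \max\big(0,\; \alpha - E(u', v')\big),
\]
where $E(u, v) = 0$ exactly when every coordinate of $v$ is no larger than the corresponding coordinate of $u$, so matched pairs are driven to satisfy the partial order while mismatched pairs are pushed to violate it by at least the margin $\alpha$.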