The aim of image captioning is to automatically generate captions that describe image contents. Despite many efforts, generating discriminative captions for images remains non-trivial. Most conventional approaches imitate language structure patterns and thus tend to fall into a stereotype of replicating frequent phrases or sentences, neglecting the unique aspects of each image. In this work, we propose an image captioning framework with a self-retrieval module as training guidance, which encourages the model to generate discriminative captions. It brings unique advantages: (1) the self-retrieval guidance acts as a metric and an evaluator of caption discriminativeness, assuring the quality of generated captions; (2) the correspondence between generated captions and images is naturally incorporated in the generation process without human annotations, so our approach can exploit a large number of unlabeled images to boost captioning performance without additional laborious annotation. We demonstrate the effectiveness of the proposed retrieval-guided method on the COCO and Flickr30k captioning datasets, and show that it achieves superior captioning performance with more discriminative captions.
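To make the self-retrieval guidance concrete, the following is a minimal sketch (not the authors' released code) of one way such a reward could be computed: each generated caption is used as a query to retrieve its own source image from the mini-batch, and a margin-based retrieval score serves as the discriminativeness signal. The encoders, the margin value, and the idea of mixing this reward into a self-critical/REINFORCE objective are illustrative assumptions, not details taken from the abstract.

```python
import torch
import torch.nn.functional as F


def self_retrieval_reward(caption_emb: torch.Tensor,
                          image_emb: torch.Tensor,
                          margin: float = 0.1) -> torch.Tensor:
    """caption_emb, image_emb: (batch, dim) embeddings in a joint space.

    Returns a per-sample reward that is positive when a caption ranks its
    own image above all other images in the batch by at least `margin`.
    """
    # Cosine similarity between every generated caption and every image.
    cap = F.normalize(caption_emb, dim=-1)
    img = F.normalize(image_emb, dim=-1)
    sim = cap @ img.t()                                  # (batch, batch)

    pos = sim.diag()                                     # paired image score
    # Mask the positive pair and keep the hardest distractor per caption.
    neg = sim - torch.eye(sim.size(0), device=sim.device) * 1e9
    hardest_neg = neg.max(dim=1).values

    # Margin-based discriminativeness reward.
    return pos - hardest_neg - margin


if __name__ == "__main__":
    # Toy usage: in training, this reward would typically be combined with
    # a caption-quality reward (e.g. CIDEr) inside a policy-gradient loop.
    cap_emb = torch.randn(8, 512)
    img_emb = torch.randn(8, 512)
    print(self_retrieval_reward(cap_emb, img_emb).shape)  # torch.Size([8])
```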