The aim of image captioning is to generate captions by machine that describe image contents as humans do. Despite many efforts, generating discriminative captions for images remains non-trivial. Most traditional approaches imitate language structure patterns and thus tend to fall into a stereotype of replicating frequent phrases or sentences, neglecting the unique aspects of each image. In this work, we propose an image captioning framework with a self-retrieval module as training guidance, which encourages the generation of discriminative captions. It brings unique advantages: (1) the self-retrieval guidance acts as a metric and an evaluator of caption discriminativeness, assuring the quality of generated captions; (2) the correspondence between generated captions and images is naturally incorporated into the generation process without human annotations, so our approach can exploit a large amount of unlabeled images to boost captioning performance without additional laborious annotation. We demonstrate the effectiveness of the proposed retrieval-guided method on the MS-COCO and Flickr30k captioning datasets, and show its superior captioning performance with more discriminative captions.
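To make the self-retrieval idea concrete, below is a minimal sketch (not the paper's implementation) of how a retrieval-based discriminativeness signal could be computed: each generated caption is used as a query to retrieve its own image from a batch, and the retrieval probability of the correct image serves as a reward. The encoder outputs (`caption_emb`, `image_emb`) are hypothetical stand-ins; in practice a learned joint text-image embedding would supply them.

```python
# A minimal sketch of a self-retrieval reward, assuming a joint
# text-image embedding space (the inputs here are hypothetical stand-ins).
import torch
import torch.nn.functional as F

def self_retrieval_reward(caption_emb: torch.Tensor,
                          image_emb: torch.Tensor) -> torch.Tensor:
    """Reward each generated caption by how well it retrieves its own image.

    caption_emb: (B, D) embeddings of the B generated captions.
    image_emb:   (B, D) embeddings of the corresponding B images
                 (row i of each tensor belongs to the same image).
    Returns a (B,) reward: the probability assigned to the correct image
    when the caption is used as a retrieval query against the whole batch.
    """
    cap = F.normalize(caption_emb, dim=-1)
    img = F.normalize(image_emb, dim=-1)
    sim = cap @ img.t()              # (B, B) caption-to-image similarities
    probs = sim.softmax(dim=-1)      # retrieval distribution per caption
    return probs.diag()              # mass placed on the caption's own image

# Toy usage: random embeddings standing in for real encoder outputs.
B, D = 8, 256
rewards = self_retrieval_reward(torch.randn(B, D), torch.randn(B, D))
print(rewards.shape)  # torch.Size([8])
```

Such a reward could be combined with standard caption-quality objectives during training; unlabeled images fit naturally, since computing it requires only the image and the caption the model itself generates, not ground-truth annotations.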