One property that remains lacking in image captions generated by contemporary methods is discriminability: the ability to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation. By incorporating into the captioning training objective a loss component directly tied to a machine's ability to disambiguate image/caption matches, we obtain systems that produce much more discriminative captions, according to human evaluation. Remarkably, our approach also improves other aspects of the generated captions, as reflected by a battery of standard scores such as BLEU and SPICE. Our approach is modular and can be applied to a variety of model/loss combinations commonly proposed for image captioning.
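A minimal sketch of how such a combined objective could look, assuming a batch of paired image/caption embeddings; this is an illustration, not the paper's actual implementation. The names `discriminability_loss`, `total_loss`, the weight `lambda_disc`, and the use of a hard-negative contrastive retrieval loss as the discriminability term are all assumptions introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def discriminability_loss(img_emb, cap_emb, margin=0.2):
    """Contrastive retrieval loss (assumed form of the discriminability term):
    a caption should match its own image better than any other image in the
    batch, and vice versa.
    img_emb, cap_emb: (B, D) L2-normalized embeddings of matched pairs."""
    scores = cap_emb @ img_emb.t()                 # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)               # matched-pair scores, (B, 1)
    # Hinge losses against the hardest negative in each direction.
    cost_cap = (margin + scores - pos).clamp(min=0)      # caption vs. wrong images
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # image vs. wrong captions
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_cap.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()

def total_loss(caption_logits, targets, img_emb, cap_emb, lambda_disc=1.0):
    """Standard cross-entropy captioning loss plus the weighted
    discriminability term (the modular combination the abstract describes)."""
    # caption_logits: (B, T, V); targets: (B, T); index 0 assumed to be padding.
    xe = F.cross_entropy(caption_logits.flatten(0, 1), targets.flatten(),
                         ignore_index=0)
    return xe + lambda_disc * discriminability_loss(img_emb, cap_emb)
```

Because the discriminability term only consumes embeddings and is added to the base loss with a scalar weight, it can be bolted onto different captioning models and base objectives, which is the modularity the abstract claims.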