Recently, state-of-the-art models for image captioning have surpassed human performance on the most popular metrics, such as BLEU, METEOR, ROUGE, and CIDEr. Does this mean we have solved the task of image captioning? The above metrics only measure the similarity of the generated caption to the human annotations, which reflects its accuracy. However, an image contains many concepts and multiple levels of detail, so there is a wide variety of valid captions expressing different concepts and details that may interest different people. Therefore, evaluating accuracy alone is not sufficient for measuring the performance of captioning models; the diversity of the generated captions should also be considered. In this paper, we propose a new metric for measuring the diversity of image captions, which is derived from latent semantic analysis and kernelized to use CIDEr similarity. We conduct extensive experiments to re-evaluate recent captioning models in terms of both diversity and accuracy. We find that there is still a large gap between model and human performance on both accuracy and diversity, and that models optimized for accuracy (CIDEr) tend to have low diversity. We also show that balancing the cross-entropy loss and the CIDEr reward in reinforcement learning during training can effectively control the trade-off between the diversity and accuracy of the generated captions.
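To make the kernelized-LSA idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes the diversity of the m captions generated for one image is read off the eigenvalue spectrum of their pairwise CIDEr similarity (kernel) matrix. The helper names `cider_sim` and `diversity_score`, the Jaccard stand-in used in place of a real CIDEr similarity, and the log-based normalization are all assumptions introduced purely for illustration.

```python
# Hypothetical sketch of an eigenvalue-based caption-diversity score built on a
# pairwise-similarity kernel matrix; the real metric uses CIDEr similarity.
import numpy as np

def cider_sim(cap_a: str, cap_b: str) -> float:
    """Placeholder pairwise similarity in [0, 1]; a real setup would compute
    CIDEr between the two captions. Jaccard word overlap is a crude proxy."""
    a, b = set(cap_a.lower().split()), set(cap_b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def diversity_score(captions: list[str]) -> float:
    """Build the m x m kernel matrix K of pairwise similarities and score
    diversity from its eigenvalue spread: ~0 when all captions are identical
    (one dominant eigenvalue), ~1 when they are mutually dissimilar (flat
    spectrum). The exact normalization here is an assumption."""
    m = len(captions)
    K = np.array([[cider_sim(ci, cj) for cj in captions] for ci in captions])
    eigvals = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    ratio = eigvals.max() / eigvals.sum()   # ranges from 1/m (diverse) to 1 (identical)
    return -np.log(ratio) / np.log(m)       # rescale to [0, 1]

captions = [
    "a man riding a horse on the beach",
    "a person rides a horse along the shore",
    "two dogs play with a frisbee in the park",
]
print(f"diversity ~ {diversity_score(captions):.3f}")
```

Under this normalization, a set of identical captions scores near 0 and a set of mutually dissimilar captions scores near 1, so the score complements, rather than replaces, accuracy metrics such as CIDEr.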