Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.
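The core quantity described above can be illustrated with a minimal sketch. The function name, object lists, and exact normalization below are illustrative assumptions, not the paper's actual metric definition: the idea is simply to compare the objects a caption mentions against veridical labels of objects present in the image, and report the fraction that are hallucinated.

```python
# Hypothetical sketch of an image-relevance metric for object hallucination.
# Assumes we already have (a) the set of object labels actually present in
# the image and (b) the object words extracted from a generated caption.

def hallucination_rate(mentioned_objects, image_objects):
    """Fraction of caption-mentioned objects that are absent from the image."""
    mentioned = set(mentioned_objects)
    if not mentioned:
        return 0.0  # caption names no objects, so nothing can be hallucinated
    hallucinated = mentioned - set(image_objects)
    return len(hallucinated) / len(mentioned)

# Example: the caption mentions a dog that is not in the scene.
caption_objects = ["person", "surfboard", "dog"]
scene_objects = ["person", "surfboard", "wave"]
rate = hallucination_rate(caption_objects, scene_objects)  # 1/3 hallucinated
```

Averaging such a per-caption rate over a dataset gives a corpus-level hallucination score that can be compared across model architectures and learning objectives.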