We establish a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions that trade off against each other (precision and recall), as well as other aspects that measure text quality (fluency, conciseness, and inclusive language). Our evaluations reveal several critical problems with current evaluation practice. Human-generated captions are of substantially higher quality than machine-generated ones, especially in their coverage of salient information (i.e., recall), whereas all automatic metrics suggest the opposite. Our rubric-based results also show that CLIPScore, a recent metric that uses image features, correlates better with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.
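To make the comparison with CLIPScore concrete, the sketch below shows one way to compute a reference-free CLIPScore for a single image–caption pair, following the original formulation w · max(cos(image, text), 0) with w = 2.5. The Hugging Face `transformers` CLIP checkpoint used here is an assumption for illustration and is not specified in this abstract.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; any CLIP variant exposing projected
# image/text embeddings works the same way.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clipscore(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Reference-free CLIPScore: w * max(cos(image_emb, text_emb), 0)."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking the cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return w * max(cos, 0.0)

# Example usage:
# score = clipscore(Image.open("example.jpg"), "a dog playing fetch in a park")
```

Because the score is driven by image–text similarity rather than n-gram overlap with reference captions, it can reward captions that cover image content absent from the references, which is consistent with the recall sensitivity noted above.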