We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.