Human ratings are the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. Analyzing this standard protocol through the lens of utility theory in economics, we identify the implicit assumptions it makes about annotators. These assumptions are often violated in practice, in which case annotator ratings cease to reflect their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new human evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA). When human evaluation of stories is done with SPA, we can recover the ordering of GPT-3 models by size, with statistically significant results. However, when human evaluation is done with the standard protocol, less than half of the expected preferences can be recovered (e.g., there is no significant difference between $\texttt{curie}$ and $\texttt{davinci}$, despite using a highly powered test).
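The claim that Likert-scale ratings can reverse the true preference direction can be illustrated with a small simulation. The sketch below is not from the paper; it assumes a simple censoring model in which each rating is a monotone, clipped mapping of a latent utility, and the systems, probabilities, and utility values are hypothetical.

```python
# Illustrative sketch (not from the paper): how a bounded Likert scale can
# reverse a preference that exists in latent utility space.
# Assumption: each rating is a monotone, censored mapping of a latent utility.

import numpy as np

rng = np.random.default_rng(0)

def likert(utility):
    """Map latent utility to a 1-5 rating by rounding and clipping (censoring)."""
    return np.clip(np.round(utility), 1, 5)

# Hypothetical System A: occasionally brilliant (utility far above the scale
# ceiling), otherwise poor. Hypothetical System B: consistently mediocre.
u_a = rng.choice([20.0, 1.0], p=[0.4, 0.6], size=10_000)
u_b = np.full(10_000, 4.0)

print("mean utility:", u_a.mean(), u_b.mean())                  # A > B
print("mean Likert :", likert(u_a).mean(), likert(u_b).mean())  # B > A
```

Under this toy model, System A has the higher expected utility (≈8.6 vs. 4.0), yet System B receives the higher mean Likert rating (4.0 vs. ≈2.6), because the bounded scale censors the utility of A's best outputs.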