Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker judgments improve when they are shown model-generated output alongside human-generated references, which enables the workers to better calibrate their ratings. Finally, interviews with the English teachers provide deeper insights into the challenges of the evaluation process, particularly when rating model-generated text.