While often assumed to be a gold standard, effective human evaluation of text generation remains an important, open area for research. We revisit this problem with a focus on producing consistent evaluations that are reproducible -- over time and across different populations. We study this goal in different stages of the human evaluation pipeline. In particular, we consider design choices for the annotation interface used to elicit human judgments and their impact on reproducibility. Furthermore, we develop an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators. Putting these lessons together, we introduce GENIE: a system for running standardized human evaluations across different generation tasks. We instantiate GENIE with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. For each task, GENIE offers a leaderboard that automatically crowdsources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency. We have made the GENIE leaderboards publicly available, and have already ranked 50 submissions from 10 different research groups. We hope GENIE encourages further progress toward effective, standardized evaluations for text generation.