GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

While often assumed a gold standard, effective human evaluation of text generation remains an important, open area for research. We revisit this problem with a focus on producing consistent evaluations that are reproducible -- over time and across different populations. We study this goal in different stages of the human evaluation pipeline. In particular, we consider design choices for the annotation interface used to elicit human judgments and their impact on reproducibility. Furthermore, we develop an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators. Putting these lessons together, we introduce GENIE: a system for running standardized human evaluations across different generation tasks. We instantiate GENIE with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. For each task, GENIE offers a leaderboard that automatically crowdsources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency. We have made the GENIE leaderboards publicly available, and have already ranked 50 submissions from 10 different research groups. We hope GENIE encourages further progress toward effective, standardized evaluations for text generation.
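The abstract does not say which probabilistic model is used to detect noisy annotators. As a rough illustration of the general idea only, the sketch below assumes a Beta-Binomial model over each annotator's accuracy on test questions with known answers, and flags annotators whose accuracy is very likely below a threshold. The function names, the uniform prior, and both thresholds are hypothetical choices made for this example, not details taken from GENIE.

```python
import math


def beta_cdf(x, a, b, steps=2000):
    """Approximate the Beta(a, b) CDF at x with the midpoint rule (assumes a, b >= 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    # Log of the normalizing constant 1 / B(a, b).
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint of the i-th sub-interval of (0, x)
        total += math.exp(
            log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)
        )
    return total * h


def noisy_annotators(results, min_accuracy=0.8, max_risk=0.9, prior=(1.0, 1.0)):
    """Return annotators whose accuracy is probably below min_accuracy.

    results maps an annotator id to (correct, total) counts on test questions.
    All parameters are illustrative assumptions, not values from the paper.
    """
    flagged = []
    for name, (correct, total) in results.items():
        # Posterior over accuracy: Beta(prior_a + correct, prior_b + wrong).
        a = prior[0] + correct
        b = prior[1] + (total - correct)
        # Posterior probability that the true accuracy is below the threshold.
        risk = beta_cdf(min_accuracy, a, b)
        if risk > max_risk:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    counts = {"ann_a": (19, 20), "ann_b": (10, 20), "ann_c": (2, 2)}
    print(noisy_annotators(counts))
```

Under these assumptions an annotator with only a few answers (such as `ann_c`) is not excluded, because the posterior is still too uncertain to conclude that their accuracy is low.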

