Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, has so far been limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard that brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms, asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency), and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.
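To make the comparison between human judgments and automatic metrics concrete, the sketch below is a minimal illustration, not GENIE's actual code: the scores, the 1-5 rating scale, and the choice of metric are all hypothetical assumptions. It correlates per-output human ratings on one evaluation axis with an automatic metric score for the same outputs.

```python
# Illustrative sketch only; GENIE's real pipeline and data formats may differ.
# Assumes each system output has a mean crowdworker rating on one axis
# (e.g., 1-5 "correctness") and a score from an automatic metric.
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for five system outputs.
human_scores = [4.2, 3.1, 4.8, 2.5, 3.9]        # mean annotator ratings
metric_scores = [0.61, 0.48, 0.72, 0.35, 0.55]  # e.g., ROUGE-L F1 per output

# Correlation quantifies how well the automatic metric tracks human judgment.
r, r_pval = pearsonr(human_scores, metric_scores)
rho, rho_pval = spearmanr(human_scores, metric_scores)
print(f"Pearson r = {r:.3f} (p = {r_pval:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_pval:.3f})")
```

A high rank correlation would suggest the automatic metric is a reasonable proxy for human evaluation on that axis; a low one would flag the metric as unreliable for that task.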