Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language -- functions that score system output given the context and/or human reference responses -- of critical importance. However, different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others. There is currently no simple, unified way to compare, analyse or evaluate metrics across a representative set of tasks. Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics), a resource that makes research into new metrics itself easier to evaluate. BEAMetrics users can quickly compare existing and new metrics with human judgements across a diverse set of tasks, quality dimensions (e.g., fluency, coherence, informativeness), and languages. As generation experts might predict, BEAMetrics reveals stark task-dependent differences between existing metrics, and consistently poor performance on tasks with complex answer spaces or high reliance on general knowledge. While this analysis highlights a critical issue facing current research practice, BEAMetrics also contributes to its resolution by facilitating research into better metrics -- particularly those that can account for the complex interaction between context and general knowledge inherent to many modern NLP applications. BEAMetrics is available under the MIT License: https://github.com/ThomasScialom/BEAMetrics
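As a minimal sketch of the kind of comparison described above (not the actual BEAMetrics API), the snippet below correlates a hypothetical automatic metric's scores with human judgements for one task and quality dimension; the function name `metric_fn` and the example fields are assumptions for illustration only.

```python
# Hypothetical illustration: correlate an automatic metric's scores with
# human judgements using Pearson and Spearman correlation.
from scipy.stats import pearsonr, spearmanr

def evaluate_metric(metric_fn, examples, human_scores):
    """metric_fn(prediction, references, context) -> float.

    `examples` is a list of dicts with keys "prediction", "references",
    and optionally "context"; `human_scores` is a parallel list of human
    ratings for the quality dimension under study.
    """
    metric_scores = [
        metric_fn(ex["prediction"], ex["references"], ex.get("context"))
        for ex in examples
    ]
    pearson, _ = pearsonr(metric_scores, human_scores)
    spearman, _ = spearmanr(metric_scores, human_scores)
    return {"pearson": pearson, "spearman": spearman}
```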