Natural language processing researchers have identified limitations in evaluation methodology for generation tasks, raising new questions about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to rely on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances in models and in metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation tasks and in the metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked by their correlations with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially reference-based ones, overrate machine generation relative to human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.
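The core Billboard mechanism, ranking metrics by correlation with human judgments and linearly combining a few of them into an ensemble, can be sketched as below. This is a hypothetical illustration, not the paper's exact procedure: the metric names, all scores, and the correlation-weighted averaging (standing in for the paper's fitted linear ensemble) are made up for the example.

```python
# Hypothetical sketch of a Billboard-style metric ensemble.
# All scores below are toy, made-up numbers; the weighting scheme is
# an illustrative stand-in for the paper's fitted linear combination.

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# System-level scores for five generators, one list per metric (toy values).
metric_scores = {
    "BLEU":      [0.31, 0.28, 0.35, 0.22, 0.30],
    "ROUGE":     [0.45, 0.40, 0.48, 0.33, 0.44],
    "BERTScore": [0.88, 0.84, 0.90, 0.79, 0.87],
}
human = [3.9, 3.5, 4.2, 2.8, 3.8]  # mean human rating per generator

# Rank metrics by correlation with human judgments (the metric leaderboard).
ranking = sorted(metric_scores,
                 key=lambda m: pearson(metric_scores[m], human),
                 reverse=True)

# Linearly combine the top two metrics, weighting each by its correlation.
top = ranking[:2]
weights = {m: pearson(metric_scores[m], human) for m in top}
total = sum(weights.values())
ensemble = [sum(weights[m] * metric_scores[m][i] for m in top) / total
            for i in range(len(human))]
```

The ensemble score list can then be used to rank the generators, while the correlation ranking orders the metrics themselves, giving the two dimensions of the leaderboard.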