Natural language processing researchers have identified limitations in evaluation methodology for generation tasks, raising new questions about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to rely on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances in models and in metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation tasks and in the metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked by their correlations with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially reference-based ones, overrate machine generation relative to human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.
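The core Billboard mechanism, ranking metrics by correlation with human judgments and linearly combining a few of them into an ensemble, can be sketched as below. This is a hypothetical illustration, not the paper's exact procedure: the metric names, all scores, and the correlation-weighted averaging (standing in for the paper's fitted linear ensemble) are made up for the example.

```python
# Hypothetical sketch of a Billboard-style metric ensemble.
# All scores below are toy, made-up numbers; the weighting scheme is
# an illustrative stand-in for the paper's fitted linear combination.

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# System-level scores for five generators, one list per metric (toy values).
metric_scores = {
    "BLEU":      [0.31, 0.28, 0.35, 0.22, 0.30],
    "ROUGE":     [0.45, 0.40, 0.48, 0.33, 0.44],
    "BERTScore": [0.88, 0.84, 0.90, 0.79, 0.87],
}
human = [3.9, 3.5, 4.2, 2.8, 3.8]  # mean human rating per generator

# Rank metrics by correlation with human judgments (the metric leaderboard).
ranking = sorted(metric_scores,
                 key=lambda m: pearson(metric_scores[m], human),
                 reverse=True)

# Linearly combine the top two metrics, weighting each by its correlation.
top = ranking[:2]
weights = {m: pearson(metric_scores[m], human) for m in top}
total = sum(weights.values())
ensemble = [sum(weights[m] * metric_scores[m][i] for m in top) / total
            for i in range(len(human))]
```

The ensemble score list can then be used to rank the generators, while the correlation ranking orders the metrics themselves, giving the two dimensions of the leaderboard.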