Assessing the quality of natural language generation systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and involve non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy for quality. In the last decade, many string-based metrics (e.g., BLEU) have been introduced. However, such metrics usually rely on exact matches and thus do not robustly handle synonyms. In this paper, we introduce InfoLM, a family of untrained metrics that can be viewed as string-based metrics addressing the aforementioned flaws thanks to a pre-trained masked language model. This family of metrics also makes use of information measures, allowing InfoLM to be adapted to various evaluation criteria. Using direct assessment, we demonstrate that InfoLM achieves statistically significant improvements, with correlation gains of over $10$ points in many configurations, on both summarization and data2text generation.
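For intuition, the following is a minimal sketch of how an InfoLM-style metric could be computed with a pre-trained masked language model: each position in the candidate and the reference is masked in turn, the model's predicted vocabulary distributions are averaged into a single distribution per text, and an information measure (KL divergence here) compares the two. The model name, the uniform averaging, and the choice of divergence are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of an InfoLM-style metric, assuming a BERT-style masked LM from
# Hugging Face transformers. Aggregation weights and the exact information
# measure are simplified for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


def bag_of_token_distribution(text: str) -> torch.Tensor:
    """Mask each position in turn and average the MLM's predicted
    vocabulary distributions into one distribution for the text."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    distributions = []
    # Skip the [CLS] (first) and [SEP] (last) special tokens.
    for pos in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        distributions.append(torch.softmax(logits, dim=-1))
    return torch.stack(distributions).mean(dim=0)


def kl_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-12) -> float:
    """KL(p || q), one example of an information measure between the two
    aggregated vocabulary distributions (lower means more similar)."""
    return float((p * ((p + eps) / (q + eps)).log()).sum())


candidate = "the cat sat on the mat"
reference = "a cat was sitting on the rug"
p = bag_of_token_distribution(candidate)
q = bag_of_token_distribution(reference)
print("InfoLM-style KL score:", kl_divergence(p, q))
```

Because the comparison happens between distributions predicted by the language model rather than between surface strings, a synonym substitution in the candidate shifts the aggregated distribution only slightly, which is what lets this family of metrics avoid the exact-match brittleness of BLEU-like scores.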