Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large-scale human rating data are already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks, including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgements from 0.154 to 0.195. SESCORE even achieves performance comparable to COMET, the best supervised metric, despite receiving no human-annotated training data.
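To make the pipeline and the meta-evaluation concrete, here is a minimal Python sketch. The perturbation rules, the `entail_prob` callable, and the minor/major weights (-1/-5, echoing MQM-style severity conventions) are illustrative assumptions, not the paper's exact implementation; the final step uses `scipy.stats.kendalltau`, the standard way to correlate metric scores with human ratings.

```python
import random
from scipy.stats import kendalltau

def synthesize_errors(text, num_steps=3, seed=0):
    """Iteratively apply plausible perturbations to raw text.
    The real pipeline uses richer, model-driven edits; word drops
    and swaps stand in for them here."""
    rng = random.Random(seed)
    tokens = text.split()
    for _ in range(num_steps):
        if len(tokens) > 2:
            if rng.random() < 0.5:
                tokens.pop(rng.randrange(len(tokens)))          # deletion error
            else:
                i = rng.randrange(len(tokens) - 1)
                tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]  # swap error
    return " ".join(tokens)

def severity_label(original, perturbed, entail_prob):
    """Simulate a human judgement with entailment: a perturbation that
    breaks entailment with the original counts as a major error (-5),
    otherwise as a minor one (-1)."""
    return -1 if entail_prob(perturbed, original) > 0.5 else -5

# Hypothetical stand-in for an NLI model's entailment probability.
dummy_entail = lambda premise, hypothesis: 0.9
noisy = synthesize_errors("the quick brown fox jumps over the lazy dog")
print(noisy, severity_label("the quick brown fox jumps over the lazy dog",
                            noisy, dummy_entail))

# Meta-evaluation: how well do a metric's scores track human ratings?
metric_scores = [0.71, 0.42, 0.88, 0.15, 0.60]  # toy numbers
human_ratings = [80, 45, 92, 20, 55]            # toy numbers
tau, _ = kendalltau(metric_scores, human_ratings)
print(f"Kendall tau: {tau:.3f}")
```

The severity weights mirror the intuition in the abstract: errors that alter meaning (entailment fails) should be penalized far more heavily than surface-level ones, and the resulting (perturbed text, cumulative score) pairs serve as synthetic training data in place of human annotations.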