Evaluation metrics are a key ingredient for progress of text generation systems. In recent years, several BERT-based evaluation metrics have been proposed (including BERTScore, MoverScore, and BLEURT) which correlate much better with human assessment of text generation quality than BLEU or ROUGE, invented two decades ago. However, little is known about what these metrics, which are based on black-box language model representations, actually capture (it is typically assumed that they model semantic similarity). In this work, we use a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap. We show that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE. This exposes limitations of these newly proposed metrics, which we also highlight in an adversarial test scenario.
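To make the idea of a regression-based global explainability technique concrete, the following is a minimal sketch, not the paper's exact procedure: it assumes that per-example metric scores and per-example linguistic-factor scores (semantics, syntax, morphology, lexical overlap) are already available as arrays (here filled with placeholder values), standardizes the factors, and reads the linear-regression coefficients as global importance weights.
\begin{verbatim}
# Sketch of a regression-based global explainability probe:
# regress metric scores onto linguistic-factor scores and interpret
# the standardized coefficients as global importance weights.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-example factor scores and metric scores; in practice
# these would be computed from the evaluated sentence pairs and the
# metric under analysis (e.g., BERTScore, MoverScore, BLEURT).
factors = ["semantics", "syntax", "morphology", "lexical_overlap"]
X = rng.random((500, len(factors)))                 # placeholder factor scores
y = (0.5 * X[:, 0] + 0.3 * X[:, 3]                  # placeholder metric scores
     + 0.05 * rng.standard_normal(500))

X_std = StandardScaler().fit_transform(X)           # make coefficients comparable
reg = LinearRegression().fit(X_std, y)

print(f"R^2 = {reg.score(X_std, y):.3f}")           # how much the factors explain
for name, coef in zip(factors, reg.coef_):
    print(f"{name:>15}: {coef:+.3f}")               # global weight per factor
\end{verbatim}
In such a probe, a large coefficient on the lexical-overlap factor (relative to the semantic factor) would indicate the kind of overlap sensitivity the abstract describes.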