Automatic evaluation remains an open research question in Natural Language Generation. In the context of Sentence Simplification, it is particularly challenging: the task inherently requires replacing complex words with simpler ones that share the same meaning. This limits the effectiveness of n-gram based metrics such as BLEU. Going hand in hand with recent advances in NLG, new metrics have been proposed, such as BERTScore for Machine Translation. In summarization, the QuestEval metric proposes to compare two texts automatically by asking questions about them. In this paper, we first propose a simple modification of QuestEval that allows it to tackle Sentence Simplification. We then extensively evaluate the correlations with respect to human judgment for several metrics, including the recent BERTScore and QuestEval, and show that the latter obtains state-of-the-art correlations, outperforming standard metrics such as BLEU and SARI. More importantly, we also show that a large part of the correlations are actually spurious for all the metrics. To investigate this phenomenon further, we release a new corpus of evaluated simplifications, this time written by humans rather than generated by systems. This allows us to remove the spurious correlations and to draw conclusions that differ markedly from the original ones, resulting in a better understanding of these metrics. In particular, we raise concerns about the very low correlations of most traditional metrics. Our results show that the only significant measure of Meaning Preservation is our adaptation of QuestEval.
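To make the meta-evaluation concrete, below is a minimal sketch (not the paper's actual code) of how correlations between an automatic metric and human judgments are typically computed. The lists `human_ratings` and `metric_scores` are hypothetical per-sentence values standing in for, e.g., human Meaning Preservation ratings and QuestEval or BERTScore outputs.

```python
# Hypothetical example: correlating metric scores with human judgments.
from scipy.stats import pearsonr, spearmanr

# Illustrative, made-up values (one entry per evaluated simplification).
human_ratings = [3.0, 4.5, 2.0, 5.0, 1.5]       # e.g. human Meaning Preservation ratings
metric_scores = [0.41, 0.78, 0.35, 0.90, 0.22]  # e.g. QuestEval or BERTScore scores

# Pearson (linear) and Spearman (rank) correlations with their p-values.
pearson, p_pearson = pearsonr(metric_scores, human_ratings)
spearman, p_spearman = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {pearson:.3f} (p = {p_pearson:.3f})")
print(f"Spearman rho = {spearman:.3f} (p = {p_spearman:.3f})")
```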